On Fri, Sep 24, 2010 at 12:31 PM, Eugene Loh <eugene.loh@oracle.com> wrote:
> It seems to me there are two extremes.
>
> One is that you replicate the data for each process. This has the
> disadvantage of consuming lots of memory "unnecessarily."
>
> Another extreme is that shared data is distributed over all processes. This
> has the disadvantage of making at least some of the data less accessible,
> whether in programming complexity and/or run-time performance.
>
> I'm not familiar with Global Arrays. I was somewhat familiar with HPF. I
> think the natural thing to do with those programming models is to distribute
> data over all processes, which may relieve the excessive memory consumption
> you're trying to address but which may also just put you at a different
> "extreme" of this spectrum.
>
> The middle ground I think might make most sense would be to share data only
> within a node, but to replicate the data for each node. There are probably
> multiple ways of doing this -- possibly even GA, I don't know. One way
> might be to use one MPI process per node, with OMP multithreading within
> each process|node. Or (and I thought this was the solution you were looking
> for), have some idea which processes are collocal. Have one process per
> node create and initialize some shared memory -- mmap, perhaps, or SysV
> shared memory. Then, have its peers map the same shared memory into their
> address spaces.
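
A minimal sketch of the "one process creates, its peers map" idea described above, assuming a POSIX system and a file-backed mmap. The names shared_alloc and shared_free, the is_creator flag, and the backing path are hypothetical and chosen only for illustration; SysV shared memory would be an equally valid choice.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map nbytes of shared memory backed by the file
 * at path. The creator creates and sizes the file; peers only attach.
 * Returns NULL on failure. */
void *shared_alloc(const char *path, size_t nbytes, int is_creator)
{
    assert(nbytes > 0);
    int flags = is_creator ? (O_RDWR | O_CREAT | O_TRUNC) : O_RDWR;
    int fd = open(path, flags, 0600);
    if (fd < 0)
        return NULL;
    if (is_creator && ftruncate(fd, (off_t)nbytes) != 0) {
        close(fd);
        return NULL;
    }
    /* MAP_SHARED makes writes visible to every process mapping the file. */
    void *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical counterpart: unmap, and let the creator remove the file. */
void shared_free(void *p, size_t nbytes, const char *path, int is_creator)
{
    munmap(p, nbytes);
    if (is_creator)
        unlink(path);
}
```

Peers should attach only after the creator has finished filling the region; in an MPI program, a barrier between the two steps is one way to enforce that ordering.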
>
> You asked what source code changes would be required. It depends. If
> you're going to mmap shared memory in on each node, you need to know which
> processes are collocal. If you're willing to constrain how processes are
> mapped to nodes, this could be easy. (E.g., "every 4 processes are
> collocal".) If you want to discover dynamically at run time which are
> collocal, it would be harder. The mmap stuff could be in a stand-alone
> function of about a dozen lines. If the shared area is allocated as one
> piece, substituting the single malloc() call with a call to your mmap
> function should be simple. If you have many malloc()s you're trying to
> replace, it's harder.
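
For the constrained case ("every 4 processes are collocal"), the bookkeeping could be as small as the sketch below. It assumes ranks are placed on nodes in consecutive blocks of four, which depends on how the job is launched; the function names and PROCS_PER_NODE are hypothetical.

```c
#include <assert.h>

/* Assumption: ranks are placed on nodes in consecutive blocks of
 * PROCS_PER_NODE, e.g. ranks 0-3 on node 0, ranks 4-7 on node 1. */
enum { PROCS_PER_NODE = 4 };

/* Index of the node that hosts this rank. */
int node_of_rank(int rank)
{
    assert(rank >= 0);
    return rank / PROCS_PER_NODE;
}

/* Position of this rank among the processes on its node. */
int local_rank(int rank)
{
    assert(rank >= 0);
    return rank % PROCS_PER_NODE;
}

/* The first rank on each node creates the shared area; the rest attach. */
int is_node_creator(int rank)
{
    return local_rank(rank) == 0;
}
```

With this layout, rank 4 would create and initialize the shared area on the second node, and ranks 5 through 7 would map it. If the launcher places ranks round-robin across nodes instead, the arithmetic above does not hold.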
>
> Andrei Fokau wrote:
>
> The data are read from a file and processed before calculations begin, so I
> think that mapping will not work in our case.
> Global Arrays look promising indeed. As I said, we need to put just part
> of the data in the shared section. John, do you (or maybe other users) have
> any experience working with GA?
>
> http://www.emsl.pnl.gov/docs/global/um/build.html
> When GA runs with MPI:
> MPI_Init(..) ! start MPI
> GA_Initialize() ! start global arrays
> MA_Init(..) ! start memory allocator
> .... do work
> GA_Terminate() ! tidy up global arrays
> MPI_Finalize() ! tidy up MPI
> ! exit program
> On Fri, Sep 24, 2010 at 13:44, Reuti <reuti@staff.uni-marburg.de> wrote:
>>
>> On 24.09.2010 at 13:26, John Hearns wrote:
>>
>> > On 24 September 2010 08:46, Andrei Fokau <andrei.fokau@neutron.kth.se> wrote:
>> >> We use a C program which consumes a lot of memory per process (up to
>> >> a few GB), 99% of the data being the same for each process. So for us
>> >> it would be quite reasonable to put that part of the data in shared
>> >> memory.
>> >
>> >
>> > http://www.emsl.pnl.gov/docs/global/
>> >
>> > Is this any help? Apologies if I'm talking through my hat.
>>
>> I was also thinking of this when I read "data in a shared memory" (besides
>> approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't
>> this also one idea behind "High Performance Fortran": running in parallel
>> across nodes without even knowing, while programming, that it's across
>> nodes at all, and accessing all data as if it were local?
>