It seems to me there are two extremes.
One is that you replicate the data for each process. This has the
disadvantage of consuming lots of memory "unnecessarily."
Another extreme is that shared data is distributed over all processes.
This has the disadvantage of making at least some of the data less
accessible, whether in programming complexity and/or run-time performance.
I'm not familiar with Global Arrays. I was somewhat familiar with
HPF. I think the natural thing to do with those programming models is
to distribute data over all processes, which may relieve the excessive
memory consumption you're trying to address but which may also just put
you at a different "extreme" of this spectrum.
The middle ground I think might make most sense would be to share data
only within a node, but to replicate the data for each node. There are
probably multiple ways of doing this -- possibly even GA, I don't
know. One way might be to use one MPI process per node, with OpenMP
multithreading within each process/node. Or (and I thought this was
the solution you were looking for), have some idea which processes are
collocal. Have one process per node create and initialize some shared
memory -- mmap, perhaps, or SysV shared memory. Then, have its peers
map the same shared memory into their address spaces.
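To make the last option a bit more concrete, here is a minimal sketch of what such a mapping function might look like using a file-backed mmap. The function name map_shared, the create flag, and the idea of passing a path are my own assumptions for illustration, not anything from your code. On Linux a path under /dev/shm is typically backed by RAM rather than disk.

```c
#include <fcntl.h>      /* open, O_CREAT, O_RDWR */
#include <stddef.h>     /* size_t, NULL */
#include <sys/mman.h>   /* mmap, munmap, MAP_SHARED */
#include <sys/types.h>  /* off_t */
#include <unistd.h>     /* ftruncate, close, unlink */

/* Hypothetical helper: map `size` bytes of the file at `path` as shared
 * memory. The one process per node that owns the data passes create=1;
 * its peers pass create=0 once the owner has finished. Returns NULL on
 * failure. */
void *map_shared(const char *path, size_t size, int create)
{
    int flags = create ? (O_CREAT | O_RDWR) : O_RDWR;
    int fd = open(path, flags, 0600);
    if (fd < 0)
        return NULL;

    /* Only the creating process sizes the backing file. */
    if (create && ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        unlink(path);
        return NULL;
    }

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

The creating process would fill the region (for example, with the data it read from the file and preprocessed), and the peers would call map_shared() with create=0 only after some synchronization point such as an MPI_Barrier. Peers must not ask for more bytes than the creator allocated, and someone should unlink() the file at the end.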
You asked what source code changes would be required. It depends. If
you're going to mmap shared memory in on each node, you need to know
which processes are collocal. If you're willing to constrain how
processes are mapped to nodes, this could be easy. (E.g., "every 4
processes are collocal".) If you want to discover dynamically at run
time which are collocal, it would be harder. The mmap stuff could be
in a stand-alone function of about a dozen lines. If the shared area
is allocated as one piece, substituting the single malloc() call with a
call to your mmap function should be simple. If you have many
malloc()s you're trying to replace, it's harder.
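As a rough illustration of the two ways of deciding which processes are collocal, the sketch below picks a "leader" rank per node. The function names are hypothetical. In the dynamic case I am assuming each rank has already obtained every rank's host name, for instance via MPI_Get_processor_name followed by an MPI_Allgather; that gathering step is not shown.

```c
#include <string.h>  /* strcmp */

/* Static case: `ppn` consecutive ranks share a node ("every 4 processes
 * are collocal" corresponds to ppn == 4). The lowest rank on each node
 * is taken to be the one that creates the shared memory. */
int leader_static(int rank, int ppn)
{
    return rank - rank % ppn;
}

/* Dynamic case: names[i] is the host name reported by rank i.
 * Returns the lowest rank whose host name matches that of `rank`. */
int leader_dynamic(const char *const names[], int rank)
{
    for (int i = 0; i < rank; i++)
        if (strcmp(names[i], names[rank]) == 0)
            return i;
    return rank;  /* no lower rank on this host: this rank leads */
}
```

A rank for which the returned leader equals its own rank would create and initialize the shared area; all other ranks would just map it.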
Andrei Fokau wrote:
The data are read from a file and
processed before calculations begin, so I think that mapping will not
work in our case.
Global Arrays look promising indeed. As I said, we need to put just a
part of the data in the shared section. John, do you (or maybe other
users) have any experience working with GA?
When GA runs with MPI:
! start MPI
! start global arrays
! start memory allocator
.... do work
! tidy up global arrays
! tidy up MPI
! exit program
On Fri, Sep 24, 2010 at 13:44, Reuti <email@example.com> wrote:
On 24.09.2010 at 13:26, John Hearns wrote:
I was also thinking of this when I read "data in a shared memory"
(besides approaches like http://www.kerrighed.org/wiki/index.php/Main_Page).
Wasn't this also one idea behind "High Performance Fortran": running
in parallel across nodes without even knowing, while programming, that
it's across nodes at all, and accessing all data as if it were local?
> On 24 September 2010 08:46, Andrei Fokau <firstname.lastname@example.org> wrote:
>> We use a C-program which consumes a lot of memory per process (up to few
>> GB), 99% of the data being the same for each process. So for us it would be
>> quite reasonable to put that part of data in a shared memory.
> Is this any help? Apologies if I'm talking through my hat.