
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Shared memory
From: Andrei Fokau (andrei.fokau_at_[hidden])
Date: 2010-10-06 10:08:33


Currently we run a code on a cluster with distributed memory, and this code
needs a lot of memory. Part of the data stored in memory is the same for
each process, but it is organized as one array - we can split it if
necessary. So far no magic has occurred for us. What do we need to do to make
the magic work?

On Wed, Oct 6, 2010 at 12:43, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:

> Open MPI will use shared memory to communicate between peers on the same
> node - but that's hidden beneath the covers; it's not exposed via the MPI
> API. You just MPI-send and magic occurs and the receiver gets the message.
>
> On Oct 4, 2010, at 11:13 AM, "Andrei Fokau" <andrei.fokau_at_[hidden]>
> wrote:
>
> Does OMPI have shared memory capabilities (as mentioned in MPI-2)?
> How can I use them?
>
> On Sat, Sep 25, 2010 at 23:19, Andrei Fokau <andrei.fokau_at_[hidden]> wrote:
>
>> Here are some more details about our problem. We use a dozen
>> 4-processor nodes with 8 GB memory on each node. The code we run needs about
>> 3 GB per processor, so we can load only 2 processors out of 4. The vast
>> majority of those 3 GB is the same for each processor and is
>> accessed continuously during calculation. In my original question I wasn't
>> very clear asking about a possibility to use shared memory with Open MPI -
>> in our case we do not need to have a remote access to the data, and it
>> would be sufficient to share memory within each node only.
>>
>> Of course, the possibility to access the data remotely (via mmap) is
>> attractive because it would allow us to store much larger arrays (up to 10 GB)
>> in one remote place, meaning higher accuracy for our calculations. However,
>> I believe that the access time would be too long for data read so
>> frequently, and the performance would therefore be lost.
>>
>> I still hope that some of the subscribers to this mailing list have
>> experience using Global Arrays. This library seems to be fine for our
>> case, but I feel that there should be a simpler solution. Open MPI
>> conforms to the MPI-2 standard, and the latter includes a description of
>> shared-memory use. Do you see any other way for us to use shared memory
>> (within a node) apart from using Global Arrays?
>>
>> On Fri, Sep 24, 2010 at 19:03, Durga Choudhury <dpchoudh_at_[hidden]> wrote:
>>
>>> I think the 'middle ground' approach can be simplified even further if
>>> the data file is in a shared device (e.g. NFS/Samba mount) that can be
>>> mounted at the same location of the file system tree on all nodes. I
>>> have never tried it, though, and mmap()'ing a non-POSIX-compliant file
>>> system such as Samba might have issues I am unaware of.
>>>
>>> However, I do not see why you should not be able to do this even if
>>> the file is being written to as long as you call msync() before using
>>> the mapped pages.
>>>
>>> Durga
>>>
>>>
>>> On Fri, Sep 24, 2010 at 12:31 PM, Eugene Loh <eugene.loh_at_[hidden]> wrote:
>>> > It seems to me there are two extremes.
>>> >
>>> > One is that you replicate the data for each process. This has the
>>> > disadvantage of consuming lots of memory "unnecessarily."
>>> >
>>> > Another extreme is that shared data is distributed over all processes.
>>> This
>>> > has the disadvantage of making at least some of the data less
>>> accessible,
>>> > whether in programming complexity and/or run-time performance.
>>> >
>>> > I'm not familiar with Global Arrays. I was somewhat familiar with
>>> HPF. I
>>> > think the natural thing to do with those programming models is to
>>> distribute
>>> > data over all processes, which may relieve the excessive memory
>>> consumption
>>> > you're trying to address but which may also just put you at a different
>>> > "extreme" of this spectrum.
>>> >
>>> > The middle ground I think might make most sense would be to share data
>>> only
>>> > within a node, but to replicate the data for each node. There are
>>> probably
>>> > multiple ways of doing this -- possibly even GA, I don't know. One way
>>> > might be to use one MPI process per node, with OMP multithreading
>>> within
>>> > each process|node. Or (and I thought this was the solution you were
>>> looking
>>> > for), have some idea which processes are collocal. Have one process
>>> per
>>> > node create and initialize some shared memory -- mmap, perhaps, or SysV
>>> > shared memory. Then, have its peers map the same shared memory into
>>> their
>>> > address spaces.
>>> >
>>> > You asked what source code changes would be required. It depends. If
>>> > you're going to mmap shared memory in on each node, you need to know
>>> which
>>> > processes are collocal. If you're willing to constrain how processes
>>> are
>>> > mapped to nodes, this could be easy. (E.g., "every 4 processes are
>>> > collocal".) If you want to discover dynamically at run time which are
>>> > collocal, it would be harder. The mmap stuff could be in a stand-alone
>>> > function of about a dozen lines. If the shared area is allocated as
>>> one
>>> > piece, substituting the single malloc() call with a call to your mmap
>>> > function should be simple. If you have many malloc()s you're trying to
>>> > replace, it's harder.
>>> >
>>> > Andrei Fokau wrote:
>>> >
>>> > The data are read from a file and processed before calculations begin,
>>> so I
>>> > think that mapping will not work in our case.
>>> > Global Arrays look promising indeed. As I said, we need to put just a
>>> part
>>> > of the data in the shared section. John, do you (or maybe other users)
>>> > have experience working with GA?
>>> > http://www.emsl.pnl.gov/docs/global/um/build.html
>>> > When GA runs with MPI:
>>> > MPI_Init(..) ! start MPI
>>> > GA_Initialize() ! start global arrays
>>> > MA_Init(..) ! start memory allocator
>>> > .... do work
>>> > GA_Terminate() ! tidy up global arrays
>>> > MPI_Finalize() ! tidy up MPI
>>> > ! exit program
>>> > On Fri, Sep 24, 2010 at 13:44, Reuti <reuti_at_[hidden]> wrote:
>>> >>
>>> >> Am 24.09.2010 um 13:26 schrieb John Hearns:
>>> >>
>>> >> > On 24 September 2010 08:46, Andrei Fokau <andrei.fokau_at_[hidden]> wrote:
>>> >> >> We use a C-program which consumes a lot of memory per process (up
>>> >> >> to a few GB), 99% of the data being the same for each process. So
>>> >> >> for us it would be quite reasonable to put that part of the data in
>>> >> >> shared memory.
>>> >> >
>>> >> > http://www.emsl.pnl.gov/docs/global/
>>> >> >
>>> >> > Is this any help? Apologies if I'm talking through my hat.
>>> >>
>>> >> I was also thinking of this when I read "data in a shared memory" (besides
>>> >> approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't
>>> >> this also one idea behind "High Performance Fortran" - running in parallel
>>> >> across nodes without even knowing, while programming, that it runs across
>>> >> nodes, and accessing all data as if it were local?
>>> >