Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
From: Paul H. Hargrove (PHHargrove_at_[hidden])
Date: 2010-06-10 04:43:54


Sylvain Jeaugey wrote:
> On Wed, 9 Jun 2010, Jeff Squyres wrote:
>
>> On Jun 9, 2010, at 3:26 PM, Samuel K. Gutierrez wrote:
>>
>>> System V shared memory cleanup is a concern only if a process dies in
>>> between shmat and shmctl IPC_RMID. Shared memory segment cleanup
>>> should happen automagically in most cases, including abnormal process
>>> termination.
>>
>> Umm... right. Duh. I knew that.
>>
>> Really.
>>
>> So -- we're good!
>>
>> Let's open the discussion of making sysv the default on systems that
>> support the IPC_RMID behavior (which, AFAIK, is only Linux)...
> I'm sorry, but I think System V has many disadvantages over mmap.
>
> 1. As discussed before, cleaning is not as easy as for a file. It is a
> good thing to remove the shm segment after creation, but since
> problems often happen during shmget/shmat, there's still a high risk
> of letting things behind.
>
> 2. There are limits in the kernel you need to grow (kernel.shmall,
> kernel.shmmax). On most linux distribution, shmmax is 32MB, which does
> not permit the sysv mechanism to work. Mmapped files are unlimited.
>
> 3. Each shm segment is identified by a 32 bit integer. This namespace
> is small (and non-intuitive, as opposed to a file name), and the
> probability for a collision is not null, especially when you start
> creating multiple shared memory segments (for collectives, one-sided
> operations, ...).
>
> So, I'm a bit reluctant to work with System V mechanisms again. I
> don't think there is a *real* reason for System V to be faster than
> mmap, since it should just be memory. I'd rather find out why mmap is
> slower.
>
> Sylvain

One should not ignore the option of POSIX shared memory: shm_open() and
shm_unlink(). When present this mechanism usually does not suffer from
the small (e.g. 32MB) limits of SysV, and uses a "filename" (in an
abstract namespace) which can portably be up to 14 characters in length.
Because shm_unlink() may be called as soon as the final process has done
its shm_open() one can get approximately the safety of the IPC_RMID
mechanism, but w/o being restricted to Linux.
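
To make the shm_open()/shm_unlink() pattern concrete, here is a minimal
sketch in C. The helper name shm_create_and_unlink() and the object name
in the usage note are hypothetical, and the sketch assumes a single
creating process; with several peers the unlink would have to wait until
the last one has done its shm_open(). On older glibc this needs -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not Open MPI code): create a POSIX shared memory
 * object, size it, map it, and remove its name right away. The mapping
 * stays valid after shm_unlink(); the memory is released once the last
 * mapping is gone, even if the process dies abnormally. */
static void *shm_create_and_unlink(const char *name, size_t size)
{
    assert(name != NULL && size > 0);

    /* O_EXCL makes a name collision an error instead of a silent reuse. */
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Neither the descriptor nor the name is needed once mapped. */
    close(fd);
    shm_unlink(name);

    return (p == MAP_FAILED) ? NULL : p;
}
```

A caller would do something like shm_create_and_unlink("/ompi_sketch",
4096) and later munmap() the result. The name is kept short on purpose,
given the 14-character portability limit mentioned above.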

I have used POSIX shared memory for another project and found it works
well on Linux, Solaris (10 and Open), FreeBSD and AIX. That is probably
narrower coverage than SysV, but still worth consideration IMHO. With
mmap(), SysV and POSIX (plus XPMEM on the SGI Altix) as mechanisms for
sharing memory between processes, I think we have an argument for a
full-blown "shared pages" framework as opposed to just a "mpi_common_sm"
MCA parameter. That brings all the benefits like possibly "failing
over" from one component to another (otherwise less desired) one if some
limit is exceeded. For instance, SysV could (for a given set of
priorities) be used by default, but mmap-on-real-fs could be
automatically selected when the requested/required size exceeds the
shmmax value.
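
The fail-over decision in that last example could be sketched as below.
This is only an illustration, not a proposed framework API: the enum,
choose_backend() and read_limit_file() are hypothetical names, and
reading the limit from /proc/sys/kernel/shmmax is a Linux-specific
assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical backend identifiers for the sketch. */
enum sm_backend { SM_BACKEND_SYSV, SM_BACKEND_MMAP };

/* Prefer SysV, but fall back to mmap when the request exceeds shmmax
 * or when the limit is unknown (reported as 0). */
static enum sm_backend choose_backend(unsigned long long requested,
                                      unsigned long long shmmax)
{
    if (shmmax != 0 && requested <= shmmax)
        return SM_BACKEND_SYSV;
    return SM_BACKEND_MMAP;
}

/* Read a single unsigned value from a file such as
 * /proc/sys/kernel/shmmax. Returns 0 if the file cannot be read. */
static unsigned long long read_limit_file(const char *path)
{
    assert(path != NULL);

    unsigned long long value = 0;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;
    if (fscanf(f, "%llu", &value) != 1)
        value = 0;
    fclose(f);
    return value;
}
```

With a 32MB shmmax, choose_backend(64ULL << 20,
read_limit_file("/proc/sys/kernel/shmmax")) would select the mmap
backend, while a 1MB request would stay on SysV.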

As for why mmap is slower: when the file is on a real filesystem (not
tmpfs or other ramdisk), I am 95% certain that this is an artifact of the Linux
swapper/pager behavior which is thinking it is being smart by "swapping
ahead". Even when there is no memory pressure that requires swapping,
Linux starts queuing swap I/O for pages to keep the number of "clean"
pages up when possible. This results in pages of the shared memory file
being written out to the actual block device. Both the background I/O
and the VM metadata updates contribute to the lost time. I say 95%
certain because I have a colleague who looked into this phenomenon in
another setting and I am recounting what he reported as clearly as I can
remember, but might have misunderstood or inserted my own speculation by
accident. A sufficiently motivated investigator (not me) could probably
devise an experiment to verify this.
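
One possible starting point for such an experiment, as a rough sketch
only: time how long it takes to repeatedly dirty every page of a
MAP_SHARED file mapping, then compare a path on a real filesystem with
one on tmpfs. The helper name time_mmap_writes() and the paths in the
usage note are hypothetical, and this has not been validated as a
faithful reproduction of the slowdown discussed above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical probe: map `size` bytes of a fresh file at `path`, write
 * one byte to every page `passes` times, and return the elapsed seconds
 * (or -1.0 on error). The file is unlinked as soon as it is mapped. */
static double time_mmap_writes(const char *path, size_t size, int passes)
{
    assert(path != NULL && size > 0 && passes > 0);

    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        unlink(path);
        return -1.0;
    }

    volatile unsigned char *p =
        mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    unlink(path);
    if (p == MAP_FAILED)
        return -1.0;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < passes; i++)
        for (size_t off = 0; off < size; off += page)
            p[off] = (unsigned char)i;   /* dirty each page */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munmap((void *)p, size);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Running it with the same size and pass count against, say, a file under
/tmp on a disk-backed filesystem and one under /dev/shm would give two
numbers to compare; watching block I/O counters during the runs would be
needed to tie any difference to writeback.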

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900