Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2010-06-10 04:57:54

On Thu, 10 Jun 2010, Paul H. Hargrove wrote:

> One should not ignore the option of POSIX shared memory: shm_open() and
> shm_unlink(). When present this mechanism usually does not suffer from
> the small (e.g. 32MB) limits of SysV, and uses a "filename" (in an
> abstract namespace) which can portably be up to 14 characters in length.
> Because shm_unlink() may be called as soon as the final process has done
> its shm_open() one can get approximately the safety of the IPC_RMID
> mechanism, but w/o being restricted to Linux.
> I have used POSIX shared memory for another project and found it works
> well on Linux, Solaris (10 and Open), FreeBSD and AIX. That is probably
> a narrower coverage than SysV, but still worth consideration IMHO.
I was just researching shm_open() to make sure it had no such limitation
before introducing it in this thread. You saved me some time!
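
For reference, here is a minimal sketch of the pattern Paul describes: create a POSIX shared memory object, map it, and unlink the name once everyone has attached. The helper name posix_shm_attach() and the object name used in the example are hypothetical and are not part of Open MPI. Error handling is reduced to returning NULL, and on older glibc you may need to link with -lrt.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (and optionally create) a POSIX shared memory
 * object called `name`, then map `size` bytes of it read/write.
 * Returns the mapping, or NULL on any failure. */
static void *posix_shm_attach(const char *name, size_t size, int create)
{
    assert(size > 0);

    int flags = create ? (O_CREAT | O_RDWR) : O_RDWR;
    int fd = shm_open(name, flags, 0600);
    if (fd < 0)
        return NULL;

    /* A new object has length zero, so the creator must size it. */
    if (create && ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

Once the last process has attached, one of them can call shm_unlink(name). Existing mappings keep working, and the object is reclaimed when the last mapping goes away, even if the job dies abnormally. Keeping the name short (a leading slash plus a few characters) stays within the 14-character portable limit mentioned above.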

> With mmap(), SysV and POSIX (plus XPMEM on the SGI Altix) as mechanisms
> for sharing memory between processes, I think we have an argument for a
> full-blown "shared pages" framework as opposed to just a "mpi_common_sm"
> MCA parameter. That brings all the benefits like possibly "failing
> over" from one component to another (otherwise less desired) one if some
> limit is exceeded. For instance, SysV could (for a given set of
> priorities) be used by default, but mmap-on-real-fs could be
> automatically selected when the requested/required size exceeds the
> shmmax value.
That would indeed be nice.
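
To make the fail-over idea concrete, here is a rough sketch of the selection rule from the example above: prefer SysV, and fall back to mmap on a real filesystem when the requested size exceeds shmmax. The type sm_backend_t and both function names are invented for illustration and do not correspond to any existing MCA framework. Reading /proc/sys/kernel/shmmax is a Linux-specific assumption; other systems expose the limit differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical backend identifiers, for illustration only. */
typedef enum { BACKEND_SYSV, BACKEND_MMAP_FILE } sm_backend_t;

/* Read the SysV segment size limit on Linux.
 * Returns 0 if the limit cannot be determined. */
static unsigned long long read_shmmax(void)
{
    unsigned long long value = 0;
    FILE *f = fopen("/proc/sys/kernel/shmmax", "r");
    if (f == NULL)
        return 0;
    if (fscanf(f, "%llu", &value) != 1)
        value = 0;
    fclose(f);
    return value;
}

/* Prefer SysV when the request fits under shmmax; otherwise (or when the
 * limit is unknown, passed as 0) fall back to a file-backed mmap. */
static sm_backend_t choose_backend(size_t requested, unsigned long long shmmax)
{
    assert(requested > 0);
    if (shmmax != 0 && requested <= shmmax)
        return BACKEND_SYSV;
    return BACKEND_MMAP_FILE;
}
```

A caller would do something like choose_backend(size, read_shmmax()). A real framework would of course also weigh component priorities and check that the chosen mechanism actually works at run time.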

> As for why mmap is slower: when the file is on a real (not tmpfs or other
> ramdisk) filesystem, I am 95% certain that this is an artifact of the Linux swapper/pager
> behavior which is thinking it is being smart by "swapping ahead". Even when
> there is no memory pressure that requires swapping, Linux starts queuing swap
> I/O for pages to keep the number of "clean" pages up when possible. This
> results in pages of the shared memory file being written out to the actual
> block device. Both the background I/O and the VM metadata updates contribute
> to the lost time. I say 95% certain because I have a colleague who looked
> into this phenomenon in another setting and I am recounting what he reported
> as clearly as I can remember, but might have misunderstood or inserted my own
> speculation by accident. A sufficiently motivated investigator (not me)
> could probably devise an experiment to verify this.
Interesting. Do you think this behavior of the Linux kernel would change
if the file was unlink()ed after attach?
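
To be clear about what I mean by "unlink after attach", here is a small sketch of the pattern for a file-backed mapping. The helper name map_then_unlink() is made up for this example. It is single-process for brevity; in a real job every peer would have to open and map the file before anyone unlinks it. It only shows the mechanics and says nothing about whether the pager behaves differently afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a backing file from a mkstemp() template
 * (a writable buffer ending in "XXXXXX"), map `size` bytes of it shared,
 * then remove the file name. Returns the mapping, or NULL on failure. */
static void *map_then_unlink(char *path_template, size_t size)
{
    assert(size > 0);

    int fd = mkstemp(path_template);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        unlink(path_template);
        return NULL;
    }

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    /* The name disappears here; the pages stay reachable through the
     * mapping until it is unmapped or the process exits. */
    unlink(path_template);

    return p == MAP_FAILED ? NULL : p;
}
```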