
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SM backing file size
From: Tim Mattox (timattox_at_[hidden])
Date: 2008-11-14 09:56:47


Rich,
The forking requirement is true if you are using anonymous mmap.
It is not true when using SYSV shm segments.
I did this over a decade ago for a non-MPI communications library.
For Linux, this is a no-brainer... I just need time to code it up to fit within
the Open MPI infrastructure. The key "secret sauce" for Linux that
guarantees the shm segment gets cleaned up is listed in
the ticket: https://svn.open-mpi.org/trac/ompi/ticket/1320
The "secret sauce" is that on Linux, shmctl(shared_id, IPC_RMID, NULL);
does not remove the segment right away; it just marks it for deletion
once the last user detaches or exits.
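
A minimal sketch of that pattern, assuming Linux semantics for IPC_RMID. The function names create_self_cleaning_segment and release_segment are made up for illustration and are not Open MPI code; how peers learn the segment id and attach is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: create a private SYSV segment, attach it, and
 * immediately mark it for removal. Returns the attached address, or
 * NULL on failure. The segment id is stored in *shm_id_out if non-NULL. */
void *create_self_cleaning_segment(size_t size, int *shm_id_out)
{
    assert(size > 0);

    int shm_id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (shm_id < 0) {
        perror("shmget");
        return NULL;
    }

    void *addr = shmat(shm_id, NULL, 0);
    if (addr == (void *) -1) {
        perror("shmat");
        shmctl(shm_id, IPC_RMID, NULL);   /* nothing attached: remove now */
        return NULL;
    }

    /* Mark the segment for deletion. On Linux it stays usable until the
     * last attached process detaches or exits, then the kernel frees it. */
    if (shmctl(shm_id, IPC_RMID, NULL) < 0) {
        perror("shmctl(IPC_RMID)");
        shmdt(addr);
        return NULL;
    }

    if (shm_id_out != NULL) {
        *shm_id_out = shm_id;
    }
    return addr;
}

/* Hypothetical helper: detach; the kernel reclaims the segment once no
 * attachments remain. Returns 0 on success, -1 on error. */
int release_segment(void *addr)
{
    return shmdt(addr);
}
```

Because the segment is already marked for removal, a crash after this point should leave nothing behind on Linux. On other systems IPC_RMID may behave differently, so the mark would have to be deferred and done by some cleanup agent.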

For non-linux systems, I think SYSV shm should still be an option, but
would need to arrange for the shm segment to be cleaned up by the local orted
when things are shutting down. AFAIK, this last bit is why mmap'ed
files were originally chosen, since even if the orted failed to remove the
mmaped file from /tmp, the only leftover was some wasted disk space (assuming
/tmp wasn't a ram disk :-). With SYSV shm on a non-linux system,
or a linux system without applying the "secret sauce", the leftover shm segment
would consume real memory that would interfere with subsequent programs.
This latter bit caused no end of headaches for our OS Lab class at Purdue in
the mid 1990s before we found the "secret sauce" in Linux. But I digress.

2008/11/14 Richard Graham <rlgraham_at_[hidden]>:
> Just a few comments:
> - not sure what sort of alternative memory approach is being considered.
> The current approach was selected for two reasons:
> - If something like anonymous memory is being used, one can only inherit
> access to the shared files, so one process needs to
> set up the shared memory regions, and then fork() the procs that will
> use it. This usually implies that to do this portably,
> this needs to happen inside of MPI_Init(), so up to that stage only
> one process runs on each host. Also, unrelated procs can't
> access this memory – can't use this in the context of Fault Tolerance.
> - The approach used here is very efficient for small systems, so
> alternatives should be added to what is in place here, so we
> don't lose the performance potential on small SMPs, which still
> describes the vast majority of systems.
>
> Rich
>
>
> On 11/14/08 9:22 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>
> Ok. Should be pretty easy to test/simulate to figure out what's going
> on -- e.g., whether it's segv'ing in MPI_INIT or the first MPI_SEND.
>
>
> On Nov 14, 2008, at 9:19 AM, Ralph Castain wrote:
>
>> Until we do complete the switch, and for systems that do not support
>> the alternate type of shared memory (I believe it is only Linux?), I
>> truly believe we should do something nicer than segv.
>>
>> Just to clarify: I know the segv case was done with paffinity set,
>> and believe both cases were done that way. In the first case, I was
>> told that the segv hit when they did MPI_Send, but I did not
>> personally verify that claim - it could be that it hit during
>> maffinity binding if, as you suggest, we actually touch the page at
>> that time.
>>
>> Ralph
>>
>>
>>
>> On Nov 14, 2008, at 7:07 AM, Jeff Squyres wrote:
>>
>>> It's been a looooong time since I've looked at the sm code; Eugene
>>> has looked at it much more in-depth recently than I have. But I'm
>>> guessing we *haven't* checked this stuff to abort nicely in such
>>> error conditions. We might very well succeed in the mmap but then
>>> segv later when the memory isn't actually available. Perhaps we
>>> should try to touch every page first to ensure that it's actually
>>> there...? (I'm pretty sure we do this when using paffinity to
>>> ensure to maffinity bind memory to processors -- perhaps we're not
>>> doing that in the !paffinity case?)
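
A rough sketch of the "touch every page" idea, not taken from the sm code. The name touch_every_page is hypothetical. Note that if the backing filesystem is full, the write itself may raise SIGBUS, so this only moves the failure to a known point during startup; it does not turn it into an error return.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write one byte in each page of [base, base + len)
 * so the pages are actually allocated. The byte is rewritten with its own
 * value, so existing contents are preserved. Returns the number of pages
 * touched. */
size_t touch_every_page(void *base, size_t len, size_t page_size)
{
    assert(page_size > 0);

    volatile unsigned char *p = base;
    size_t touched = 0;

    for (size_t off = 0; off < len; off += page_size) {
        p[off] = p[off];   /* volatile forces a real read and write */
        touched++;
    }
    return touched;
}
```

In practice page_size would come from sysconf(_SC_PAGESIZE) and base/len from the mmap call.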
>>>
>>> Additionally, another solution might well be what Tim has long
>>> advocated: switch to the other type of shared memory on systems
>>> that support auto-pruning it when all processes die, and/or have
>>> the orted kill it when all processes die. Then a) we're not
>>> dependent on the filesystem free space, and b) we're not writing
>>> all the dirty pages to disk when the processes exit.
>>>
>>>
>>>
>>> On Nov 14, 2008, at 8:42 AM, Ralph Castain wrote:
>>>
>>>> Hi Eugene
>>>>
>>>> I too am interested - I think we need to do something about the sm
>>>> backing file situation as larger core machines are slated to
>>>> become more prevalent shortly.
>>>>
>>>> I appreciate your info on the sizes and controls. One other
>>>> question: what happens when there isn't enough memory to support
>>>> all this? Are we smart enough to detect this situation? Does the
>>>> sm subsystem quietly shut down? Warn and shut down? Segfault?
>>>>
>>>> I have two examples so far:
>>>>
>>>> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
>>>> node, 2ppn, with btl=openib,sm,self. The program started, but
>>>> segfaulted on the first MPI_Send. No warnings were printed.
>>>>
>>>> 2. again with a ramdisk, /tmp was reportedly set to 16MB
>>>> (unverified - some uncertainty, could have been much larger).
>>>> OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
>>>> The program ran to completion without errors or warning. I don't
>>>> know the communication pattern - could be no local comm was
>>>> performed, though that sounds doubtful.
>>>>
>>>> If someone doesn't know, I'll have to dig into the code and figure
>>>> out the response - just hoping that someone can spare me the pain.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>>
>>>> On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
>>>>
>>>>> Ralph Castain wrote:
>>>>>
>>>>>> As has frequently been commented upon at one time or another,
>>>>>> the shared memory backing file can be quite huge. There used to
>>>>>> be a param for controlling this size, but I can't find it in
>>>>>> 1.3 - or at least, the name or method for controlling file size
>>>>>> has morphed into something I don't recognize.
>>>>>>
>>>>>> Can someone more familiar with that subsystem point me to one or
>>>>>> more params that will allow us to control the size of that
>>>>>> file? It is swamping our systems and causing OMPI to segfault.
>>>>>
>>>>> Sounds like you've already gotten your answers, but I'll add my
>>>>> $0.02 anyhow.
>>>>>
>>>>> The file size is the number of local processes (call it n) times
>>>>> mpool_sm_per_peer_size (default 32M), but with a minimum of
>>>>> mpool_sm_min_size (default 128M) and a maximum of
>>>>> mpool_sm_max_size (default 2G? 256M?). So, you can tweak those
>>>>> parameters to control file size.
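
The sizing rule above can be written out as a small sketch. The function name sm_backing_file_size is hypothetical, and the limits are passed in explicitly because the default maximum is uncertain in the message itself.

```c
#include <assert.h>
#include <stddef.h>

#define MiB ((size_t) 1 << 20)

/* Hypothetical helper mirroring the rule described above:
 * size = n_local * per_peer, clamped to [min_size, max_size]. */
size_t sm_backing_file_size(size_t n_local, size_t per_peer,
                            size_t min_size, size_t max_size)
{
    assert(min_size <= max_size);

    size_t size = n_local * per_peer;
    if (size < min_size) {
        size = min_size;
    }
    if (size > max_size) {
        size = max_size;
    }
    return size;
}
```

With per_peer = 32 MiB and min = 128 MiB, a 2-process node sits at the 128 MiB minimum, while a 16-process node asks for 512 MiB unless the maximum is lower than that.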
>>>>>
>>>>> Another issue is possibly how small a backing file you can get
>>>>> away with. That is, just forcing the file to be smaller may not
>>>>> be enough since your job may no longer run. The backing file
>>>>> seems to be used mainly by:
>>>>>
>>>>> *) eager-fragment free lists: We start with enough eager
>>>>> fragments so that we could have two per connection. So, you
>>>>> could bump the sm eager size down if you need to shoehorn a job
>>>>> into a very small backing file.
>>>>>
>>>>> *) large-fragment free lists: We start with 8*n large
>>>>> fragments. If this term plagues you, you can bump the sm chunk
>>>>> size down or reduce the value of 8 (using btl_sm_free_list_num, I
>>>>> think).
>>>>>
>>>>> *) FIFOs: The code tries to align a number of things on pagesize
>>>>> boundaries, so you end up with about 3*n*n*pagesize overhead
>>>>> here. If this term is causing you problems, you're stuck (unless
>>>>> you modify OMPI).
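
As a back-of-the-envelope sketch of two of the terms above (the function names are hypothetical, and the formulas are the approximations given in the message, not values read from the code):

```c
#include <assert.h>
#include <stddef.h>

/* Approximate FIFO overhead: about 3 * n * n * pagesize bytes. */
size_t sm_fifo_overhead(size_t n_local, size_t page_size)
{
    assert(page_size > 0);
    return 3 * n_local * n_local * page_size;
}

/* Initial number of large fragments: free_list_num * n, where
 * free_list_num is the "8" mentioned above. */
size_t sm_initial_large_frags(size_t n_local, size_t free_list_num)
{
    return free_list_num * n_local;
}
```

For 16 local processes and 4 KiB pages, the FIFO term comes to 3 MiB. It grows with the square of the process count, which is why it matters more on large-core nodes.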
>>>>>
>>>>> I'm interested in this subject! :^)
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>
>

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmattox_at_[hidden] || timattox_at_[hidden]
    I'm a bright... http://www.the-brights.net/