Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM backing file size
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-11-14 09:22:34


Ok. Should be pretty easy to test/simulate to figure out what's going
on -- e.g., whether it's segv'ing in MPI_INIT or the first MPI_SEND.

On Nov 14, 2008, at 9:19 AM, Ralph Castain wrote:

> Until we do complete the switch, and for systems that do not support
> the alternate type of shared memory (I believe it is only Linux?), I
> truly believe we should do something nicer than segv.
>
> Just to clarify: I know the segv case was done with paffinity set,
> and believe both cases were done that way. In the first case, I was
> told that the segv hit when they did MPI_Send, but I did not
> personally verify that claim - it could be that it hit during
> maffinity binding if, as you suggest, we actually touch the page at
> that time.
>
> Ralph
>
>
>
> On Nov 14, 2008, at 7:07 AM, Jeff Squyres wrote:
>
>> It's been a looooong time since I've looked at the sm code; Eugene
>> has looked at it much more in-depth recently than I have. But I'm
>> guessing we *haven't* checked this stuff to abort nicely in such
>> error conditions. We might very well succeed in the mmap but then
>> segv later when the memory isn't actually available. Perhaps we
>> should try to touch every page first to ensure that it's actually
>> there...? (I'm pretty sure we do this when using paffinity, so that
>> maffinity can bind memory to processors -- perhaps we're not
>> doing that in the !paffinity case?)
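One way to make the "succeeds at mmap, segv's later" failure visible up front can be sketched as below. This is not what the sm code does, and it swaps in a different technique from the page-touching suggested above: it preallocates the file's blocks with `posix_fallocate` before anything maps the file, so a full filesystem should show up as an ordinary error. The function name `reserve_backing_file` and the path are hypothetical, and the sketch assumes a Unix system where Python exposes `os.posix_fallocate`.

```python
import os


def reserve_backing_file(path, size):
    """Create `path` and reserve `size` bytes of real storage for it.

    Returns an open file descriptor on success. Raises RuntimeError if the
    filesystem cannot supply the space, instead of leaving a sparse file
    whose missing pages would only be noticed on first access.
    """
    # O_EXCL: only proceed if we created the file ourselves, so the
    # cleanup below never deletes somebody else's file.
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        # Ask the filesystem to allocate the blocks now.
        os.posix_fallocate(fd, 0, size)
    except OSError as exc:
        os.close(fd)
        os.unlink(path)
        raise RuntimeError(
            f"cannot reserve {size} bytes for {path}: {exc.strerror}"
        ) from exc
    return fd


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, for illustration only.
        fd = reserve_backing_file(os.path.join(tmp, "backing_file"), 1 << 20)
        print("reserved", os.fstat(fd).st_size, "bytes")
        os.close(fd)
```

The caller can then warn and disable shared memory, or abort with a message, when `RuntimeError` is raised.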
>>
>> Additionally, another solution might well be what Tim has long
>> advocated: switch to the other type of shared memory on systems
>> that support auto-pruning it when all processes die, and/or have
>> the orted kill it when all processes die. Then a) we're not
>> dependent on the filesystem free space, and b) we're not writing
>> all the dirty pages to disk when the processes exit.
>>
>>
>>
>> On Nov 14, 2008, at 8:42 AM, Ralph Castain wrote:
>>
>>> Hi Eugene
>>>
>>> I too am interested - I think we need to do something about the sm
>>> backing file situation as machines with larger core counts are
>>> slated to become more prevalent shortly.
>>>
>>> I appreciate your info on the sizes and controls. One other
>>> question: what happens when there isn't enough memory to support
>>> all this? Are we smart enough to detect this situation? Does the
>>> sm subsystem quietly shut down? Warn and shut down? Segfault?
>>>
>>> I have two examples so far:
>>>
>>> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
>>> node, 2ppn, with btl=openib,sm,self. The program started, but
>>> segfaulted on the first MPI_Send. No warnings were printed.
>>>
>>> 2. again with a ramdisk, /tmp was reportedly set to 16MB
>>> (unverified - some uncertainty, could have been much larger).
>>> OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
>>> The program ran to completion without errors or warnings. I don't
>>> know the communication pattern - could be no local comm was
>>> performed, though that sounds doubtful.
>>>
>>> If someone doesn't know, I'll have to dig into the code and figure
>>> out the response - just hoping that someone can spare me the pain.
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
>>>
>>>> Ralph Castain wrote:
>>>>
>>>>> As has frequently been commented upon at one time or another,
>>>>> the shared memory backing file can be quite huge. There used to
>>>>> be a param for controlling this size, but I can't find it in
>>>>> 1.3 - or at least, the name or method for controlling file size
>>>>> has morphed into something I don't recognize.
>>>>>
>>>>> Can someone more familiar with that subsystem point me to one or
>>>>> more params that will allow us to control the size of that
>>>>> file? It is swamping our systems and causing OMPI to segfault.
>>>>
>>>> Sounds like you've already gotten your answers, but I'll add my
>>>> $0.02 anyhow.
>>>>
>>>> The file size is the number of local processes (call it n) times
>>>> mpool_sm_per_peer_size (default 32M), but with a minimum of
>>>> mpool_sm_min_size (default 128M) and a maximum of
>>>> mpool_sm_max_size (default 2G? 256M?). So, you can tweak those
>>>> parameters to control file size.
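The sizing rule described above can be sketched as a small clamp. This is a reading of the description, not code taken from OMPI. It assumes "M" means 2^20 bytes, and because the default for mpool_sm_max_size is uncertain in the text, the maximum is a required argument rather than a guessed default. The function name `backing_file_size` is made up for illustration.

```python
MiB = 1024 * 1024


def backing_file_size(n_local, max_size, per_peer_size=32 * MiB,
                      min_size=128 * MiB):
    """Return n_local * per_peer_size, clamped to [min_size, max_size].

    per_peer_size and min_size default to the values quoted in the thread
    for mpool_sm_per_peer_size and mpool_sm_min_size. max_size stands in
    for mpool_sm_max_size and must be supplied by the caller.
    """
    return min(max(n_local * per_peer_size, min_size), max_size)


if __name__ == "__main__":
    # The 2048 MiB ceiling is only an example value.
    for n in (2, 16, 128):
        size = backing_file_size(n, max_size=2048 * MiB)
        print(n, "local procs ->", size // MiB, "MiB")
```

Under this reading, a 2-process job still asks for the 128M minimum, which is far more than a 10MB /tmp can hold.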
>>>>
>>>> Another issue is possibly how small a backing file you can get
>>>> away with. That is, just forcing the file to be smaller may not
>>>> be enough since your job may no longer run. The backing file
>>>> seems to be used mainly by:
>>>>
>>>> *) eager-fragment free lists: We start with enough eager
>>>> fragments so that we could have two per connection. So, you
>>>> could bump the sm eager size down if you need to shoehorn a job
>>>> into a very small backing file.
>>>>
>>>> *) large-fragment free lists: We start with 8*n large
>>>> fragments. If this term plagues you, you can bump the sm chunk
>>>> size down or reduce the value of 8 (using btl_sm_free_list_num, I
>>>> think).
>>>>
>>>> *) FIFOs: The code tries to align a number of things on pagesize
>>>> boundaries, so you end up with about 3*n*n*pagesize overhead
>>>> here. If this term is causing you problems, you're stuck (unless
>>>> you modify OMPI).
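The three terms above can be added up in a rough estimator. This is only a sketch under stated assumptions: "connections" is taken to mean ordered pairs of distinct local processes, n*(n-1); the 8*n large fragments are counted once for the whole node; and per-fragment headers and alignment padding are ignored. The eager size, chunk size and page size have no defaults here because the thread does not give them. The function name `sm_initial_usage` is hypothetical.

```python
def sm_initial_usage(n, eager_size, chunk_size, page_size, free_list_num=8):
    """Rough initial backing-file usage in bytes for n local processes.

    eager: two eager fragments per connection (assumed n*(n-1) connections)
    large: free_list_num * n large fragments of chunk_size bytes each
    fifos: about 3 * n * n pages of alignment overhead
    """
    connections = n * (n - 1)
    eager = 2 * connections * eager_size
    large = free_list_num * n * chunk_size
    fifos = 3 * n * n * page_size
    return {
        "eager": eager,
        "large": large,
        "fifos": fifos,
        "total": eager + large + fifos,
    }


if __name__ == "__main__":
    # Example values only; substitute the sizes your build actually uses.
    for n in (2, 16, 64):
        usage = sm_initial_usage(n, eager_size=4096, chunk_size=32768,
                                 page_size=4096)
        print(n, usage)
```

The FIFO term grows with n squared, so on large nodes it can come to dominate, and it is the one term that no parameter controls.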
>>>>
>>>> I'm interested in this subject! :^)
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems

-- 
Jeff Squyres
Cisco Systems