Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SM backing file size
From: Richard Graham (rlgraham_at_[hidden])
Date: 2008-11-14 11:13:35


Agreed.

On 11/14/08 9:56 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:

>
> On Nov 14, 2008, at 7:41 AM, Richard Graham wrote:
>
>> Just a few comments:
>> - Not sure what sort of alternative memory approach is being considered.
>> The current approach was selected for two reasons:
>> - If something like anonymous memory is used, access to the shared
>> regions can only be inherited, so one process needs to set up the
>> shared memory and then fork() the procs that will use it (see the
>> sketch below). Doing this portably usually means it has to happen
>> inside MPI_Init(), so up to that stage only one process runs on each
>> host. Also, unrelated procs can't access this memory - so it can't be
>> used in the context of Fault Tolerance.
>> - The approach used here is very efficient for small systems, so
>> alternatives should be added to what is already in place; we don't
>> want to lose the performance potential on small SMPs, which still
>> describe the vast majority of systems.
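
A minimal sketch of the inheritance constraint described above, assuming
plain MAP_ANONYMOUS memory (illustrative C, not the actual OMPI code):
the mapping has no name in the filesystem, so only processes forked
after the mmap() can ever see it.

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Anonymous shared memory: no backing file, but also no name,
         * so unrelated processes have no way to attach to it. */
        char *seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (seg == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        if (fork() == 0) {              /* child inherits the mapping */
            strcpy(seg, "hello from the child");
            return 0;
        }
        wait(NULL);
        printf("parent sees: %s\n", seg);  /* shared, not copy-on-write */
        munmap(seg, 4096);
        return 0;
    }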
>
> I concur - however, note that the segv occurred on a 4ppn system, which I
> think we would all agree constitutes a small SMP. I believe that the
> alternative memory approach needs to be a separate component, but I also
> believe that we need to modify the existing component so it doesn't segv if
> adequate memory isn't found.
>
> Just my $0.02
>
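
One way the existing component could fail cleanly rather than segv - a
sketch only, assuming the backing file is created with open()/mmap() in
/tmp: posix_fallocate() reports ENOSPC at setup time, whereas
ftruncate() just creates a sparse file and defers the failure to a
crash on first touch.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Reserve the whole backing file up front so a too-small /tmp is
     * reported as an error here instead of a crash in MPI_Send. */
    static void *create_backing(const char *path, size_t size)
    {
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return NULL;

        int rc = posix_fallocate(fd, 0, (off_t)size);
        if (rc != 0) {                  /* e.g. ENOSPC on a 10MB ramdisk */
            fprintf(stderr, "sm backing file %s: %s\n", path, strerror(rc));
            close(fd);
            unlink(path);
            return NULL;                /* caller can skip the sm BTL */
        }

        void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);
        return seg == MAP_FAILED ? NULL : seg;
    }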
>>
>>
>> Rich
>>
>>
>> On 11/14/08 9:22 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>
>>
>>> Ok. Should be pretty easy to test/simulate to figure out what's going
>>> on -- e.g., whether it's segv'ing in MPI_INIT or the first MPI_SEND.
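
Something like this hypothetical two-rank test would separate the two
cases, run with /tmp deliberately constrained:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: survived MPI_Init\n", rank);
        fflush(stdout);

        /* First sm traffic: if the backing file is over-committed,
         * the crash should show up here, not in MPI_Init. */
        if (rank == 0) MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        if (rank == 1) MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                                MPI_STATUS_IGNORE);
        printf("rank %d: survived first send/recv\n", rank);
        MPI_Finalize();
        return 0;
    }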
>>>
>>>
>>> On Nov 14, 2008, at 9:19 AM, Ralph Castain wrote:
>>>
>>>> > Until we do complete the switch, and for systems that do not support
>>>> > the alternate type of shared memory (I believe it is only Linux?), I
>>>> > truly believe we should do something nicer than segv.
>>>> >
>>>> > Just to clarify: I know the segv case was done with paffinity set,
>>>> > and believe both cases were done that way. In the first case, I was
>>>> > told that the segv hit when they did MPI_Send, but I did not
>>>> > personally verify that claim - it could be that it hit during
>>>> > maffinity binding if, as you suggest, we actually touch the page at
>>>> > that time.
>>>> >
>>>> > Ralph
>>>> >
>>>> >
>>>> >
>>>> > On Nov 14, 2008, at 7:07 AM, Jeff Squyres wrote:
>>>> >
>>>>> >> It's been a looooong time since I've looked at the sm code; Eugene
>>>>> >> has looked at it much more in-depth recently than I have. But I'm
>>>>> >> guessing we *haven't* checked this stuff to abort nicely in such
>>>>> >> error conditions. We might very well succeed in the mmap but then
>>>>> >> segv later when the memory isn't actually available. Perhaps we
>>>>> >> should try to touch every page first to ensure that it's actually
>>>>> >> there...? (I'm pretty sure we do this when using paffinity, so
>>>>> >> that maffinity actually binds memory to processors -- perhaps
>>>>> >> we're not doing that in the !paffinity case?)
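
Touching every page could look roughly like this (a sketch; it assumes
the over-committed file-backed mapping raises SIGBUS on first touch,
which is the usual behavior on Linux):

    #include <setjmp.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static sigjmp_buf jump;

    static void on_sigbus(int sig) { (void)sig; siglongjmp(jump, 1); }

    /* Write one byte per page so every page is materialized now;
     * return -1 if the backing store runs out instead of letting the
     * job die later in the middle of a send. */
    static int touch_pages(char *base, size_t len)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        struct sigaction sa, old;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigbus;
        sigaction(SIGBUS, &sa, &old);

        int rc = 0;
        if (sigsetjmp(jump, 1) == 0) {
            for (size_t off = 0; off < len; off += (size_t)pagesz)
                base[off] = 0;
        } else {
            rc = -1;                /* page wasn't really there */
        }
        sigaction(SIGBUS, &old, NULL);
        return rc;
    }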
>>>>> >>
>>>>> >> Additionally, another solution might well be what Tim has long
>>>>> >> advocated: switch to the other type of shared memory on systems
>>>>> >> that support auto-pruning it when all processes die, and/or have
>>>>> >> the orted kill it when all processes die. Then a) we're not
>>>>> >> dependent on the filesystem free space, and b) we're not writing
>>>>> >> all the dirty pages to disk when the processes exit.
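
If the "other type" is POSIX shared memory - an assumption on my part -
the unlink-after-attach idiom gives exactly those two properties (link
with -lrt on older Linux):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* One process creates and sizes the segment; peers shm_open() the
     * same name.  Once everyone has attached, shm_unlink() removes the
     * name: the memory persists until the last mapping goes away, then
     * the kernel frees it.  No stale file, no dirty-page writeback. */
    void *attach_segment(const char *name, size_t size, int creator)
    {
        int fd = shm_open(name, creator ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
        if (fd < 0) return NULL;
        if (creator && ftruncate(fd, (off_t)size) != 0) {
            close(fd);
            shm_unlink(name);
            return NULL;
        }
        void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);
        return seg == MAP_FAILED ? NULL : seg;
    }

    /* after all local procs have attached, e.g. after a barrier:
     *     shm_unlink(name);                                        */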
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Nov 14, 2008, at 8:42 AM, Ralph Castain wrote:
>>>>> >>
>>>>>> >>> Hi Eugene
>>>>>> >>>
>>>>>> >>> I too am interested - I think we need to do something about the sm
>>>>>> >>> backing file situation, as machines with larger core counts are
>>>>>> >>> slated to become more prevalent shortly.
>>>>>> >>>
>>>>>> >>> I appreciate your info on the sizes and controls. One other
>>>>>> >>> question: what happens when there isn't enough memory to support
>>>>>> >>> all this? Are we smart enough to detect this situation? Does the
>>>>>> >>> sm subsystem quietly shut down? Warn and shut down? Segfault?
>>>>>> >>>
>>>>>> >>> I have two examples so far:
>>>>>> >>>
>>>>>> >>> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
>>>>>> >>> node, 2ppn, with btl=openib,sm,self. The program started, but
>>>>>> >>> segfaulted on the first MPI_Send. No warnings were printed.
>>>>>> >>>
>>>>>> >>> 2. Again with a ramdisk, /tmp was reportedly set to 16MB
>>>>>> >>> (unverified - some uncertainty; it could have been much larger).
>>>>>> >>> OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
>>>>>> >>> The program ran to completion without errors or warning. I don't
>>>>>> >>> know the communication pattern - could be no local comm was
>>>>>> >>> performed, though that sounds doubtful.
>>>>>> >>>
>>>>>> >>> If someone doesn't know, I'll have to dig into the code and figure
>>>>>> >>> out the response - just hoping that someone can spare me the pain.
>>>>>> >>>
>>>>>> >>> Thanks
>>>>>> >>> Ralph
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
>>>>>> >>>
>>>>>>> >>>> Ralph Castain wrote:
>>>>>>> >>>>
>>>>>>>> >>>>> As has frequently been commented upon at one time or another,
>>>>>>>> >>>>> the shared memory backing file can be quite huge. There used to
>>>>>>>> >>>>> be a param for controlling this size, but I can't find it in
>>>>>>>> >>>>> 1.3 - or at least, the name or method for controlling file size
>>>>>>>> >>>>> has morphed into something I don't recognize.
>>>>>>>> >>>>>
>>>>>>>> >>>>> Can someone more familiar with that subsystem point me to one or
>>>>>>>> >>>>> more params that will allow us to control the size of that
>>>>>>>> >>>>> file? It is swamping our systems and causing OMPI to segfault.
>>>>>>> >>>>
>>>>>>> >>>> Sounds like you've already gotten your answers, but I'll add my
>>>>>>> >>>> $0.02 anyhow.
>>>>>>> >>>>
>>>>>>> >>>> The file size is the number of local processes (call it n) times
>>>>>>> >>>> mpool_sm_per_peer_size (default 32M), but with a minimum of
>>>>>>> >>>> mpool_sm_min_size (default 128M) and a maximum of
>>>>>>> >>>> mpool_sm_max_size (default 2G? 256M?). So, you can tweak those
>>>>>>> >>>> parameters to control file size.
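
Spelled out as code (a sketch of the sizing rule as described above;
the defaults are the ones quoted, taking 2G for the uncertain max):

    #include <stdio.h>

    /* size = clamp(n_local_procs * per_peer, min, max); each constant
     * is an MCA parameter (mpool_sm_per_peer_size, mpool_sm_min_size,
     * mpool_sm_max_size) and can be lowered on the mpirun command line. */
    static size_t sm_file_size(size_t n_local_procs)
    {
        const size_t per_peer = (size_t)32 << 20;   /* 32M  */
        const size_t min_size = (size_t)128 << 20;  /* 128M */
        const size_t max_size = (size_t)2 << 30;    /* 2G   */

        size_t size = n_local_procs * per_peer;
        if (size < min_size) size = min_size;
        if (size > max_size) size = max_size;
        return size;
    }

    int main(void)
    {
        printf("4 ppn  -> %zu MB\n", sm_file_size(4) >> 20);   /* 128 */
        printf("16 ppn -> %zu MB\n", sm_file_size(16) >> 20);  /* 512 */
        return 0;
    }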
>>>>>>> >>>>
>>>>>>> >>>> Another issue is possibly how small a backing file you can get
>>>>>>> >>>> away with. That is, just forcing the file to be smaller may not
>>>>>>> >>>> be enough since your job may no longer run. The backing file
>>>>>>> >>>> seems to be used mainly by:
>>>>>>> >>>>
>>>>>>> >>>> *) eager-fragment free lists: We start with enough eager
>>>>>>> >>>> fragments so that we could have two per connection. So, you
>>>>>>> >>>> could bump the sm eager size down if you need to shoehorn a job
>>>>>>> >>>> into a very small backing file.
>>>>>>> >>>>
>>>>>>> >>>> *) large-fragment free lists: We start with 8*n large
>>>>>>> >>>> fragments. If this term plagues you, you can bump the sm chunk
>>>>>>> >>>> size down or reduce the value of 8 (using btl_sm_free_list_num, I
>>>>>>> >>>> think).
>>>>>>> >>>>
>>>>>>> >>>> *) FIFOs: The code tries to align a number of things on pagesize
>>>>>>> >>>> boundaries, so you end up with about 3*n*n*pagesize overhead
>>>>>>> >>>> here. If this term is causing you problems, you're stuck (unless
>>>>>>> >>>> you modify OMPI).
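
Rough numbers for those three terms (back-of-the-envelope only; the
fragment sizes below are assumptions, and the eager count assumes two
fragments per directed connection among the n local procs):

    #include <stdio.h>

    /* Very rough lower bound on the backing file a job needs. */
    static size_t sm_min_footprint(size_t n, size_t eager_sz,
                                   size_t chunk_sz, size_t pagesz)
    {
        size_t eager = 2 * n * (n - 1) * eager_sz;  /* 2 per connection   */
        size_t large = 8 * n * chunk_sz;            /* 8*n large frags    */
        size_t fifos = 3 * n * n * pagesz;          /* page-aligned FIFOs */
        return eager + large + fifos;
    }

    int main(void)
    {
        /* 16 local procs, 4K eager frags, 32K chunks, 4K pages */
        printf("~%zu MB\n", sm_min_footprint(16, 4096, 32768, 4096) >> 20);
        return 0;
    }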
>>>>>>> >>>>
>>>>>>> >>>> I'm interested in this subject! :^)
>>>>>> >>>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Jeff Squyres
>>>>> >> Cisco Systems
>>>>> >>
>>>> >
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>