
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SM backing file size
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-11-14 09:56:18


On Nov 14, 2008, at 7:41 AM, Richard Graham wrote:

> Just a few comments:
> - Not sure what sort of alternative memory approach is being
>   considered. The current approach was selected for two reasons:
>   - If something like anonymous memory is being used, one can only
>     inherit access to the shared memory, so one process needs to set
>     up the shared memory regions and then fork() the procs that will
>     use it. Doing this portably usually implies it has to happen
>     inside of MPI_Init(), so up to that stage only one process runs
>     on each host. Also, unrelated procs can't access this memory, so
>     it can't be used in the context of Fault Tolerance.
>   - The approach used here is very efficient for small systems, so
>     alternatives should be added to what is in place here, so we
>     don't lose the performance potential on small SMPs, which still
>     describe the vast majority of systems.
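
For illustration, here is a minimal standalone sketch (not OMPI code) of
the constraint Rich describes above: an anonymous shared mapping has no
name that another process could use, so only processes forked after the
mmap() can see it.

    /* Anonymous shared memory is inherited across fork(), but an
     * unrelated process has no handle it could attach to. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* MAP_ANONYMOUS | MAP_SHARED: no backing file on /tmp at all */
        int *flag = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (flag == MAP_FAILED) { perror("mmap"); return 1; }
        *flag = 0;

        if (fork() == 0) {      /* the child inherits the mapping */
            *flag = 42;
            _exit(0);
        }
        wait(NULL);
        printf("parent sees %d\n", *flag);  /* prints 42 */
        return 0;
    }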

I concur - however, note that the segv occurred on a 4ppn system,
which I think we would all agree constitutes a small SMP. I believe
that the alternative memory approach needs to be a separate component,
but I also believe that we need to modify the existing component so it
doesn't segv if adequate memory isn't found.

Just my $0.02

>
>
> Rich
>
>
> On 11/14/08 9:22 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>
>> Ok. Should be pretty easy to test/simulate to figure out what's
>> going on -- e.g., whether it's segv'ing in MPI_INIT or the first
>> MPI_SEND.
>>
>>
>> On Nov 14, 2008, at 9:19 AM, Ralph Castain wrote:
>>
>> > Until we do complete the switch, and for systems that do not
>> > support the alternate type of shared memory (I believe it is only
>> > Linux?), I truly believe we should do something nicer than segv.
>> >
>> > Just to clarify: I know the segv case was done with paffinity set,
>> > and believe both cases were done that way. In the first case, I was
>> > told that the segv hit when they did MPI_Send, but I did not
>> > personally verify that claim - it could be that it hit during
>> > maffinity binding if, as you suggest, we actually touch the page at
>> > that time.
>> >
>> > Ralph
>> >
>> >
>> >
>> > On Nov 14, 2008, at 7:07 AM, Jeff Squyres wrote:
>> >
>> >> It's been a looooong time since I've looked at the sm code; Eugene
>> >> has looked at it much more in-depth recently than I have. But I'm
>> >> guessing we *haven't* checked this stuff to abort nicely in such
>> >> error conditions. We might very well succeed in the mmap but then
>> >> segv later when the memory isn't actually available. Perhaps we
>> >> should try to touch every page first to ensure that it's actually
>> >> there...? (I'm pretty sure we do this when using paffinity, to
>> >> ensure that maffinity binds memory to processors -- perhaps we're
>> >> not doing that in the !paffinity case?)
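
A related, hedged sketch of how this failure could be caught up front:
rather than touching pages after the fact, pre-allocate the backing
file before mmap() so that an undersized /tmp shows up as ENOSPC at
startup instead of a fault on the first send. This is illustrative
only; map_backing_file and the path are hypothetical, not the actual
OMPI code path.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical helper: create, pre-allocate, and map a backing file. */
    static void *map_backing_file(const char *path, size_t len)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) return NULL;

        /* Reserve the blocks now; this fails cleanly if the filesystem
         * holding the file is too small to ever provide them. */
        int rc = posix_fallocate(fd, 0, (off_t)len);
        if (rc != 0) {
            fprintf(stderr, "cannot allocate %zu bytes: %s\n",
                    len, strerror(rc));
            close(fd);
            return NULL;
        }

        void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        close(fd);              /* the mapping stays valid after close */
        return base == MAP_FAILED ? NULL : base;
    }

    int main(void)
    {
        void *sm = map_backing_file("/tmp/sm_backing_demo", 8u << 20);
        printf(sm ? "mapped 8 MB OK\n" : "mapping failed\n");
        return sm ? 0 : 1;
    }
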
>> >>
>> >> Additionally, another solution might well be what Tim has long
>> >> advocated: switch to the other type of shared memory on systems
>> >> that support auto-pruning it when all processes die, and/or have
>> >> the orted kill it when all processes die. Then a) we're not
>> >> dependent on the filesystem free space, and b) we're not writing
>> >> all the dirty pages to disk when the processes exit.
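
For what it's worth, here is a minimal sketch of the auto-pruning
behavior that suggestion relies on, assuming the "other type" means
System V shared memory (again illustrative, not OMPI code): once the
segment is marked with IPC_RMID, the kernel frees it when the last
attached process goes away, so nothing is left on the filesystem and no
dirty pages get flushed to disk at exit.

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* Create a 1 MB segment. */
        int id = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        void *base = shmat(id, NULL, 0);
        if (base == (void *)-1) { perror("shmat"); return 1; }

        /* Mark for removal now: the segment is destroyed automatically
         * once the last process detaches, including on abnormal exit. */
        shmctl(id, IPC_RMID, NULL);

        /* ... use 'base' as the shared area ... */
        shmdt(base);
        return 0;
    }
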
>> >>
>> >>
>> >>
>> >> On Nov 14, 2008, at 8:42 AM, Ralph Castain wrote:
>> >>
>> >>> Hi Eugene
>> >>>
>> >>> I too am interested - I think we need to do something about the
>> >>> sm backing file situation as larger core machines are slated to
>> >>> become more prevalent shortly.
>> >>>
>> >>> I appreciate your info on the sizes and controls. One other
>> >>> question: what happens when there isn't enough memory to support
>> >>> all this? Are we smart enough to detect this situation? Does the
>> >>> sm subsystem quietly shut down? Warn and shut down? Segfault?
>> >>>
>> >>> I have two examples so far:
>> >>>
>> >>> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a
>> >>> single node, 2ppn, with btl=openib,sm,self. The program started,
>> >>> but segfaulted on the first MPI_Send. No warnings were printed.
>> >>>
>> >>> 2. again with a ramdisk, /tmp was reportedly set to 16MB
>> >>> (unverified - some uncertainty, it could have been much larger).
>> >>> OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
>> >>> The program ran to completion without errors or warnings. I don't
>> >>> know the communication pattern - it could be that no local comm
>> >>> was performed, though that sounds doubtful.
>> >>>
>> >>> If someone doesn't know, I'll have to dig into the code and
>> >>> figure out the response - just hoping that someone can spare me
>> >>> the pain.
>> >>>
>> >>> Thanks
>> >>> Ralph
>> >>>
>> >>>
>> >>> On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
>> >>>
>> >>>> Ralph Castain wrote:
>> >>>>
>> >>>>> As has frequently been commented upon at one time or another,
>> >>>>> the shared memory backing file can be quite huge. There used
>> >>>>> to be a param for controlling this size, but I can't find it
>> >>>>> in 1.3 - or at least, the name or method for controlling file
>> >>>>> size has morphed into something I don't recognize.
>> >>>>>
>> >>>>> Can someone more familiar with that subsystem point me to one
>> >>>>> or more params that will allow us to control the size of that
>> >>>>> file? It is swamping our systems and causing OMPI to segfault.
>> >>>>
>> >>>> Sounds like you've already gotten your answers, but I'll add my
>> >>>> $0.02 anyhow.
>> >>>>
>> >>>> The file size is the number of local processes (call it n) times
>> >>>> mpool_sm_per_peer_size (default 32M), but with a minimum of
>> >>>> mpool_sm_min_size (default 128M) and a maximum of
>> >>>> mpool_sm_max_size (default 2G? 256M?). So, you can tweak those
>> >>>> parameters to control file size.
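
In code form, my reading of that sizing rule (the maximum is passed in
as a parameter since its default above is uncertain; the example uses
256M only to show the cap taking effect):

    #include <stdio.h>

    /* file size = clamp(n_local * per_peer, min_size, max_size) */
    static unsigned long long sm_file_size(unsigned long long n_local,
                                           unsigned long long per_peer,
                                           unsigned long long min_size,
                                           unsigned long long max_size)
    {
        unsigned long long size = n_local * per_peer;
        if (size < min_size) size = min_size;
        if (size > max_size) size = max_size;
        return size;
    }

    int main(void)
    {
        const unsigned long long MB = 1024ULL * 1024ULL;
        /* 2 ppn: 2 x 32M = 64M, raised to the 128M floor. */
        printf("2 ppn  -> %llu MB\n",
               sm_file_size(2, 32 * MB, 128 * MB, 256 * MB) / MB);
        /* 16 ppn: 16 x 32M = 512M, capped if the maximum is 256M. */
        printf("16 ppn -> %llu MB\n",
               sm_file_size(16, 32 * MB, 128 * MB, 256 * MB) / MB);
        return 0;
    }
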
>> >>>>
>> >>>> Another issue is possibly how small a backing file you can get
>> >>>> away with. That is, just forcing the file to be smaller may not
>> >>>> be enough since your job may no longer run. The backing file
>> >>>> seems to be used mainly by:
>> >>>>
>> >>>> *) eager-fragment free lists: We start with enough eager
>> >>>> fragments so that we could have two per connection. So, you
>> >>>> could bump the sm eager size down if you need to shoehorn a job
>> >>>> into a very small backing file.
>> >>>>
>> >>>> *) large-fragment free lists: We start with 8*n large
>> >>>> fragments. If this term plagues you, you can bump the sm chunk
>> >>>> size down or reduce the value of 8 (using btl_sm_free_list_num,
>> >>>> I think).
>> >>>>
>> >>>> *) FIFOs: The code tries to align a number of things on
>> >>>> pagesize boundaries, so you end up with about 3*n*n*pagesize
>> >>>> overhead here. If this term is causing you problems, you're
>> >>>> stuck (unless you modify OMPI).
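
And a rough back-of-the-envelope version of those three consumers (my
own arithmetic, not taken from the code; the fragment sizes below are
placeholders to be replaced with the actual eager/chunk MCA values):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long n        = 16;     /* local processes */
        unsigned long long eager    = 4096;   /* eager fragment size (placeholder) */
        unsigned long long chunk    = 32768;  /* chunk / large fragment size (placeholder) */
        unsigned long long pagesize = 4096;

        /* Two eager fragments per connection, taking n*(n-1) directed
         * connections here as an assumption. */
        unsigned long long eager_term = 2 * n * (n - 1) * eager;
        /* 8*n large fragments (the 8 being btl_sm_free_list_num). */
        unsigned long long large_term = 8 * n * chunk;
        /* ~3*n*n*pagesize of page-aligned FIFO overhead. */
        unsigned long long fifo_term  = 3 * n * n * pagesize;

        printf("eager %llu + large %llu + fifos %llu = %llu bytes\n",
               eager_term, large_term, fifo_term,
               eager_term + large_term + fifo_term);
        return 0;
    }
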
>> >>>>
>> >>>> I'm interested in this subject! :^)
>> >>
>> >>
>> >> --
>> >> Jeff Squyres
>> >> Cisco Systems
>> >>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel