Until we complete the switch, and for systems that do not support
the alternate type of shared memory (I believe only Linux supports
it?), I truly believe we should do something nicer than segv.
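As a rough illustration of "nicer than segv", here is a minimal sketch that reserves the backing file's space up front and fails with a warning if the filesystem cannot hold it. This is not the existing sm code: reserve_backing_file is a hypothetical helper, the message text is made up, and it assumes a platform that provides posix_fallocate().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not an OMPI function): create the backing file and
 * reserve `size` bytes before anyone mmaps it. Returns an open fd on
 * success, or -1 after printing a warning, so the caller can disable sm
 * instead of crashing later on the first touched page. */
static int reserve_backing_file(const char *path, off_t size)
{
    assert(path != NULL && size > 0);

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        fprintf(stderr, "sm: cannot create %s: %s\n", path, strerror(errno));
        return -1;
    }

    /* posix_fallocate() returns the error number directly (e.g. ENOSPC);
     * it does not set errno. */
    int rc = posix_fallocate(fd, 0, size);
    if (rc != 0) {
        fprintf(stderr, "sm: cannot reserve %lld bytes in %s: %s\n",
                (long long)size, path, strerror(rc));
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}
```

A caller would check for -1 and fall back to another BTL (or abort with a readable message) rather than proceeding to mmap.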
Just to clarify: I know the segv case was done with paffinity set, and
believe both cases were done that way. In the first case, I was told
that the segv hit when they did MPI_Send, but I did not personally
verify that claim - it could be that it hit during maffinity binding
if, as you suggest, we actually touch the page at that time.
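To put numbers on the two ramdisk cases quoted below, here is a minimal sketch of the sizing rule as Eugene describes it: n times the per-peer size, clamped between a minimum and a maximum, plus his rough 3*n*n*pagesize estimate for the FIFOs. The function names are hypothetical and the values passed in are whatever the mpool_sm_* parameters are actually set to; the default maximum is uncertain in the thread, so it is left as an argument.

```c
#include <assert.h>

/* Hypothetical helper: backing file size per the rule quoted in this
 * thread. All sizes are in bytes. */
static unsigned long long sm_backing_file_size(unsigned long long n,
                                               unsigned long long per_peer,
                                               unsigned long long min_size,
                                               unsigned long long max_size)
{
    assert(min_size <= max_size);

    unsigned long long size = n * per_peer;
    if (size < min_size) {
        size = min_size;   /* never smaller than the minimum */
    }
    if (size > max_size) {
        size = max_size;   /* never larger than the maximum */
    }
    return size;
}

/* Hypothetical helper: approximate FIFO overhead, using the thread's
 * estimate of about 3*n*n*pagesize. */
static unsigned long long sm_fifo_overhead(unsigned long long n,
                                           unsigned long long pagesize)
{
    return 3ULL * n * n * pagesize;
}
```

If the quoted defaults are right (32M per peer, 128M minimum), 2 local processes would ask for 128M and 16 would ask for 512M, both far larger than a 10MB or 16MB /tmp. With 16 local processes and 4K pages the FIFO term alone comes to about 3M.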
On Nov 14, 2008, at 7:07 AM, Jeff Squyres wrote:
> It's been a looooong time since I've looked at the sm code; Eugene
> has looked at it much more in-depth recently than I have. But I'm
> guessing we *haven't* checked this stuff to abort nicely in such
> error conditions. We might very well succeed in the mmap but then
> segv later when the memory isn't actually available. Perhaps we
> should try to touch every page first to ensure that it's actually
> there...? (I'm pretty sure we do this when using paffinity, to
> ensure that maffinity binds memory to processors -- perhaps we're
> not doing that in the !paffinity case?)
> Additionally, another solution might well be what Tim has long
> advocated: switch to the other type of shared memory on systems that
> support auto-pruning it when all processes die, and/or have the
> orted kill it when all processes die. Then a) we're not dependent
> on the filesystem free space, and b) we're not writing all the dirty
> pages to disk when the processes exit.
> On Nov 14, 2008, at 8:42 AM, Ralph Castain wrote:
>> Hi Eugene
>> I too am interested - I think we need to do something about the sm
>> backing file situation as larger core machines are slated to become
>> more prevalent shortly.
>> I appreciate your info on the sizes and controls. One other
>> question: what happens when there isn't enough memory to support
>> all this? Are we smart enough to detect this situation? Does the sm
>> subsystem quietly shut down? Warn and shut down? Segfault?
>> I have two examples so far:
>> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
>> node, 2ppn, with btl=openib,sm,self. The program started, but
>> segfaulted on the first MPI_Send. No warnings were printed.
>> 2. again with a ramdisk, /tmp was reportedly set to 16MB
>> (unverified - some uncertainty, could have been much larger).
>> OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self. The
>> program ran to completion without errors or warnings. I don't know
>> the communication pattern - could be no local comm was performed,
>> though that sounds doubtful.
>> If someone doesn't know, I'll have to dig into the code and figure
>> out the response - just hoping that someone can spare me the pain.
>> On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
>>> Ralph Castain wrote:
>>>> As has frequently been commented upon at one time or another,
>>>> the shared memory backing file can be quite huge. There used to
>>>> be a param for controlling this size, but I can't find it in 1.3
>>>> - or at least, the name or method for controlling file size has
>>>> morphed into something I don't recognize.
>>>> Can someone more familiar with that subsystem point me to one or
>>>> more params that will allow us to control the size of that file?
>>>> It is swamping our systems and causing OMPI to segfault.
>>> Sounds like you've already gotten your answers, but I'll add my
>>> $0.02 anyhow.
>>> The file size is the number of local processes (call it n) times
>>> mpool_sm_per_peer_size (default 32M), but with a minimum of
>>> mpool_sm_min_size (default 128M) and a maximum of
>>> mpool_sm_max_size (default 2G? 256M?). So, you can tweak those
>>> parameters to control file size.
>>> Another issue is possibly how small a backing file you can get
>>> away with. That is, just forcing the file to be smaller may not
>>> be enough since your job may no longer run. The backing file
>>> seems to be used mainly by:
>>> *) eager-fragment free lists: We start with enough eager
>>> fragments so that we could have two per connection. So, you could
>>> bump the sm eager size down if you need to shoehorn a job into a
>>> very small backing file.
>>> *) large-fragment free lists: We start with 8*n large fragments.
>>> If this term plagues you, you can bump the sm chunk size down or
>>> reduce the value of 8 (using btl_sm_free_list_num, I think).
>>> *) FIFOs: The code tries to align a number of things on pagesize
>>> boundaries, so you end up with about 3*n*n*pagesize overhead
>>> here. If this term is causing you problems, you're stuck (unless
>>> you modify OMPI).
>>> I'm interested in this subject! :^)
>>> devel mailing list
> Jeff Squyres
> Cisco Systems