It's been a looooong time since I've looked at the sm code; Eugene has
looked at it much more in-depth recently than I have. But I'm
guessing we *haven't* checked this stuff to abort nicely in such error
conditions. We might very well succeed in the mmap but then segv
later when the memory isn't actually available. Perhaps we should try
to touch every page first to ensure that it's actually there...? (I'm
pretty sure we do this when using paffinity, so that maffinity
binds memory to processors -- perhaps we're not doing that in the !
Additionally, another solution might well be what Tim has long
advocated: switch to the other type of shared memory on systems that
support auto-pruning it when all processes die, and/or have the orted
kill it when all processes die. Then a) we're not dependent on the
filesystem free space, and b) we're not writing all the dirty pages to
disk when the processes exit.
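
To make the touch-every-page idea concrete, here is a minimal sketch in Python (not the actual sm code, which is C). The helper name map_and_touch is made up for illustration. It creates a sparse backing file, mmaps it, and writes one byte per page so that the filesystem has to allocate every page up front. One caveat: on a full filesystem the failing write typically arrives as a signal (SIGBUS on Linux tmpfs) rather than a catchable error, so this moves the failure to startup but does not by itself turn it into a clean abort.

```python
import mmap
import os


def map_and_touch(path, size):
    """Create a `size`-byte backing file, mmap it, and write one byte per page.

    Hypothetical helper for illustration only; not part of Open MPI.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # ftruncate only sets the length; the file is sparse, so no
        # storage has been reserved yet.
        os.ftruncate(fd, size)
        # The mapping can succeed even if the filesystem cannot back it.
        buf = mmap.mmap(fd, size)
    finally:
        # Python's mmap keeps its own duplicate of the descriptor.
        os.close(fd)

    # Touch every page now, so a shortage shows up at startup
    # instead of at the first send.
    for offset in range(0, size, mmap.PAGESIZE):
        buf[offset] = 0
    return buf
```

If a catchable error is wanted instead, reserving the space before mapping (for example with os.posix_fallocate(), which raises OSError when the space is not there) is a different technique from the one suggested above, but may be worth considering.
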
On Nov 14, 2008, at 8:42 AM, Ralph Castain wrote:
> Hi Eugene
> I too am interested - I think we need to do something about the sm
> backing file situation as larger core machines are slated to become
> more prevalent shortly.
> I appreciate your info on the sizes and controls. One other
> question: what happens when there isn't enough memory to support all
> this? Are we smart enough to detect this situation? Does the sm
> subsystem quietly shut down? Warn and shut down? Segfault?
> I have two examples so far:
> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
> node, 2ppn, with btl=openib,sm,self. The program started, but
> segfaulted on the first MPI_Send. No warnings were printed.
> 2. again with a ramdisk, /tmp was reportedly set to 16MB (unverified
> - some uncertainty, could have been much larger). OMPI was run on
> multiple nodes, 16ppn, with btl=openib,sm,self. The program ran to
> completion without errors or warning. I don't know the communication
> pattern - could be no local comm was performed, though that sounds
> If someone doesn't know, I'll have to dig into the code and figure
> out the response - just hoping that someone can spare me the pain.
> On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
>> Ralph Castain wrote:
>>> As has frequently been commented upon at one time or another, the
>>> shared memory backing file can be quite huge. There used to be a
>>> param for controlling this size, but I can't find it in 1.3 - or
>>> at least, the name or method for controlling file size has
>>> morphed into something I don't recognize.
>>> Can someone more familiar with that subsystem point me to one or
>>> more params that will allow us to control the size of that file?
>>> It is swamping our systems and causing OMPI to segfault.
>> Sounds like you've already gotten your answers, but I'll add my
>> $0.02 anyhow.
>> The file size is the number of local processes (call it n) times
>> mpool_sm_per_peer_size (default 32M), but with a minimum of
>> mpool_sm_min_size (default 128M) and a maximum of mpool_sm_max_size
>> (default 2G? 256M?). So, you can tweak those parameters to
>> control file size.
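
The sizing rule just described can be written out as a small calculation. This is a sketch of the rule as stated in the message, not of the mpool code itself. It assumes "32M" and "128M" mean MiB, and because the default for mpool_sm_max_size is uncertain in the text, max_size is a required argument rather than a guessed default. The function name is hypothetical.

```python
MiB = 1024 * 1024


def backing_file_size(n_local, max_size, per_peer=32 * MiB, min_size=128 * MiB):
    """Backing file size in bytes for n_local local processes.

    n_local * per_peer, clamped to the range [min_size, max_size].
    per_peer, min_size, max_size stand in for mpool_sm_per_peer_size,
    mpool_sm_min_size and mpool_sm_max_size (units assumed to be MiB).
    """
    return max(min_size, min(n_local * per_peer, max_size))


if __name__ == "__main__":
    for n in (2, 16):
        for cap in (256 * MiB, 2048 * MiB):
            size = backing_file_size(n, max_size=cap)
            print(f"n={n:3d} max={cap // MiB:5d}M -> {size // MiB}M")
```

If the rule is as stated, a 2-process job gets the 128M minimum regardless of the cap, and a 16-process job gets 512M under a 2G cap or 256M under a 256M cap.
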
>> Another issue is possibly how small a backing file you can get away
>> with. That is, just forcing the file to be smaller may not be
>> enough since your job may no longer run. The backing file seems to
>> be used mainly by:
>> *) eager-fragment free lists: We start with enough eager fragments
>> so that we could have two per connection. So, you could bump the
>> sm eager size down if you need to shoehorn a job into a very small
>> backing file.
>> *) large-fragment free lists: We start with 8*n large fragments.
>> If this term plagues you, you can bump the sm chunk size down or
>> reduce the value of 8 (using btl_sm_free_list_num, I think).
>> *) FIFOs: The code tries to align a number of things on pagesize
>> boundaries, so you end up with about 3*n*n*pagesize overhead here.
>> If this term is causing you problems, you're stuck (unless you
>> modify OMPI).
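
Two of the terms in the list above are easy to put numbers on. The sketch below is only a rough estimate under stated assumptions: it takes the FIFO overhead as exactly 3*n*n*pagesize, and it assumes each large fragment occupies about one chunk (ignoring any per-fragment header). The function names, the 4096-byte page, and the 32 KiB chunk size in the example are illustrative values, not confirmed defaults. The eager-fragment term is left out because the text does not say how connections are counted.

```python
def fifo_overhead(n_local, page_size):
    """Approximate FIFO overhead in bytes: about 3 * n * n * pagesize."""
    return 3 * n_local * n_local * page_size


def large_fragment_bytes(n_local, chunk_size, free_list_num=8):
    """Approximate initial large-fragment footprint in bytes.

    Assumes free_list_num * n fragments of roughly chunk_size each;
    free_list_num stands in for the "8" mentioned in the text.
    """
    return free_list_num * n_local * chunk_size


if __name__ == "__main__":
    page = 4096          # assumed page size
    chunk = 32 * 1024    # hypothetical chunk size, for illustration only
    for n in (2, 16):
        print(f"n={n:3d} fifos={fifo_overhead(n, page)} bytes "
              f"large={large_fragment_bytes(n, chunk)} bytes")
```

Because the FIFO term grows as n squared while the large-fragment term grows linearly, the FIFO overhead is the one to watch as core counts rise.
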
>> I'm interested in this subject! :^)
>> devel mailing list