Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM backing file size
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-11-14 09:14:59


On Nov 14, 2008, at 7:00 AM, Tim Mattox wrote:

> Ralph,
> Are these systems running Linux? If so, the long-term solution is to
> finish ticket #1320 (https://svn.open-mpi.org/trac/ompi/ticket/1320),
> which would eliminate the sm backing files entirely, without needing
> to reduce the size of the shared memory that is used. For systems
> where /tmp is a ramdisk, the current scheme is very wasteful (less
> so if you are using tmpfs).

I agree - I think this needs to be bumped up in priority. I'm willing
to help, if that would be useful.

>
>
> What kind of ramdisk are you using? If you are not using tmpfs, you
> should consider switching to tmpfs, since it allows you to have an
> arbitrarily large /tmp yet only consumes as much RAM as the files
> actually stored in /tmp. See this for a good howto/intro:
> http://www.ibm.com/developerworks/library/l-fs3.html

I honestly don't know, and have no control over how it is set up... nor
any influence in that regard! :-)

>
>
> On Fri, Nov 14, 2008 at 8:42 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>> Hi Eugene
>>
>> I too am interested - I think we need to do something about the sm
>> backing file situation, as machines with larger core counts are slated
>> to become more prevalent shortly.
>>
>> I appreciate your info on the sizes and controls. One other question:
>> what happens when there isn't enough memory to support all this? Are we
>> smart enough to detect this situation? Does the sm subsystem quietly
>> shut down? Warn and shut down? Segfault? (A detection sketch follows
>> below.)
>>
>> I have two examples so far:
>>
>> 1. Using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
>> node, 2ppn, with btl=openib,sm,self. The program started, but
>> segfaulted on the first MPI_Send. No warnings were printed.
>>
>> 2. Again with a ramdisk, /tmp was reportedly set to 16MB (unverified -
>> some uncertainty, it could have been much larger). OMPI was run on
>> multiple nodes, 16ppn, with btl=openib,sm,self. The program ran to
>> completion without errors or warnings. I don't know the communication
>> pattern - it could be that no local comm was performed, though that
>> sounds doubtful.
>>
>> If no one knows, I'll have to dig into the code and figure out the
>> response - just hoping that someone can spare me the pain.
>>
>> Thanks
>> Ralph
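
Regarding the "are we smart enough to detect this?" question above: a
pre-flight free-space check on the directory that will hold the backing
file would at least turn case 1 into a clean warning instead of a
segfault. This is only a sketch, not what the current sm code does; the
check_tmp_space() helper and the 64MB figure are made up for
illustration.

/* Sketch only: a pre-flight check of the kind the sm subsystem could do
 * before creating its backing file.  Not the current OMPI behavior; the
 * 64MB "needed" value in main() is a made-up placeholder. */
#include <stdio.h>
#include <sys/statvfs.h>

int check_tmp_space(const char *dir, unsigned long long needed_bytes)
{
    struct statvfs vfs;

    if (statvfs(dir, &vfs) != 0) {
        perror("statvfs");
        return -1;
    }

    /* Bytes available to unprivileged processes on the filesystem
     * backing 'dir' (a ramdisk/tmpfs in the cases above). */
    unsigned long long avail =
        (unsigned long long) vfs.f_bavail * vfs.f_frsize;

    if (avail < needed_bytes) {
        fprintf(stderr,
                "sm backing file needs %llu bytes but %s has only %llu;"
                " warn (or disable sm) instead of segfaulting\n",
                needed_bytes, dir, avail);
        return -1;
    }
    return 0;
}

int main(void)
{
    /* Example: would a 64MB backing file fit in /tmp? */
    return check_tmp_space("/tmp", 64ULL * 1024 * 1024) == 0 ? 0 : 1;
}

Since statvfs() reports the space on whatever filesystem actually backs
the directory, such a check would behave the same whether /tmp is a
ramdisk, tmpfs, or an ordinary disk.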
>>
>>
>> On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
>>
>>> Ralph Castain wrote:
>>>
>>>> As has frequently been commented upon at one time or another, the
>>>> shared memory backing file can be quite huge. There used to be a
>>>> param for controlling this size, but I can't find it in 1.3 - or at
>>>> least, the name or method for controlling file size has morphed into
>>>> something I don't recognize.
>>>>
>>>> Can someone more familiar with that subsystem point me to one or more
>>>> params that will allow us to control the size of that file? It is
>>>> swamping our systems and causing OMPI to segfault.
>>>
>>> Sounds like you've already gotten your answers, but I'll add my $0.02
>>> anyhow.
>>>
>>> The file size is the number of local processes (call it n) times
>>> mpool_sm_per_peer_size (default 32M), but with a minimum of
>>> mpool_sm_min_size (default 128M) and a maximum of mpool_sm_max_size
>>> (default 2G? 256M?). So, you can tweak those parameters to control
>>> file size.
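
A minimal sketch of that sizing rule (not the actual mpool_sm code; the
per-peer and minimum values are the defaults quoted above, and the
maximum is left as a parameter since the default is uncertain):

/* Rough sketch of the sizing rule described above -- not the actual
 * mpool_sm code.  The per-peer and minimum defaults are the ones quoted
 * in the message; the real default max is uncertain (2G? 256M?), so it
 * is simply passed in. */
#include <stdio.h>

static long long sm_file_size(int nprocs_local,
                              long long per_peer,  /* mpool_sm_per_peer_size */
                              long long min_size,  /* mpool_sm_min_size      */
                              long long max_size)  /* mpool_sm_max_size      */
{
    long long size = (long long) nprocs_local * per_peer;

    if (size < min_size) size = min_size;  /* never below the minimum */
    if (size > max_size) size = max_size;  /* never above the maximum */
    return size;
}

int main(void)
{
    const long long M = 1024 * 1024;

    /* 2 local processes: 2 * 32M = 64M, bumped up to the 128M minimum. */
    printf("n=2:  %lld MB\n", sm_file_size(2, 32 * M, 128 * M, 256 * M) / M);

    /* 16 local processes: 16 * 32M = 512M, clamped to the maximum
     * (assuming the default max really is 256M). */
    printf("n=16: %lld MB\n", sm_file_size(16, 32 * M, 128 * M, 256 * M) / M);
    return 0;
}

So with the quoted defaults, a 2-process node still pays for the 128M
minimum, and a 16-process node is governed by whatever the maximum is
rather than by 16 * 32M = 512M.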
>>>
>>> Another possible issue is how small a backing file you can get away
>>> with. That is, just forcing the file to be smaller may not be enough,
>>> since your job may no longer run. The backing file seems to be used
>>> mainly by:
>>>
>>> *) eager-fragment free lists: We start with enough eager fragments so
>>> that we could have two per connection. So, you could bump the sm eager
>>> size down if you need to shoehorn a job into a very small backing
>>> file.
>>>
>>> *) large-fragment free lists: We start with 8*n large fragments. If
>>> this term plagues you, you can bump the sm chunk size down or reduce
>>> the value of 8 (using btl_sm_free_list_num, I think).
>>>
>>> *) FIFOs: The code tries to align a number of things on pagesize
>>> boundaries, so you end up with about 3*n*n*pagesize of overhead here.
>>> If this term is causing you problems, you're stuck (unless you modify
>>> OMPI). (These three terms are tallied in the rough sketch below.)
>>>
>>> I'm interested in this subject! :^)
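
To make those three terms concrete, here is a back-of-the-envelope
tally. It is purely illustrative: the 4K eager size, 32K chunk size, and
4K pagesize are placeholders rather than the real btl_sm defaults, and
the on-node connection count is approximated as n*n.

/* Back-of-the-envelope tally of the three contributions listed above.
 * Purely illustrative: the eager/chunk/page sizes are placeholders, not
 * the real btl_sm defaults. */
#include <stdio.h>

int main(void)
{
    const long long page  = 4096;       /* pagesize (assumed 4K)     */
    const long long eager = 4 * 1024;   /* placeholder sm eager size */
    const long long chunk = 32 * 1024;  /* placeholder sm chunk size */

    for (int n = 2; n <= 128; n *= 2) {
        /* two eager fragments per on-node connection (~n*n of them) */
        long long eager_term = 2LL * n * n * eager;
        /* 8*n large fragments */
        long long large_term = 8LL * n * chunk;
        /* ~3*n*n*pagesize of page-aligned FIFO overhead */
        long long fifo_term  = 3LL * n * n * page;

        printf("n=%3d: eager %8lld KB, large %6lld KB, fifo %8lld KB\n",
               n, eager_term / 1024, large_term / 1024, fifo_term / 1024);
    }
    return 0;
}

Within those assumptions, the eager and FIFO terms grow as n squared
while the large-fragment term grows only linearly, which is why the
backing file balloons on nodes with many cores.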
>
>
>
> --
> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
> tmattox_at_[hidden] || timattox_at_[hidden]
> I'm a bright... http://www.the-brights.net/