
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] /dev/shm usage (was: Very poor performance with btlsm...)
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-05-18 10:54:19


I was reminded this morning (by 2 people :-) ) that the sysv shmem stuff was initiated a long time ago as a workaround for many of these same issues (including the potential performance issues).

Sam's work is nearly complete; I think that -- at least on Linux -- the mmap performance issues can go away. The cleanup issues will not go away; it still requires external help to *guarantee* that shared memory IDs are removed after the job has completed.

On May 18, 2010, at 8:45 AM, Jeff Squyres (jsquyres) wrote:

> Ralph and I talked about this on the phone a bit this morning. There are several complicating factors in using /dev/shm (aren't there always? :-) ).
>
> 0. Note that anything in /dev/shm will need to have session-directory-like semantics: there need to be per-user and per-job characteristics (e.g., if the same user launches multiple jobs on the same node, etc.).
>
> 1. It is not necessarily a good idea to put the entire session directory in /dev/shm. It's not just the shared memory files that go in the session directory; a handful of other metadata files go in there as well. Those files don't take up much space, but it still feels wrong to put anything other than shared memory files in there. Worse, checkpoint files and filem files can also go in there -- and those can eat up lots of space (RAM).
>
> 2. /dev/shm may not be configured right, and/or there are possible /dev/shm configurations where you *do* use twice the memory (Ralph cited an example of a nameless organization that had exactly this problem -- we don't know if this was a misconfiguration or whether it was done on purpose for some reason). I don't know if kernel version comes into play here, too (e.g., if older Linux kernel versions did double the memory, or somesuch). So it's not necessarily a slam dunk that you *always* want to do this.
>
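
For point 2, a site could at least verify that /dev/shm really is tmpfs-backed before opting in. A minimal sketch (the helper name is hypothetical; it scans a mounts table in /proc/mounts format):

```shell
# Sketch (helper name is hypothetical): check whether a directory is a
# tmpfs mount point by scanning a mounts table in /proc/mounts format.
# A site could run this before pointing shared memory files at /dev/shm.
is_tmpfs() {
    dir="$1"
    mounts="${2:-/proc/mounts}"   # second arg makes the helper testable
    awk -v d="$dir" '$2 == d && $3 == "tmpfs" { found = 1 } END { exit !found }' "$mounts"
}

# Example: is_tmpfs /dev/shm && echo "/dev/shm is tmpfs-backed"
```
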
> 3. The session directory has "best effort" cleanup at the end of the job:
>
> - MPI jobs (effectively) rm -rf the session directory
> - The orted (effectively) rm -rf's the session directory
>
> But neither of these is *guaranteed* -- for example, if the resource manager kills the job with extreme prejudice, the session directory can be left around. Where possible, ORTE tries to put the session directory in a resource manager job-specific-temp directory so that the resource manager itself whacks the session directory at the end of the job. But this isn't always the case.
>
> So the session directory has 2 levels of attempted cleanup (MPI procs and orted), and sometimes a 3rd (the resource manager).
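
Why a hard kill defeats both levels can be seen in miniature with an EXIT trap (an illustrative shell sketch, not actual orted code):

```shell
#!/bin/sh
# Illustrative sketch (not actual orted code): "best effort" cleanup via
# an EXIT trap. The trap fires on normal exit, but SIGKILL ("extreme
# prejudice") terminates the process before any trap can run, leaving
# the session directory behind for someone else to clean up.
session_dir="${TMPDIR:-/tmp}/openmpi-sessions-demo-$$"
mkdir -p "$session_dir"
trap 'rm -rf "$session_dir"' EXIT
```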
>
> 3a. If the session directory is in /dev/shm, we get the 2 levels but definitely not the 3rd (note: I don't think that putting the session directory in /dev/shm is a good idea, per #1 -- I'm just being complete).
>
> 3b. If the shared memory files are outside the session directory, we don't get any of the additional cleanup without adding some additional infrastructure -- possibly into orte/util/session_dir.* (e.g., add /dev/shm as a secondary session directory root). This would allow us to effect session directory-like semantics inside /dev/shm.
>
> 4. But even with 2 levels of possible cleanup, not having the resource manager cleanup can be quite disastrous if shared memory is left around after a job is forcibly terminated. Sysadmins can do stuff like rm -rf /dev/shm (or whatever) between jobs to guarantee cleanup, but that requires extra steps outside of OMPI.
>
> --> This seems to imply that using /dev/shm should not be default behavior.
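
Those extra steps might look like a resource-manager epilogue along these lines (a sketch; the helper name is made up, and the session directory naming pattern is assumed from the usual /tmp/openmpi-sessions-* layout):

```shell
# Hypothetical resource-manager epilogue sketch: remove a user's
# leftover Open MPI session directories from each candidate root after
# a job ends, guaranteeing the cleanup that OMPI can only attempt.
cleanup_session_dirs() {
    user="$1"
    shift
    for root in "$@"; do
        # Session directories typically look like
        # <root>/openmpi-sessions-<user>@<node>*
        rm -rf "$root/openmpi-sessions-$user"* 2>/dev/null
    done
    return 0
}

# Example epilogue usage: cleanup_session_dirs "$JOB_USER" /tmp /dev/shm
```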
>
> -----
>
> All this being said, it seems like 3b is a reasonable way to go forward: extend orte/util/session_dir.* to allow for multiple session directory roots (somehow -- exact mechanism TBD). Then both the MPI processes and the orted will try to clean up both the real session directory and /dev/shm. Both roots will use the same per-user/per-job characteristics that the session dir already has.
>
> Then we can extend the MCA param orte_tmpdir_base to accept a comma-delimited list of roots. It still defaults to /tmp, but a sysadmin can set it to be /tmp,/dev/shm (or whatever) if they want to use /dev/shm. OMPI will still do "best effort" cleanup of /dev/shm, but it's the sysadmin's responsibility to *guarantee* its cleanup after a job ends, etc.
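
The proposed comma-delimited value might behave roughly like this (a hypothetical sketch of the semantics; the actual parsing would live in OMPI's C code):

```shell
# Hypothetical sketch of the proposed semantics: split a comma-delimited
# orte_tmpdir_base-style value into individual roots, preserving order,
# so that best-effort cleanup can visit each root in turn.
tmpdir_roots() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Example of how a sysadmin might opt in (hypothetical syntax; today
# the param takes a single directory):
#   mpirun --mca orte_tmpdir_base /tmp,/dev/shm ...
```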
>
> Thoughts?
>
>
> On May 18, 2010, at 4:09 AM, Sylvain Jeaugey wrote:
>
> > I would go further on this: when available, putting the session directory
> > in a tmpfs filesystem (e.g. /dev/shm) should give you the maximum
> > performance.
> >
> > Again, when using /dev/shm instead of the local /tmp filesystem, I get a
> > consistent 1-5us latency improvement on a barrier at 32 cores (on a single
> > node). So it may not be noticeable for everyone, but it seems faster in
> > all cases.
> >
> > Sylvain
> >
> > On Mon, 17 May 2010, Paul H. Hargrove wrote:
> >
> > > Entry looks good, but could probably use an additional sentence or two like:
> > >
> > > On diskless nodes running Linux, use of /dev/shm may be an option if
> > > supported by your distribution. This will use an in-memory file system for
> > > the session directory, but will NOT result in a doubling of the memory
> > > consumed for the shared memory file (i.e. file system "blocks" and memory
> > > "pages" share a single instance).
> > >
> > > -Paul
> > >
> > > Jeff Squyres wrote:
> > >> How's this?
> > >>
> > >> http://www.open-mpi.org/faq/?category=sm#poor-sm-btl-performance
> > >>
> > >> What's the advantage of /dev/shm? (I don't know anything about /dev/shm)
> > >>
> > >>
> > >> On May 17, 2010, at 4:08 AM, Sylvain Jeaugey wrote:
> > >>
> > >>
> > >>> I agree with Paul on the fact that a FAQ update would be great on this
> > >>> subject. /dev/shm seems a good place to put the temporary files (when
> > >>> available, of course).
> > >>>
> > >>> Putting files in /dev/shm also showed better performance on our systems,
> > >>> even with /tmp on a local disk.
> > >>>
> > >>> Sylvain
> > >>>
> > >>> On Sun, 16 May 2010, Paul H. Hargrove wrote:
> > >>>
> > >>>
> > >>>> If I google "ompi sm btl performance" the top match is
> > >>>> http://www.open-mpi.org/faq/?category=sm
> > >>>>
> > >>>> I scanned the entire page from top to bottom and don't see any questions
> > >>>> of
> > >>>> the form
> > >>>> Why is SM performance slower than ...?
> > >>>>
> > >>>> The words "NFS", "network", "file system" or "filesystem" appear nowhere
> > >>>> on
> > >>>> the page. The closest I could find is
> > >>>>
> > >>>>> 7. Where is the file that sm will mmap in?
> > >>>>>
> > >>>>> The file will be in the OMPI session directory, which is typically
> > >>>>> something like /tmp/openmpi-sessions-myusername_at_mynodename* . The file
> > >>>>> itself will have the name shared_mem_pool.mynodename. For example, the
> > >>>>> full
> > >>>>> path could be
> > >>>>> /tmp/openmpi-sessions-myusername_at_node0_0/1543/1/shared_mem_pool.node0.
> > >>>>>
> > >>>>> To place the session directory in a non-default location, use the MCA
> > >>>>> parameter orte_tmpdir_base.
> > >>>>>
> > >>>> which says nothing about where one should or should not place the session
> > >>>> directory.
> > >>>>
> > >>>> Not having read the entire FAQ from start to end, I will not contradict
> > >>>> Ralph's claim that the "your SM performance might suck if you put the
> > >>>> session
> > >>>> directory on a remote filesystem" FAQ entry does exist, but I will assert
> > >>>> that I did not find it in the SM section of the FAQ. I tried google on
> > >>>> "ompi
> > >>>> session directory" and "ompi orte_tmpdir_base" and still didn't find
> > >>>> whatever
> > >>>> entry Ralph is talking about. So, I think the average user with no clue
> > >>>> about the relationship between the SM BTL and the session directory would
> > >>>> need some help finding it. Therefore, I still feel an FAQ entry in the
> > >>>> SM
> > >>>> category is warranted, even if it just references whatever entry Ralph is
> > >>>> referring to.
> > >>>>
> > >>>> -Paul
> > >>>>
> > >>>> Ralph Castain wrote:
> > >>>>
> > >>>>> We have had a FAQ on this for a long time...problem is, nobody reads it
> > >>>>> :-/
> > >>>>>
> > >>>>> Glad you found the problem!
> > >>>>>
> > >>>>> On May 14, 2010, at 3:15 PM, Paul H. Hargrove wrote:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> Oskar Enoksson wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>> Christopher Samuel wrote:
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> Subject: Re: [OMPI devel] Very poor performance with btl sm on twin
> > >>>>>>>> nehalem servers with Mellanox ConnectX installed
> > >>>>>>>> To: devel_at_[hidden]
> > >>>>>>>> Message-ID:
> > >>>>>>>> <D45958078CD65C429557B4C5F492B6A60770E51F_at_[hidden]>
> > >>>>>>>> Content-Type: text/plain; charset="iso-8859-1"
> > >>>>>>>>
> > >>>>>>>> On 13/05/10 20:56, Oskar Enoksson wrote:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> The problem is that I get very bad performance unless I
> > >>>>>>>>> explicitly exclude the "sm" btl and I can't figure out why.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>> Recently someone reported issues which were traced back to
> > >>>>>>>> the fact that the files that sm uses for mmap() were in a
> > >>>>>>>> /tmp which was NFS mounted; changing the location where their
> > >>>>>>>> files were kept to another directory with the orte_tmpdir_base
> > >>>>>>>> MCA parameter fixed that issue for them.
> > >>>>>>>>
> > >>>>>>>> Could it be similar for yourself ?
> > >>>>>>>>
> > >>>>>>>> cheers,
> > >>>>>>>> Chris
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>> That was exactly right; as you guessed, these are diskless nodes that
> > >>>>>>> mount the root filesystem over NFS.
> > >>>>>>>
> > >>>>>>> Setting orte_tmpdir_base to /dev/shm and btl_sm_num_fifos=9 and then
> > >>>>>>> running mpi_stress on eight cores measures speeds of 1650MB/s for
> > >>>>>>> 1MB messages and 1600MB/s for 10kB messages.
> > >>>>>>>
> > >>>>>>> Thanks!
> > >>>>>>> /Oskar
> > >>>>>>>
> > >>>>>>> _______________________________________________
> > >>>>>>> devel mailing list
> > >>>>>>> devel_at_[hidden]
> > >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >>>>>>>
> > >>>>>>>
> > >>>>>> Sounds like a new FAQ entry is warranted.
> > >>>>>>
> > >>>>>> -Paul
> > >>>>>>
> > >>>>>> --
> > >>>>>> Paul H. Hargrove PHHargrove_at_[hidden]
> > >>>>>> Future Technologies Group
> > >>>>>> HPC Research Department Tel: +1-510-495-2352
> > >>>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> > >
> > >
> > >
> >
>
>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/