
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Very poor performance with btl sm on twin nehalem servers with Mellanox ConnectX installed
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2010-05-18 04:09:10


I would go further on this: when available, putting the session directory
in a tmpfs filesystem (e.g. /dev/shm) should give you the maximum
performance.

Again, when using /dev/shm instead of the local /tmp filesystem, I get a
consistent 1-5us latency improvement on a barrier at 32 cores (on a single
node). So it may not be noticeable for everyone, but it seems faster in
all cases.
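For reference, a minimal sketch of how one might do this. The MCA parameter
name (orte_tmpdir_base) is the one from the FAQ entry quoted below; the core
count and application name are placeholders:

```shell
# Sketch: route the Open MPI session directory (and thus the sm BTL's
# mmap backing file) into tmpfs. The core count (8) and application
# name (./my_mpi_app) are placeholders.

# First check that /dev/shm really is an in-memory (tmpfs) filesystem:
df -t tmpfs /dev/shm

# Point the session directory at it for a single run:
mpirun --mca orte_tmpdir_base /dev/shm -np 8 ./my_mpi_app

# Or make it the default for every run by this user:
echo "orte_tmpdir_base = /dev/shm" >> ~/.openmpi/mca-params.conf
```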

Sylvain

On Mon, 17 May 2010, Paul H. Hargrove wrote:

> Entry looks good, but could probably use an additional sentence or two like:
>
> On diskless nodes running Linux, use of /dev/shm may be an option if
> supported by your distribution. This will use an in-memory file system for
> the session directory, but will NOT result in a doubling of the memory
> consumed for the shared memory file (i.e. file system "blocks" and memory
> "pages" share a single instance).
>
> -Paul
>
> Jeff Squyres wrote:
>> How's this?
>>
>> http://www.open-mpi.org/faq/?category=sm#poor-sm-btl-performance
>>
>> What's the advantage of /dev/shm? (I don't know anything about /dev/shm)
>>
>>
>> On May 17, 2010, at 4:08 AM, Sylvain Jeaugey wrote:
>>
>>
>>> I agree with Paul on the fact that a FAQ update would be great on this
>>> subject. /dev/shm seems a good place to put the temporary files (when
>>> available, of course).
>>>
>>> Putting files in /dev/shm also showed better performance on our systems,
>>> even with /tmp on a local disk.
>>>
>>> Sylvain
>>>
>>> On Sun, 16 May 2010, Paul H. Hargrove wrote:
>>>
>>>
>>>> If I google "ompi sm btl performance" the top match is
>>>> http://www.open-mpi.org/faq/?category=sm
>>>>
>>>> I scanned the entire page from top to bottom and don't see any questions
>>>> of the form "Why is SM performance slower than ...?"
>>>>
>>>> The words "NFS", "network", "file system" or "filesystem" appear nowhere
>>>> on the page. The closest I could find is
>>>>
>>>>> 7. Where is the file that sm will mmap in?
>>>>>
>>>>> The file will be in the OMPI session directory, which is typically
>>>>> something like /tmp/openmpi-sessions-myusername_at_mynodename* . The file
>>>>> itself will have the name shared_mem_pool.mynodename. For example, the
>>>>> full path could be
>>>>> /tmp/openmpi-sessions-myusername_at_node0_0/1543/1/shared_mem_pool.node0.
>>>>>
>>>>> To place the session directory in a non-default location, use the MCA
>>>>> parameter orte_tmpdir_base.
>>>>>
>>>> which says nothing about where one should or should not place the session
>>>> directory.
>>>>
>>>> Not having read the entire FAQ from start to end, I will not contradict
>>>> Ralph's claim that the "your SM performance might suck if you put the
>>>> session directory on a remote filesystem" FAQ entry does exist, but I
>>>> will assert that I did not find it in the SM section of the FAQ. I tried
>>>> google on "ompi session directory" and "ompi orte_tmpdir_base" and still
>>>> didn't find whatever entry Ralph is talking about. So, I think the
>>>> average user with no clue about the relationship between the SM BTL and
>>>> the session directory would need some help finding it. Therefore, I
>>>> still feel an FAQ entry in the SM category is warranted, even if it just
>>>> references whatever entry Ralph is referring to.
>>>>
>>>> -Paul
>>>>
>>>> Ralph Castain wrote:
>>>>
>>>>> We have had a FAQ on this for a long time...problem is, nobody reads it
>>>>> :-/
>>>>>
>>>>> Glad you found the problem!
>>>>>
>>>>> On May 14, 2010, at 3:15 PM, Paul H. Hargrove wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Oskar Enoksson wrote:
>>>>>>
>>>>>>
>>>>>>> Christopher Samuel wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Subject: Re: [OMPI devel] Very poor performance with btl sm on twin
>>>>>>>> nehalem servers with Mellanox ConnectX installed
>>>>>>>> To: devel_at_[hidden]
>>>>>>>>
>>>>>>>> On 13/05/10 20:56, Oskar Enoksson wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> The problem is that I get very bad performance unless I
>>>>>>>>> explicitly exclude the "sm" btl and I can't figure out why.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Recently someone reported issues which were traced back to
>>>>>>>> the fact that the files that sm uses for mmap() were in a
>>>>>>>> /tmp which was NFS mounted; changing the location where their
>>>>>>>> files were kept to another directory with the orte_tmpdir_base
>>>>>>>> MCA parameter fixed that issue for them.
>>>>>>>>
>>>>>>>> Could it be similar for yourself ?
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>
>>>>>>> That was exactly right; as you guessed, these are diskless nodes that
>>>>>>> mount the root filesystem over NFS.
>>>>>>>
>>>>>>> Setting orte_tmpdir_base to /dev/shm and btl_sm_num_fifos=9 and then
>>>>>>> running mpi_stress on eight cores measures speeds of 1650 MB/s for
>>>>>>> 1 MB messages and 1600 MB/s for 10 kB messages.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> /Oskar
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> devel_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>
>>>>>>>
>>>>>> Sounds like a new FAQ entry is warranted.
>>>>>>
>>>>>> -Paul
>>>>>>
>>>>>> --
>>>>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>>>>> Future Technologies Group
>>>>>> HPC Research Department Tel: +1-510-495-2352
>>>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>
>