
Subject: Re: [OMPI users] RE : RE : Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks
From: Sébastien Boisvert (sebastien.boisvert.3_at_[hidden])
Date: 2011-11-09 12:28:27


Hello,

We ran more latency tests using 512 MPI ranks
on our super-computer (64 machines * 8 cores per machine).

By default in Ray, any rank can communicate directly with any other.

Thus we have a complete graph with 512 vertices and 130816 edges (512*511/2)
where vertices are ranks and edges are communication links.

When a rank sends a message to itself, the route length is 0 edges. Otherwise,
the route length is 1 edge. However, 130816 is a lot of edges.

With this pattern, the average latency in Ray on our super-computer, when
requesting a reply for a message of 4000 bytes, is 386 microseconds
(standard deviation: 9).

Recently, Jeff Squyres highlighted that using such a communication pattern
is not recommended and "there are a bunch of different options you can
pursue."

See http://www.open-mpi.org/community/lists/devel/2011/09/9773.php

By pursuing different options, we reduced the latency to 158 microseconds
(standard deviation: 15), a drop of 59%. To do so, we added a transparent
message router in Ray.

First, a random graph is created with n vertices and n*log2(n)/2 edges
selected at random from the n*(n-1)/2 possible edges. The idea is that,
on average, each rank has a degree of log2(n) instead of n-1.
With 512 ranks, this random graph has 2304 edges (512*9/2),
down from 130816 edges.

Note that this is not a 9-regular graph (not all vertices have a degree of 9),
but the average degree is 9.
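
Roughly, the construction looks like this (a simplified C++ sketch of one way
to do it, not the exact code in Ray; the fixed seed is only there so every
rank would build the same graph):

// Simplified sketch: build a random overlay with n vertices and
// n*log2(n)/2 edges drawn from the n*(n-1)/2 possible edges.
#include <iostream>
#include <random>
#include <set>
#include <utility>
#include <vector>

int main() {
    const int n = 512;                      // number of MPI ranks
    const int degree = 9;                   // log2(512)
    const int targetEdges = n * degree / 2; // 2304 edges

    std::mt19937 rng(42); // same seed on every rank, so all ranks build the same graph
    std::uniform_int_distribution<int> pick(0, n - 1);

    std::set<std::pair<int,int>> edges;
    while ((int)edges.size() < targetEdges) {
        int a = pick(rng);
        int b = pick(rng);
        if (a == b) continue;       // no self-loops
        if (a > b) std::swap(a, b); // store each undirected edge once
        edges.insert({a, b});       // the set rejects duplicate edges
    }

    // Adjacency lists: each rank ends up with log2(n) neighbours on average.
    std::vector<std::vector<int>> adjacency(n);
    for (const auto& e : edges) {
        adjacency[e.first].push_back(e.second);
        adjacency[e.second].push_back(e.first);
    }

    std::cout << "edges: " << edges.size() << std::endl;
    return 0;
}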

Then, shortest routes are computed with Dijkstra's algorithm, modified to
choose the least saturated route when more than one route has the same
length.
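
The tie-breaking can be sketched like this (again a simplification with
assumptions, not the exact code in Ray: "saturation" is taken to be the
number of routes already assigned to an edge, summed along the candidate
route, and the caller is expected to increment the edge loads after each
route is chosen):

// Simplified sketch: unit-weight Dijkstra where ties between equal-length
// routes are broken in favour of the less saturated one.
#include <functional>
#include <limits>
#include <map>
#include <queue>
#include <utility>
#include <vector>

typedef std::map<std::pair<int,int>, int> EdgeLoad; // routes already using each edge

static std::pair<int,int> edgeKey(int a, int b) {
    return a < b ? std::make_pair(a, b) : std::make_pair(b, a);
}

// Returns parent[v]: the predecessor of v on a route from 'source' that is
// shortest first and least saturated second.
std::vector<int> computeRoutes(const std::vector<std::vector<int>>& adjacency,
                               int source, EdgeLoad& load) {
    const int n = adjacency.size();
    const long INF = std::numeric_limits<long>::max();
    std::vector<long> hops(n, INF);       // route length in edges
    std::vector<long> saturation(n, INF); // accumulated edge load along the route
    std::vector<int> parent(n, -1);

    typedef std::pair<std::pair<long,long>, int> State; // ((hops, load), vertex)
    std::priority_queue<State, std::vector<State>, std::greater<State>> queue;

    hops[source] = 0;
    saturation[source] = 0;
    queue.push(State(std::make_pair(0L, 0L), source));

    while (!queue.empty()) {
        State top = queue.top();
        queue.pop();
        int u = top.second;
        if (top.first.first != hops[u] || top.first.second != saturation[u])
            continue; // stale queue entry
        for (int v : adjacency[u]) {
            long h = hops[u] + 1;
            long s = saturation[u] + load[edgeKey(u, v)];
            // A shorter route wins; on equal length, the less saturated one wins.
            if (h < hops[v] || (h == hops[v] && s < saturation[v])) {
                hops[v] = h;
                saturation[v] = s;
                parent[v] = u;
                queue.push(State(std::make_pair(h, s), v));
            }
        }
    }
    return parent;
}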

The route lengths are quite small.

    Frequencies (route length / count / percentage):
         0     512   0.195312%   # a rank sending a message to itself
         1    4608   1.75781%    # sending to a directly connected rank
         2   37644   14.36%
         3  152972   58.3542%    # most of them
         4   65710   25.0664%
         5     698   0.266266%

So my question is:

Does that indicate where the real problem is on our super-computer?

Thanks a lot!

Also, would transparent message routing be easy to implement
directly in Open MPI as a component?

Sébastien Boisvert
http://boisvert.info

On 26/09/11 08:46 AM, Yevgeny Kliteynik wrote:
> On 26-Sep-11 11:27 AM, Yevgeny Kliteynik wrote:
>
>> On 22-Sep-11 12:09 AM, Jeff Squyres wrote:
>>
>>> On Sep 21, 2011, at 4:24 PM, Sébastien Boisvert wrote:
>>>
>>>
>>>>> What happens if you run 2 ibv_rc_pingpong's on each node? Or N ibv_rc_pingpongs?
>>>>>
>>>> With 11 ibv_rc_pingpong's
>>>>
>>>> http://pastebin.com/85sPcA47
>>>>
>>>> Code to do that => https://gist.github.com/1233173
>>>>
>>>> Latencies are around 20 microseconds.
>>>>
>>> This seems to imply that the network is to blame for the higher latency...?
>>>
>> Interesting... I'm getting the same latency with ibv_rc_pingpong.
>> I get 8.5 usec for a single ping-pong.
>>
> BTW, I've just checked this with performance guys - ibv_rc_pingpong
> is not used for performance measurement but only as IB network
> sanity check, therefore it was never meant to give optimal performance.
>
> Use ib_write_lat instead.
>
> -- YK
>
>
>> Please run 'ibclearcounters' to reset fabric counters, then
>> ibdiagnet to make sure that the fabric is clean.
>> If you have 4x QDR cluster, run ibdiagnet as follows:
>>
>> ibdiagnet --ls 10 --lw 4x
>>
>> Check that you don't have any errors/warnings.
>>
>> Then please run your script with ib_write_lat instead of ibv_rc_pingpong.
>> Just replace the command in the script and the rest would be fine.
>>
>> If the fabric is clean, you're supposed to get typical
>> latency of ~1.4 usec.
>>
>> -- YK
>>
>>
>>
>>> I.e., if you run the same pattern with MPI processes and get 20us latency, that would tend to imply that the network itself is not performing well with that IO pattern.
>>>
>>>
>>>> My job seems to do well so far with ofud !
>>>>
>>>> [sboisver12_at_colosse2 ray]$ qstat
>>>> job-ID prior name user state submit/start at queue slots ja-task-ID
>>>> -----------------------------------------------------------------------------------------------------------------
>>>> 3047460 0.55384 fish-Assem sboisver12 r 09/21/2011 15:02:25 med_at_r104-n58 256
>>>>
>>> I would still be suspicious -- ofud is not well tested, and it can definitely hang if there are network drops.
>>>
>>>
>>
>