Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] RE : RE : Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks
From: Sébastien Boisvert (sebastien.boisvert.3_at_[hidden])
Date: 2011-11-09 12:28:27


We ran more latency tests with 512 MPI ranks
on our super-computer (64 machines * 8 cores per machine).

By default in Ray, any rank can communicate directly with any other.

Thus we have a complete graph with 512 vertices and 130816 edges (512*511/2),
where vertices are ranks and edges are communication links.

When a rank sends to itself, the route length is 0 edges; otherwise it is
1 edge. However, 130816 is a lot of edges.

With this pattern, the average latency in Ray when requesting a reply
for a 4000-byte message on our super-computer is 386 microseconds
(standard deviation: 9).

Recently, Jeff Squyres pointed out that such a communication pattern
is not recommended and that "there are a bunch of different options you
can [...]".

By pursuing different options, we reduced the latency to 158 microseconds
(standard deviation: 15), a drop of 59%.
To do so, we added a transparent message router in Ray.

First, a random graph is created with n vertices and n*log2(n)/2 edges
selected at random from the n*(n-1)/2 possible edges. The idea is that,
on average, each rank has a degree of log2(n) instead of n.
With 512 ranks, this random graph has 2304 edges (512*9/2),
down from 130816 edges.
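A minimal sketch of that construction in Python (not Ray's actual C++
implementation; the function name and seed are illustrative):

```python
import math
import random

def build_random_graph(n, seed=0):
    """Build the sparse communication graph described above: n vertices
    and n*log2(n)/2 edges chosen uniformly at random from the n*(n-1)/2
    possible edges, so the average degree is log2(n)."""
    rng = random.Random(seed)
    target = n * int(math.log2(n)) // 2
    possible = [(i, j) for i in range(n) for j in range(i + 1, n)]
    edges = rng.sample(possible, target)        # distinct edges, no self-loops
    adjacency = {v: set() for v in range(n)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency, edges

adjacency, edges = build_random_graph(512)
print(len(edges))                               # 2304 edges for n = 512
print(sum(len(s) for s in adjacency.values()) / 512)   # average degree 9.0
```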

Note that this is not a 9-regular graph (not every vertex has a degree
of 9), but the average degree is 9.

Then, shortest routes are computed with Dijkstra's algorithm, modified
to choose the least saturated route when more than one has the same
length.
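One way to sketch that tie-breaking in Python (hypothetical, assuming
"saturation" means a per-edge count of routes already using that edge;
Ray's real router may define it differently):

```python
import heapq

def route(adjacency, load, source, target):
    # Dijkstra keyed on (hop count, accumulated edge load): among routes
    # with the same number of hops, the one whose edges carry the least
    # load wins. 'load' counts how many routes already use each edge.
    best = {source: (0, 0, None)}          # vertex -> (hops, load, parent)
    heap = [(0, 0, source)]
    while heap:
        hops, sat, v = heapq.heappop(heap)
        if v == target:
            break
        if (hops, sat) > best[v][:2]:      # stale heap entry, skip
            continue
        for w in adjacency[v]:
            edge = (min(v, w), max(v, w))
            cand = (hops + 1, sat + load.get(edge, 0))
            if w not in best or cand < best[w][:2]:
                best[w] = (cand[0], cand[1], v)
                heapq.heappush(heap, (cand[0], cand[1], w))
    # walk parent pointers back to recover the route
    path, v = [], target
    while v is not None:
        path.append(v)
        v = best[v][2]
    path.reverse()
    # record the chosen edges so later calls steer around saturated links
    for a, b in zip(path, path[1:]):
        edge = (min(a, b), max(a, b))
        load[edge] = load.get(edge, 0) + 1
    return path

# Square graph: two equal-length routes from 0 to 3.
adjacency = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
load = {}
print(route(adjacency, load, 0, 3))    # [0, 1, 3]
print(route(adjacency, load, 0, 3))    # [0, 2, 3]: avoids the loaded links
```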

The route lengths are quite small.

  length    count   fraction
       0      512    0.195312%   (a rank sending to itself)
       1     4608    1.75781%    (directly connected ranks)
       2    37644   14.36%
       3   152972   58.3542%    (most routes)
       4    65710   25.0664%
       5      698    0.266266%
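A distribution like the one above can be reproduced for any such random
graph with one BFS per vertex (a sketch; the vertex count and seed below
are arbitrary, so the exact percentages will differ from Ray's 512-rank
figures):

```python
import math
import random
from collections import Counter, deque

def route_length_distribution(adjacency):
    # BFS from every vertex: all edges have the same weight, so BFS
    # yields shortest hop counts over every ordered (source, dest) pair.
    counts = Counter()
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        counts.update(dist.values())
    return counts

# Build a small random graph the same way: n vertices, n*log2(n)/2 edges.
n = 64
rng = random.Random(1)
possible = [(i, j) for i in range(n) for j in range(i + 1, n)]
chosen = rng.sample(possible, n * int(math.log2(n)) // 2)   # 192 edges
adjacency = {v: set() for v in range(n)}
for a, b in chosen:
    adjacency[a].add(b)
    adjacency[b].add(a)

counts = route_length_distribution(adjacency)
total = sum(counts.values())
for length in sorted(counts):
    print(length, counts[length], f"{100 * counts[length] / total:.3f}%")
```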

So my question is:

Does that indicate where the real problem is on our super-computer?

Thanks a lot!

Also, would transparent message routing be easy to implement
directly in Open-MPI as a component ?

Sébastien Boisvert

On 26/09/11 08:46 AM, Yevgeny Kliteynik wrote:
> On 26-Sep-11 11:27 AM, Yevgeny Kliteynik wrote:
>> On 22-Sep-11 12:09 AM, Jeff Squyres wrote:
>>> On Sep 21, 2011, at 4:24 PM, Sébastien Boisvert wrote:
>>>>> What happens if you run 2 ibv_rc_pingpong's on each node? Or N ibv_rc_pingpongs?
>>>> With 11 ibv_rc_pingpong's
>>>> Code to do that =>
>>>> Latencies are around 20 microseconds.
>>> This seems to imply that the network is to blame for the higher latency...?
>> Interesting... I'm getting the same latency with ibv_rc_pingpong.
>> I get 8.5 usec for a single ping-pong.
> BTW, I've just checked this with performance guys - ibv_rc_pingpong
> is not used for performance measurement but only as IB network
> sanity check, therefore it was never meant to give optimal performance.
> Use ib_write_lat instead.
> -- YK
>> Please run 'ibclearcounters' to reset fabric counters, then
>> ibdiagnet to make sure that the fabric is clean.
>> If you have 4x QDR cluster, run ibdiagnet as follows:
>> ibdiagnet --ls 10 --lw 4x
>> Check that you don't have any errors/warnings.
>> Then please run your script with ib_write_lat instead of ibv_rc_pingpong.
>> Just replace the command in the script and the rest would be fine.
>> If the fabric is clean, you're supposed to get typical
>> latency of ~1.4 usec.
>> -- YK
>>> I.e., if you run the same pattern with MPI processes and get 20us latency, that would tend to imply that the network itself is not performing well with that IO pattern.
>>>> My job seems to do well so far with ofud !
>>>> [sboisver12_at_colosse2 ray]$ qstat
>>>> job-ID prior name user state submit/start at queue slots ja-task-ID
>>>> -----------------------------------------------------------------------------------------------------------------
>>>> 3047460 0.55384 fish-Assem sboisver12 r 09/21/2011 15:02:25 med_at_r104-n58 256
>>> I would still be suspicious -- ofud is not well tested, and it can definitely hang if there are network drops.