We ran more latency tests using 512 MPI ranks
on our super-computer (64 machines * 8 cores per machine).
By default in Ray, any rank can communicate directly with any other.
Thus we have a complete graph with 512 vertices and 130816 edges (512*511/2)
where vertices are ranks and edges are communication links.
When a rank sends to itself, the route length is 0 edges. Otherwise, the
length is 1 edge. However, 130816 is a lot of edges.
With this, the average latency when requesting a reply
for a message of 4000 bytes
with Ray on our super-computer is 386 microseconds (standard deviation: 9).
Recently, Jeff Squyres highlighted that using such a communication pattern
is not recommended and "there are a bunch of different options you can
By pursuing different options, we reduced the latency to 158 microseconds
(standard deviation: 15). This is a drop of 59%.
To do so, we added a transparent message router in Ray.
First, a random graph is created with n vertices and n*log2(n)/2 edges
selected randomly from the n*(n-1)/2 possible edges. The idea is that, on
average, any rank has a degree of log2(n)
instead of n-1.
With 512 ranks, this random graph has 2304 edges (512*9/2),
down from 130816 edges.
Note that this is not a 9-regular graph (not all vertices have a degree
of 9), but the average is 9.
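
The construction described above can be sketched in Python as follows.
This is only an illustration, not the code used in Ray: the function names
random_graph and degrees, the seed parameter, and the use of Python's random
module are assumptions made for the example. The sketch does not check that
the resulting graph is connected, which a real router would need.

```python
import itertools
import math
import random


def random_graph(n, seed=0):
    """Pick n*log2(n)/2 edges uniformly at random from the n*(n-1)/2 possible ones."""
    rng = random.Random(seed)
    # Every unordered pair of ranks is a candidate communication link.
    all_edges = list(itertools.combinations(range(n), 2))
    edge_count = int(n * math.log2(n) / 2)
    # sample() draws without replacement, so no edge is picked twice.
    return rng.sample(all_edges, edge_count)


def degrees(n, edges):
    """Count how many links each rank has."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


edges = random_graph(512)
deg = degrees(512, edges)
print("edges:", len(edges))
print("average degree:", sum(deg) / 512)
print("min/max degree:", min(deg), max(deg))
```

The minimum and maximum degrees printed at the end show that the graph is
not regular even though the average degree is log2(n).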
Then, shortest routes are computed with Dijkstra's algorithm, modified
to choose the least saturated route if more than one has
the same length.
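
As an illustration of this tie-breaking rule, here is a minimal Python
sketch of Dijkstra's algorithm using the key (hops, saturation). It is not
Ray's implementation: the function name, the adjacency and load
dictionaries, and the definition of saturation as the sum of per-rank load
counters along the route are all assumptions made for the example.

```python
import heapq


def least_saturated_route(adjacency, load, source, target):
    """Dijkstra on (hops, saturation): fewest edges first, then lowest total load."""
    best = {source: (0, 0)}
    previous = {}
    queue = [(0, 0, source)]
    while queue:
        hops, saturation, vertex = heapq.heappop(queue)
        if (hops, saturation) > best[vertex]:
            continue  # stale queue entry, a better key was found already
        if vertex == target:
            break
        for neighbour in adjacency[vertex]:
            # The target's own load is added too, but it is the same for
            # every candidate route, so it does not affect the choice.
            candidate = (hops + 1, saturation + load[neighbour])
            if neighbour not in best or candidate < best[neighbour]:
                best[neighbour] = candidate
                previous[neighbour] = vertex
                heapq.heappush(queue, (candidate[0], candidate[1], neighbour))
    if target not in best:
        return None  # target is unreachable from source
    route = [target]
    while route[-1] != source:
        route.append(previous[route[-1]])
    route.reverse()
    return route


# Hypothetical 4-rank example: two routes of length 2 exist from 0 to 3.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
load = {0: 0, 1: 5, 2: 1, 3: 0}  # rank 1 already relays more traffic than rank 2
print(least_saturated_route(adjacency, load, 0, 3))  # prints [0, 2, 3]
```

Comparing tuples keeps the route length as the primary criterion, so the
load only decides between routes with the same number of edges.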
The route lengths are quite small.
# route length (edges)   number of routes   percentage of routes
0      512    0.195312%   # send a message to itself
1     4608    1.75781%    # send a message to a directly connected rank
2    37644   14.36%
3   152972   58.3542%     # most of them
4    65710   25.0664%
5      698    0.266266%
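
A table of this shape can be produced by running a breadth-first search
from every rank and counting the resulting distances, one route per ordered
(source, destination) pair; the counts above sum to 262144 = 512*512. The
sketch below is a hypothetical illustration on a small ring of 8 ranks, not
the 512-rank graph used in the tests, and route_length_histogram is a name
made up for the example.

```python
from collections import Counter, deque


def route_length_histogram(adjacency):
    """Count shortest-route lengths over all ordered (source, destination) pairs."""
    histogram = Counter()
    for source in adjacency:
        # Breadth-first search gives the shortest route length in edges.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            vertex = queue.popleft()
            for neighbour in adjacency[vertex]:
                if neighbour not in distance:
                    distance[neighbour] = distance[vertex] + 1
                    queue.append(neighbour)
        histogram.update(distance.values())
    return histogram


# Ring of 8 ranks: each rank is linked to its two neighbours.
n = 8
ring = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
histogram = route_length_histogram(ring)
total = sum(histogram.values())
for length in sorted(histogram):
    print(length, histogram[length], f"{100 * histogram[length] / total:.4f}%")
```

On the ring, the counts for lengths 0 to 4 are 8, 16, 16, 16 and 8, which
sum to 8*8 = 64 routes.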
So my question is:
Does that indicate where the real problem is on our super-computer?
Thanks a lot!
Also, would transparent message routing be easy to implement
directly in Open-MPI as a component?
On 26/09/11 08:46 AM, Yevgeny Kliteynik wrote:
> On 26-Sep-11 11:27 AM, Yevgeny Kliteynik wrote:
>> On 22-Sep-11 12:09 AM, Jeff Squyres wrote:
>>> On Sep 21, 2011, at 4:24 PM, Sébastien Boisvert wrote:
>>>>> What happens if you run 2 ibv_rc_pingpong's on each node? Or N ibv_rc_pingpongs?
>>>> With 11 ibv_rc_pingpong's
>>>> Code to do that => https://gist.github.com/1233173
>>>> Latencies are around 20 microseconds.
>>> This seems to imply that the network is to blame for the higher latency...?
>> Interesting... I'm getting the same latency with ibv_rc_pingpong.
>> I get 8.5 usec for a single ping-pong.
> BTW, I've just checked this with performance guys - ibv_rc_pingpong
> is not used for performance measurement but only as IB network
> sanity check, therefore it was never meant to give optimal performance.
> Use ib_write_lat instead.
> -- YK
>> Please run 'ibclearcounters' to reset fabric counters, then
>> ibdiagnet to make sure that the fabric is clean.
>> If you have 4x QDR cluster, run ibdiagnet as follows:
>> ibdiagnet --ls 10 --lw 4x
>> Check that you don't have any errors/warnings.
>> Then please run your script with ib_write_lat instead of ibv_rc_pingpong.
>> Just replace the command in the script and the rest would be fine.
>> If the fabric is clean, you're supposed to get typical
>> latency of ~1.4 usec.
>> -- YK
>>> I.e., if you run the same pattern with MPI processes and get 20us latency, that would tend to imply that the network itself is not performing well with that IO pattern.
>>>> My job seems to do well so far with ofud !
>>>> [sboisver12_at_colosse2 ray]$ qstat
>>>> job-ID prior name user state submit/start at queue slots ja-task-ID
>>>> 3047460 0.55384 fish-Assem sboisver12 r 09/21/2011 15:02:25 med_at_r104-n58 256
>>> I would still be suspicious -- ofud is not well tested, and it can definitely hang if there are network drops.