Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).
The same test program gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (name is guillimin).
Here is the relevant information listed at http://www.open-mpi.org/community/help/ :
1. Check the FAQ first.
2. The version of Open MPI that you're using.
3. The config.log file from the top-level Open MPI directory, if available (please compress!).
Command file: http://pastebin.com/mW32ntSJ
4. The output of the "ompi_info --all" command from the node where you're invoking mpirun.
ompi_info -a on colosse: http://pastebin.com/RPyY9s24
5. If running on more than one node -- especially if you're having problems launching Open MPI processes -- also include the output of the "ompi_info -v ompi full --parsable" command from each node on which you're trying to run.
I am not having problems launching Open-MPI processes.
6. A detailed description of what is failing.
Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).
The same test program gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (name is guillimin).
I am developing a distributed genome assembler that runs with the message-passing interface (I am a PhD student).
It is called Ray. Link: http://github.com/sebhtml/ray
I recently added the option -test-network-only so that Ray can be used to test the latency. Each MPI rank sends 100000 messages (4000 bytes each), one at a time.
The destination of each message is picked at random.
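For clarity, here is a minimal sketch of what such a test looks like. It is not Ray's actual code: the 100000 messages of 4000 bytes and the random destinations come from the description above, while the tags, the request/reply protocol and the timing method are assumptions made only for illustration.

// latency_sketch.cpp -- minimal sketch of a random-destination request/reply
// latency test in the spirit of Ray's -test-network-only option.
// NOT Ray's actual code: the message count (100000), the message size (4000
// bytes) and the random destinations come from the test description; the
// tags, the request/reply protocol and the timing are illustrative assumptions.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <list>
#include <vector>

enum { TAG_REQUEST = 0, TAG_REPLY = 1, TAG_DONE = 2 };
static const int MESSAGE_BYTES = 4000;
static const int MESSAGE_COUNT = 100000;

struct Reply { MPI_Request request; std::vector<char>* buffer; };

// Answer pending requests, count DONE notifications, reclaim finished replies.
static void service(std::list<Reply>& replies, int& doneCount) {
    int flag = 1;
    MPI_Status status;
    char dummy;
    while (flag) {
        flag = 0;
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &flag, &status);
        if (flag) {     // answer the request with MPI_Isend so we never block here
            Reply r;
            r.buffer = new std::vector<char>(MESSAGE_BYTES, 0);
            MPI_Recv(&(*r.buffer)[0], MESSAGE_BYTES, MPI_BYTE, status.MPI_SOURCE,
                     TAG_REQUEST, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Isend(&(*r.buffer)[0], MESSAGE_BYTES, MPI_BYTE, status.MPI_SOURCE,
                      TAG_REPLY, MPI_COMM_WORLD, &r.request);
            replies.push_back(r);
        }
        int done = 0;
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_DONE, MPI_COMM_WORLD, &done, &status);
        if (done) {
            MPI_Recv(&dummy, 0, MPI_BYTE, status.MPI_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            doneCount++;
            flag = 1;
        }
    }
    // Reclaim the buffers of replies whose MPI_Isend has completed.
    for (std::list<Reply>::iterator i = replies.begin(); i != replies.end();) {
        int done = 0;
        MPI_Test(&i->request, &done, MPI_STATUS_IGNORE);
        if (done) { delete i->buffer; i = replies.erase(i); }
        else ++i;
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::vector<char> requestBuffer(MESSAGE_BYTES, 0), replyBuffer(MESSAGE_BYTES, 0);
    std::list<Reply> replies;
    int doneCount = 0;
    double total = 0.0;
    srand(rank + 1);

    for (int i = 0; i < MESSAGE_COUNT; ++i) {
        int destination = rand() % size;            // destination picked at random
        if (destination == rank)                    // self-messages skipped here
            destination = (destination + 1) % size;
        double start = MPI_Wtime();
        MPI_Request sendRequest;
        MPI_Isend(&requestBuffer[0], MESSAGE_BYTES, MPI_BYTE, destination,
                  TAG_REQUEST, MPI_COMM_WORLD, &sendRequest);
        // Wait for the reply while still answering other ranks' requests;
        // otherwise two ranks waiting on each other would deadlock.
        int gotReply = 0;
        MPI_Status status;
        while (!gotReply) {
            int sent;
            MPI_Test(&sendRequest, &sent, MPI_STATUS_IGNORE);   // progress our send
            service(replies, doneCount);
            MPI_Iprobe(MPI_ANY_SOURCE, TAG_REPLY, MPI_COMM_WORLD, &gotReply, &status);
            if (gotReply)
                MPI_Recv(&replyBuffer[0], MESSAGE_BYTES, MPI_BYTE, status.MPI_SOURCE,
                         TAG_REPLY, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Wait(&sendRequest, MPI_STATUS_IGNORE);
        total += MPI_Wtime() - start;
    }

    // Tell every other rank we are finished, then keep answering their
    // requests until they are all finished too.
    for (int peer = 0; peer < size; ++peer)
        if (peer != rank)
            MPI_Send(&requestBuffer[0], 0, MPI_BYTE, peer, TAG_DONE, MPI_COMM_WORLD);
    while (doneCount < size - 1 || !replies.empty())
        service(replies, doneCount);

    printf("Rank %d: average round trip: %.2f microseconds\n",
           rank, 1e6 * total / MESSAGE_COUNT);
    MPI_Finalize();
    return 0;
}

In this sketch the reported number is the average round-trip time per message; replies are posted with MPI_Isend so that a rank never blocks in a send while another rank is waiting on it.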
On colosse, a super-computer located at Laval University, I get an average latency of 250 microseconds with the test done in Ray.
On colosse, the hardware is Mellanox Infiniband QDR ConnectX and the MPI middleware is Open-MPI 1.4.3 compiled with gcc 4.4.2.
colosse has 8 compute cores per node (Intel Nehalem).
Testing the latency with ibv_rc_pingpong on colosse gives 11 microseconds.
local address: LID 0x048e, QPN 0x1c005c, PSN 0xf7c66b
remote address: LID 0x018c, QPN 0x2c005c, PSN 0x5428e6
8192000 bytes in 0.01 seconds = 5776.64 Mbit/sec
1000 iters in 0.01 seconds = 11.35 usec/iter
So, based on the ibv_rc_pingpong output, I know that the Infiniband fabric itself has the expected latency between two HCAs.
Adding the parameter --mca btl_openib_verbose 1 to mpirun shows that Open-MPI detects the hardware correctly:
[r107-n57][[59764,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 26428
[r107-n57][[59764,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Hermon
So I don't think this is the problem described in the FAQ ( http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx-poor-latency )
and on the mailing list ( http://www.open-mpi.org/community/lists/users/2007/10/4238.php ) because the INI values are found.
Running the network test implemented in Ray on 32 MPI ranks, I get an average latency of 65 microseconds.
Thus, with 256 MPI ranks I get an average latency of 250 microseconds and with 32 MPI ranks I get 65 microseconds.
Running the network test on 32 MPI ranks again but only allowing the MPI rank 0 to send messages gives a latency of 10 microseconds for this rank.
Because I get 10 microseconds in Ray's network test when only MPI rank 0 sends messages, I would say that there may be some I/O contention.
To test this hypothesis, I re-ran the test, but allowed only 1 MPI rank per node to send messages (there are 8 MPI ranks per node and a total of 32 MPI ranks).
Ranks 0, 8, 16 and 24 all reported 13 microseconds. See http://pastebin.com/h84Fif3g
The next test was to allow 2 MPI ranks on each node to send messages. Ranks 0, 1, 8, 9, 16, 17, 24, and 25 reported 15 microseconds.
With 3 MPI ranks per node that can send messages, ranks 0, 1, 2, 8, 9, 10, 16, 17, 18, 24, 25, 26 reported 20 microseconds. See http://pastebin.com/TCd6xpuC
Finally, with 4 MPI ranks per node that can send messages, I got 23 microseconds. See http://pastebin.com/V8zjae7s
So the MPI ranks on a given node seem to fight for access to the HCA port.
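For reference, the "k senders per node" restriction used in the runs above can be expressed as a simple predicate on the rank number. The sketch below assumes that 8 consecutive MPI ranks are placed on each node, which matches the rank numbers reported above (0, 8, 16, 24, ...); it is an illustration, not the exact code used in Ray.

// sender_selection.cpp -- sketch of how the "k senders per node" runs can
// select which ranks are allowed to send test messages.
// Assumes block placement: 8 consecutive MPI ranks per node (an assumption
// consistent with the rank numbers reported above, not a verified fact).
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int ranksPerNode = 8;       // colosse: 8 compute cores per node
    const int sendersPerNode = 2;     // vary this: 1, 2, 3, 4, ..., 8

    // With block placement, rank % ranksPerNode is the core index on the node.
    bool isSender = (rank % ranksPerNode) < sendersPerNode;

    if (isSender)
        printf("Rank %d of %d sends test messages\n", rank, size);
    else
        printf("Rank %d of %d only answers incoming requests\n", rank, size);

    MPI_Finalize();
    return 0;
}

With sendersPerNode set to 1 on a 32-rank job, this predicate selects ranks 0, 8, 16 and 24, which are exactly the ranks that reported latencies in the first of these runs.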
Each colosse node has 1 port (ibv_devinfo) and the max_mtu is 2048 bytes. See http://pastebin.com/VXMAZdeZ
At this point, one might suspect a bug in the network test itself, so I ran the same code on another super-computer.
On guillimin, a super-computer located at McGill University, I get an average latency (with Ray -test-network-only) of 10 microseconds when running Ray on 512 MPI ranks.
On guillimin, the hardware is QLogic Infiniband QDR and the MPI middleware is MVAPICH2 1.6.
Thus, I know that the network test in Ray works as expected because results on guillimin show a latency of 10 microseconds for 512 MPI ranks.
guillimin also has 8 compute cores per node (Intel Nehalem).
On guillimin, each node has one port (ibv_devinfo) and the max_mtu of HCAs is 4096 bytes. See http://pastebin.com/35T8N5t8
In Ray, only the following MPI functions are utilised:
7. Please include information about your network:
7.1. Which OpenFabrics version are you running?
7.2. What distro and version of Linux are you running? What is your kernel version?
CentOS release 5.6 (Final)
Linux colosse1 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
7.3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific subnet manager, etc.)
7.4. What is the output of the ibv_devinfo command
state: active (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
7.5. What is the output of the ifconfig command
Not using IPoIB.
7.6. If running under Bourne shells, what is the output of the "ulimit -l" command?
[sboisver12_at_colosse1 ~]$ ulimit -l
The two differences I see between guillimin and colosse are:
- Open-MPI 1.4.3 (colosse) v. MVAPICH2 1.6 (guillimin)
- Mellanox (colosse) v. QLogic (guillimin)
Has anyone experienced such high latency with Open-MPI 1.4.3 on Mellanox HCAs?
Thank you for your time.