
Open MPI User's Mailing List Archives


Subject: [OMPI users] Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks
From: Sébastien Boisvert (sebastien.boisvert.3_at_[hidden])
Date: 2011-09-17 18:59:31


Hello,

Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).

The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware on 512 MPI ranks on super-computer B (name is guillimin).

Here is the relevant information listed in http://www.open-mpi.org/community/help/

1. Check the FAQ first.

Done!

2. The version of Open MPI that you're using.

Open-MPI 1.4.3

3. The config.log file from the top-level Open MPI directory, if available (please compress!).

See below.

Command file: http://pastebin.com/mW32ntSJ

4. The output of the "ompi_info --all" command from the node where you're invoking mpirun.

ompi_info -a on colosse: http://pastebin.com/RPyY9s24

5. If running on more than one node -- especially if you're having problems launching Open MPI processes -- also include the output of the "ompi_info -v ompi full --parsable" command from each node on which you're trying to run.

I am not having problems launching Open-MPI processes.

6. A detailed description of what is failing.

Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).

The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware on 512 MPI ranks on super-computer B (name is guillimin).

Details follow.

I am developing a distributed genome assembler that runs with the message-passing interface (I am a PhD student).
It is called Ray. Link: http://github.com/sebhtml/ray

I recently added the option -test-network-only so that Ray can be used to test the latency. Each MPI rank has to send 100000 messages (4000 bytes each), one by one.
The destination of each message is picked at random.
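The measurement loop roughly amounts to the following (a minimal Python sketch, not Ray's actual C++ code; `send_and_wait_reply` is a hypothetical stand-in for the MPI_Isend/reply exchange, and drawing the destination uniformly over all ranks is an assumption):

```python
import random
import time

RANK_COUNT = 256            # MPI ranks in the colosse run
MESSAGES_PER_RANK = 100000  # messages each rank sends in the test
MESSAGE_BYTES = 4000        # payload size per message

def run_latency_test(send_and_wait_reply, size=RANK_COUNT,
                     count=MESSAGES_PER_RANK, seed=0):
    """Send `count` messages one by one, each to a random destination
    rank, and return the mean time per message in microseconds.
    `send_and_wait_reply(dest)` stands in for the real MPI exchange
    (MPI_Isend plus waiting for the reply); the name is hypothetical."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(count):
        dest = rng.randrange(size)      # random destination rank
        t0 = time.perf_counter()
        send_and_wait_reply(dest)       # blocks until the reply arrives
        total += time.perf_counter() - t0
    return total / count * 1e6          # seconds -> microseconds
```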

On colosse, a super-computer located at Laval University, I get an average latency of 250 microseconds with the test done in Ray.

See http://pastebin.com/9nyjSy5z

On colosse, the hardware is Mellanox Infiniband QDR ConnectX and the MPI middleware is Open-MPI 1.4.3 compiled with gcc 4.4.2.

colosse has 8 compute cores per node (Intel Nehalem).

Testing the latency with ibv_rc_pingpong on colosse gives 11 microseconds.

  local address: LID 0x048e, QPN 0x1c005c, PSN 0xf7c66b
  remote address: LID 0x018c, QPN 0x2c005c, PSN 0x5428e6
8192000 bytes in 0.01 seconds = 5776.64 Mbit/sec
1000 iters in 0.01 seconds = 11.35 usec/iter
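As a quick consistency check (my arithmetic, not part of the pingpong output): the "0.01 seconds" is rounded for display, but the throughput figure lets us recover the actual per-iteration time:

```python
# Recover the elapsed time from the throughput figure; the "0.01 seconds"
# printed by ibv_rc_pingpong is rounded for display.
bits = 8192000 * 8                    # bytes transferred -> bits
elapsed = bits / 5776.64e6            # seconds, from 5776.64 Mbit/sec
usec_per_iter = elapsed / 1000 * 1e6  # 1000 iterations
print(round(usec_per_iter, 2))        # 11.35, matching the reported figure
```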

So I know from the output of ibv_rc_pingpong that the Infiniband fabric has the expected latency between two HCAs.

Adding the parameter --mca btl_openib_verbose 1 to mpirun shows that Open-MPI detects the hardware correctly:

[r107-n57][[59764,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 26428
[r107-n57][[59764,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Hermon

see http://pastebin.com/pz03f0B3

So I don't think this is the problem described in the FAQ ( http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx-poor-latency )
and on the mailing list ( http://www.open-mpi.org/community/lists/users/2007/10/4238.php ) because the INI values are found.

Running the network test implemented in Ray on 32 MPI ranks, I get an average latency of 65 microseconds.

See http://pastebin.com/nWDmGhvM

Thus, with 256 MPI ranks I get an average latency of 250 microseconds and with 32 MPI ranks I get 65 microseconds.

Running the network test on 32 MPI ranks again, but allowing only MPI rank 0 to send messages, gives a latency of 10 microseconds for this rank.
See http://pastebin.com/dWMXsHpa

Because I get 10 microseconds in the network test in Ray when only MPI rank 0 sends messages, I would say that there may be some I/O contention.

To test this hypothesis, I re-ran the test, but allowed only 1 MPI rank per node to send messages (there are 8 MPI ranks per node and a total of 32 MPI ranks).
Ranks 0, 8, 16 and 24 all reported 13 microseconds. See http://pastebin.com/h84Fif3g

The next test was to allow 2 MPI ranks on each node to send messages. Ranks 0, 1, 8, 9, 16, 17, 24, and 25 reported 15 microseconds.
See http://pastebin.com/REdhJXkS

With 3 MPI ranks per node that can send messages, ranks 0, 1, 2, 8, 9, 10, 16, 17, 18, 24, 25, 26 reported 20 microseconds. See http://pastebin.com/TCd6xpuC

Finally, with 4 MPI ranks per node that can send messages, I got 23 microseconds. See http://pastebin.com/V8zjae7s
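Putting the four measurements together (my own back-of-the-envelope arithmetic, not an additional test): the latency grows roughly linearly with the number of concurrent senders per node, at about 3 microseconds per extra sender:

```python
# Ray latencies (microseconds) reported above vs. senders per node.
senders = [1, 2, 3, 4]
latency_us = [13, 15, 20, 23]

# Average latency added by each extra sender on the same node.
per_sender = (latency_us[-1] - latency_us[0]) / (senders[-1] - senders[0])
print(round(per_sender, 1))  # 3.3
```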

So the MPI ranks on a given node seem to fight for access to the HCA port.

Each colosse node has 1 port (ibv_devinfo) and the max_mtu is 2048 bytes. See http://pastebin.com/VXMAZdeZ

At this point, one might suspect a bug in the network test itself. So I tested the same code on another super-computer.

On guillimin, a super-computer located at McGill University, I get an average latency (with Ray -test-network-only) of 10 microseconds when running Ray on 512 MPI ranks.

See http://pastebin.com/nCKF8Xg6

On guillimin, the hardware is Qlogic Infiniband QDR and the MPI middleware is MVAPICH2 1.6.

Thus, I know that the network test in Ray works as expected because results on guillimin show a latency of 10 microseconds for 512 MPI ranks.

guillimin also has 8 compute cores per node (Intel Nehalem).

On guillimin, each node has one port (ibv_devinfo) and the max_mtu of HCAs is 4096 bytes. See http://pastebin.com/35T8N5t8
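One quantifiable difference between the two fabrics (my arithmetic, not a measurement): a 4000-byte test message fits in a single 4096-byte MTU packet on guillimin, but must be split into two packets with colosse's 2048-byte MTU:

```python
import math

MESSAGE_BYTES = 4000  # size of each Ray test message

colosse_packets = math.ceil(MESSAGE_BYTES / 2048)    # 2048-byte MTU (Mellanox)
guillimin_packets = math.ceil(MESSAGE_BYTES / 4096)  # 4096-byte MTU (QLogic)
print(colosse_packets, guillimin_packets)  # 2 1
```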

In Ray, only the following MPI functions are utilised:

- MPI_Init
- MPI_Comm_rank
- MPI_Comm_size
- MPI_Finalize

- MPI_Isend

- MPI_Request_free
- MPI_Test
- MPI_Get_count
- MPI_Start
- MPI_Recv_init
- MPI_Cancel

- MPI_Get_processor_name

7. Please include information about your network:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot

Type: Infiniband

  7.1. Which OpenFabrics version are you running?

ofed-scripts-1.4.2-0_sunhpc1

libibverbs-1.1.3-2.el5
libibverbs-utils-1.1.3-2.el5
libibverbs-devel-1.1.3-2.el5

  7.2. What distro and version of Linux are you running? What is your kernel version?

CentOS release 5.6 (Final)

Linux colosse1 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

  7.3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific subnet manager, etc.)

opensm-libs-3.3.3-1.el5_6.1

  7.4. What is the output of the ibv_devinfo command

    hca_id: mlx4_0
            fw_ver: 2.7.000
            node_guid: 5080:0200:008d:8f88
            sys_image_guid: 5080:0200:008d:8f8b
            vendor_id: 0x02c9
            vendor_part_id: 26428
            hw_ver: 0xA0
            board_id: X6275_QDR_IB_2.5
            phys_port_cnt: 1
                    port: 1
                            state: active (4)
                            max_mtu: 2048 (4)
                            active_mtu: 2048 (4)
                            sm_lid: 1222
                            port_lid: 659
                            port_lmc: 0x00

  7.5. What is the output of the ifconfig command

  Not using IPoIB.

  7.6. If running under Bourne shells, what is the output of the "ulimit -l" command?

[sboisver12_at_colosse1 ~]$ ulimit -l
6000000

The two differences I see between guillimin and colosse are

- Open-MPI 1.4.3 (colosse) v. MVAPICH2 1.6 (guillimin)
- Mellanox (colosse) v. QLogic (guillimin)

Has anyone experienced such high latency with Open-MPI 1.4.3 on Mellanox HCAs?

Thank you for your time.

                Sébastien Boisvert