
Subject: Re: [OMPI users] Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks
From: Yevgeny Kliteynik (kliteyn_at_[hidden])
Date: 2011-09-20 08:14:44


Hi Sébastien,

If I understand you correctly, you are running your application on two
different MPIs on two different clusters with two different IB vendors.

Could you make the comparison more "apples to apples"?
For instance:
 - run the same version of Open MPI on both clusters
 - run the same version of MVAPICH on both clusters

-- YK

On 18-Sep-11 1:59 AM, Sébastien Boisvert wrote:
> Hello,
>
> Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).
>
> The same application gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (name is guillimin).
>
>
> Here is the relevant information listed at http://www.open-mpi.org/community/help/
>
>
> 1. Check the FAQ first.
>
> Done!
>
>
> 2. The version of Open MPI that you're using.
>
> Open-MPI 1.4.3
>
>
> 3. The config.log file from the top-level Open MPI directory, if available (please compress!).
>
> See below.
>
> Command file: http://pastebin.com/mW32ntSJ
>
>
> 4. The output of the "ompi_info --all" command from the node where you're invoking mpirun.
>
> ompi_info -a on colosse: http://pastebin.com/RPyY9s24
>
>
> 5. If running on more than one node -- especially if you're having problems launching Open MPI processes -- also include the output of the "ompi_info -v ompi full --parsable" command from each node on which you're trying to run.
>
> I am not having problems launching Open-MPI processes.
>
>
> 6. A detailed description of what is failing.
>
> Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).
>
> The same application gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (name is guillimin).
>
> Details follow.
>
>
> I am developing a distributed genome assembler that runs with the message-passing interface (I am a PhD student).
> It is called Ray. Link: http://github.com/sebhtml/ray
>
> I recently added the option -test-network-only so that Ray can be used to test the latency. Each MPI rank has to send 100000 messages (4000 bytes each), one by one.
> The destination of each message is picked at random.
>
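A minimal sketch of how such a random-destination latency test can be structured with the MPI calls listed later in this message is below. This is not the actual Ray source: it assumes "latency" means the request/reply round trip, the tag values and buffer handling are made up, and the termination protocol a real test needs is omitted.

// Sketch only, not the Ray implementation.  Each rank repeatedly Isends a
// 4000-byte request to a random rank, echoes back any request it receives,
// and measures the time until the reply to its own request arrives.

#include <mpi.h>
#include <cstdio>
#include <cstdlib>

static const int MESSAGE_BYTES = 4000;
static const int NUM_MESSAGES  = 100000;
static const int TAG_REQUEST   = 0;   // hypothetical tag values
static const int TAG_REPLY     = 1;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char requestBuffer[MESSAGE_BYTES] = {};   // payload content is irrelevant
    char replyBuffer[MESSAGE_BYTES]   = {};
    char receiveBuffer[MESSAGE_BYTES];

    // One persistent receive, re-armed after every incoming message.
    MPI_Request receiveRequest;
    MPI_Recv_init(receiveBuffer, MESSAGE_BYTES, MPI_BYTE, MPI_ANY_SOURCE,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &receiveRequest);
    MPI_Start(&receiveRequest);

    srand(rank + 1);
    double totalSeconds = 0.0;
    int completed = 0;

    // Post the first request.
    int destination = rand() % size;
    double startTime = MPI_Wtime();
    MPI_Request sendRequest;
    MPI_Isend(requestBuffer, MESSAGE_BYTES, MPI_BYTE, destination,
              TAG_REQUEST, MPI_COMM_WORLD, &sendRequest);
    MPI_Request_free(&sendRequest);   // legal: the send still completes

    while (completed < NUM_MESSAGES) {
        int flag = 0;
        MPI_Status status;
        MPI_Test(&receiveRequest, &flag, &status);
        if (!flag)
            continue;

        if (status.MPI_TAG == TAG_REQUEST) {
            // Another rank is measuring against us: echo a reply.
            MPI_Isend(replyBuffer, MESSAGE_BYTES, MPI_BYTE,
                      status.MPI_SOURCE, TAG_REPLY, MPI_COMM_WORLD,
                      &sendRequest);
            MPI_Request_free(&sendRequest);
        } else {
            // The reply to our own request: record the round trip,
            // then post the next request.
            totalSeconds += MPI_Wtime() - startTime;
            completed++;
            if (completed < NUM_MESSAGES) {
                destination = rand() % size;
                startTime = MPI_Wtime();
                MPI_Isend(requestBuffer, MESSAGE_BYTES, MPI_BYTE, destination,
                          TAG_REQUEST, MPI_COMM_WORLD, &sendRequest);
                MPI_Request_free(&sendRequest);
            }
        }
        MPI_Start(&receiveRequest);   // re-arm the persistent receive
    }

    printf("Rank %d: average round trip: %f microseconds\n",
           rank, 1e6 * totalSeconds / NUM_MESSAGES);

    // A real test needs a termination step so that every rank keeps
    // answering requests until all ranks are done; omitted here.
    MPI_Cancel(&receiveRequest);
    MPI_Wait(&receiveRequest, MPI_STATUS_IGNORE);
    MPI_Request_free(&receiveRequest);

    MPI_Finalize();
    return 0;
}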
>
> On colosse, a super-computer located at Laval University, I get an average latency of 250 microseconds with the test done in Ray.
>
> See http://pastebin.com/9nyjSy5z
>
> On colosse, the hardware is Mellanox Infiniband QDR ConnectX and the MPI middleware is Open-MPI 1.4.3 compiled with gcc 4.4.2.
>
> colosse has 8 compute cores per node (Intel Nehalem).
>
>
> Testing the latency with ibv_rc_pingpong on colosse gives 11 microseconds.
>
> local address: LID 0x048e, QPN 0x1c005c, PSN 0xf7c66b
> remote address: LID 0x018c, QPN 0x2c005c, PSN 0x5428e6
> 8192000 bytes in 0.01 seconds = 5776.64 Mbit/sec
> 1000 iters in 0.01 seconds = 11.35 usec/iter
>
> So I know from the ibv_rc_pingpong output that the Infiniband fabric has the expected latency between two HCAs.
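For reference, ibv_rc_pingpong is run pairwise: the server side with no arguments on one node, and the client pointed at that node from a second one, for example:

  ibv_rc_pingpong                  # on the first node
  ibv_rc_pingpong <first-node>     # on the second node

The 8192000 bytes reported above is 2 x 4096 x 1000, i.e. both directions of a 4096-byte message over the 1000 iterations (the tool's defaults).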
>
>
>
> Adding the parameter --mca btl_openib_verbose 1 to mpirun shows that Open-MPI detects the hardware correctly:
>
> [r107-n57][[59764,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 26428
> [r107-n57][[59764,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Hermon
>
> see http://pastebin.com/pz03f0B3
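For completeness, the flag goes on the mpirun command line before the executable; something like the following (the executable path is a placeholder here):

  mpirun --mca btl_openib_verbose 1 -np 256 ./Ray -test-network-only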
>
>
> So I don't think this is the problem described in the FAQ ( http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx-poor-latency )
> and on the mailing list ( http://www.open-mpi.org/community/lists/users/2007/10/4238.php ) because the INI values are found.
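A quick way to double-check which openib settings actually end up in effect on colosse (independently of the INI lookup shown above) is to dump the BTL parameters with ompi_info, for example:

  ompi_info --param btl openib | grep -E 'receive_queues|mtu|eager_limit'

The grep pattern is only a suggestion; any of the btl_openib_* values can be inspected this way.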
>
>
>
>
> Running the network test implemented in Ray on 32 MPI ranks, I get an average latency of 65 microseconds.
>
> See http://pastebin.com/nWDmGhvM
>
>
> Thus, with 256 MPI ranks I get an average latency of 250 microseconds and with 32 MPI ranks I get 65 microseconds.
>
>
> Running the network test on 32 MPI ranks again, but only allowing MPI rank 0 to send messages, gives a latency of 10 microseconds for this rank.
> See http://pastebin.com/dWMXsHpa
>
>
>
> Because I get 10 microseconds in the network test in Ray when only MPI rank 0 sends messages, I would say that there may be some I/O contention.
>
> To test this hypothesis, I re-ran the test, but allowed only 1 MPI rank per node to send messages (there are 8 MPI ranks per node and a total of 32 MPI ranks).
> Ranks 0, 8, 16 and 24 all reported 13 microseconds. See http://pastebin.com/h84Fif3g
>
> The next test was to allow 2 MPI ranks on each node to send messages. Ranks 0, 1, 8, 9, 16, 17, 24, and 25 reported 15 microseconds.
> See http://pastebin.com/REdhJXkS
>
> With 3 MPI ranks per node that can send messages, ranks 0, 1, 2, 8, 9, 10, 16, 17, 18, 24, 25, 26 reported 20 microseconds. See http://pastebin.com/TCd6xpuC
>
> Finally, with 4 MPI ranks per node that can send messages, I got 23 microseconds. See http://pastebin.com/V8zjae7s
>
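For what it's worth, the "N senders per node" selection used in these runs can be expressed with a one-line filter. The helper below is a hypothetical sketch (not Ray code) that assumes 8 ranks per node and node-contiguous rank placement, which matches the rank numbers reported above:

// Hypothetical helper, not from Ray: true if this rank should send messages
// in a run with sendersPerNode senders per node, assuming ranks 0..7 share
// the first node, 8..15 the second, and so on.
bool rankIsSender(int rank, int ranksPerNode, int sendersPerNode) {
    return (rank % ranksPerNode) < sendersPerNode;
}

// Example: rankIsSender(rank, 8, 2) selects ranks 0, 1, 8, 9, 16, 17, 24
// and 25, matching the two-senders-per-node run above.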
>
> So the MPI ranks on a given node seem to fight for access to the HCA port.
>
> Each colosse node has 1 port (ibv_devinfo) and the max_mtu is 2048 bytes. See http://pastebin.com/VXMAZdeZ
>
>
>
>
>
>
> At this point, one might suspect a bug in the network test itself. So I tested the same code on another super-computer.
>
> On guillimin, a super-computer located at McGill University, I get an average latency (with Ray -test-network-only) of 10 microseconds when running Ray on 512 MPI ranks.
>
> See http://pastebin.com/nCKF8Xg6
>
> On guillimin, the hardware is QLogic Infiniband QDR and the MPI middleware is MVAPICH2 1.6.
>
> Thus, I know that the network test in Ray works as expected because results on guillimin show a latency of 10 microseconds for 512 MPI ranks.
>
> guillimin also has 8 compute cores per node (Intel Nehalem).
>
> On guillimin, each node has one port (ibv_devinfo) and the max_mtu of HCAs is 4096 bytes. See http://pastebin.com/35T8N5t8
>
>
>
>
>
>
>
>
> In Ray, only the following MPI functions are utilised:
>
> - MPI_Init
> - MPI_Comm_rank
> - MPI_Comm_size
> - MPI_Finalize
>
> - MPI_Isend
>
> - MPI_Request_free
> - MPI_Test
> - MPI_Get_count
> - MPI_Start
> - MPI_Recv_init
> - MPI_Cancel
>
> - MPI_Get_processor_name
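A side note on that list (a general MPI observation, not a claim about how Ray works internally): there is no MPI_Wait or MPI_Waitall, which suggests sends are completed via MPI_Request_free. That is legal, but the sender can then only infer completion indirectly. A minimal sketch of the idiom:

#include <mpi.h>

// Hypothetical helper illustrating the MPI_Isend + MPI_Request_free idiom
// suggested by the list above (names are made up for the sketch).
void postMessage(void* buffer, int bytes, int destination, int tag) {
    MPI_Request request;
    MPI_Isend(buffer, bytes, MPI_BYTE, destination, tag,
              MPI_COMM_WORLD, &request);
    // Legal per the MPI standard: the send still completes, but its
    // completion can only be inferred indirectly (for example, from the
    // arrival of a reply), so 'buffer' must not be reused before then.
    MPI_Request_free(&request);
}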
>
>
>
>
> 7. Please include information about your network:
> http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
>
> Type: Infiniband
>
> 7.1. Which OpenFabrics version are you running?
>
>
> ofed-scripts-1.4.2-0_sunhpc1
>
> libibverbs-1.1.3-2.el5
> libibverbs-utils-1.1.3-2.el5
> libibverbs-devel-1.1.3-2.el5
>
>
> 7.2. What distro and version of Linux are you running? What is your kernel version?
>
>
> CentOS release 5.6 (Final)
>
> Linux colosse1 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
>
>
> 7.3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific subnet manager, etc.)
>
> opensm-libs-3.3.3-1.el5_6.1
>
> 7.4. What is the output of the ibv_devinfo command
>
> hca_id: mlx4_0
>         fw_ver:             2.7.000
>         node_guid:          5080:0200:008d:8f88
>         sys_image_guid:     5080:0200:008d:8f8b
>         vendor_id:          0x02c9
>         vendor_part_id:     26428
>         hw_ver:             0xA0
>         board_id:           X6275_QDR_IB_2.5
>         phys_port_cnt:      1
>                 port:   1
>                         state:          active (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         1222
>                         port_lid:       659
>                         port_lmc:       0x00
>
>
>
> 7.5. What is the output of the ifconfig command
>
> Not using IPoIB.
>
> 7.6. If running under Bourne shells, what is the output of the "ulimit -l" command?
>
> [sboisver12_at_colosse1 ~]$ ulimit -l
> 6000000
>
>
>
>
>
>
>
> The two differences I see between guillimin and colosse are
>
> - Open-MPI 1.4.3 (colosse) v. MVAPICH2 1.6 (guillimin)
> - Mellanox (colosse) v. QLogic (guillimin)
>
>
> Has anyone experienced such high latency with Open-MPI 1.4.3 on Mellanox HCAs?
>
>
>
>
>
>
> Thank you for your time.
>
>
> Sébastien Boisvert
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>