
Open MPI Development Mailing List Archives


Subject: [OMPI devel] Re: Re: doubt on latency result with OpenMPI library
From: Wang,Yanfei(SYS) (wangyanfei01_at_[hidden])
Date: 2014-03-27 06:44:36


Hi,

Update:
If I explicitly assign --mca btl tcp,sm,self, the traffic goes over the 10G TCP/IP link instead of the 40G RDMA link, and the TCP/IP latency averages 22 us, which is reasonable.
[root_at_bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node --mca btl tcp,sm,self osu_latency
# OSU MPI Latency Test v4.3
# Size Latency (us)
0 22.07
1 22.48
2 22.38
4 22.39
8 22.52
16 22.52
32 22.59
64 22.73
128 23.01
256 24.32
512 28.50
1024 31.06
2048 56.06
4096 68.53
8192 77.09
16384 105.23
32768 143.51
65536 229.79
131072 285.28
262144 423.26
524288 693.82
1048576 1634.03
2097152 3311.69
4194304 7055.16

The conclusion is that with "--mca btl tcp,sm,self" the traffic does go over the 10G TCP/IP link, but mpirun selects the RDMA link by default even though the hostfile uses the 10G IP addresses and "--mca btl openib,sm,self" was never given!
So, how should I understand that "--hostfile" alone does not control the link, and how can multi-HCA (multi-NIC) traffic be controlled in the MPI library?
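For what it's worth, interface selection in Open MPI 1.x is usually pinned down with per-BTL MCA parameters rather than the hostfile; a sketch (the device name mlx4_0:1 is a placeholder for this setup, check yours with ibv_devinfo):

```shell
# Force the RoCE link: enable only the openib BTL and restrict it to one HCA port
mpirun --hostfile hosts -np 2 --map-by node \
    --mca btl openib,sm,self \
    --mca btl_openib_if_include mlx4_0:1 \
    --mca btl_openib_cpc_include rdmacm osu_latency

# Force the 10G TCP link: enable only the tcp BTL and pin it to the 10G subnet
mpirun --hostfile hosts -np 2 --map-by node \
    --mca btl tcp,sm,self \
    --mca btl_tcp_if_include 192.168.71.0/24 osu_latency
```

The hostfile only tells mpirun where to launch processes (via name resolution); each BTL then picks its own fabric at MPI_Init, which would explain why changing /etc/hosts alone does not move the traffic.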

Besides, the following command shows no parameters for controlling the RDMA transport, only the TCP ones.

[root_at_bb-nsi-ib04 pt2pt]# ompi_info --param btl all
                 MCA btl: parameter "btl_tcp_if_include" (current value: "",
                          data source: default, level: 1 user/basic, type:
                          string)
                          Comma-delimited list of devices and/or CIDR
                          notation of networks to use for MPI communication
                          (e.g., "eth0,192.168.0.0/16"). Mutually exclusive
                          with btl_tcp_if_exclude.
                 MCA btl: parameter "btl_tcp_if_exclude" (current value:
                          "127.0.0.1/8,sppp", data source: default, level: 1
                          user/basic, type: string)
                          Comma-delimited list of devices and/or CIDR
                          notation of networks to NOT use for MPI
                          communication -- all devices not matching these
                          specifications will be used (e.g.,
                          "eth0,192.168.0.0/16"). If set to a non-default
                          value, it is mutually exclusive with
                          btl_tcp_if_include.
[root_at_bb-nsi-ib04 pt2pt]#
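A likely reason no openib parameters appear: since Open MPI 1.7, ompi_info prints only level-1 ("user/basic") parameters by default, and most openib tunables sit at higher levels. A sketch (assuming the openib component is built in):

```shell
# Show all parameter levels for the openib BTL
ompi_info --param btl openib --level 9

# Or dump every parameter of every component
ompi_info --all
```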

I hope to gain a deeper understanding of this.

Thanks
--Yanfei

From: devel [mailto:devel-bounces_at_[hidden]] on behalf of Wang,Yanfei(SYS)
Sent: March 27, 2014 18:17
To: Open MPI Developers
Subject: [OMPI devel] Re: doubt on latency result with OpenMPI library

Hi,

"--map-by node" does take care of that concern.
---
Configuration:
Even when using the mpirun --hostfile (with the 10G IP addresses) to try to steer traffic onto the 10G TCP/IP network, the latency is still 5 us in both situations!
[root_at_bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.71.3 ib03
192.168.71.4 ib04
[root_at_bb-nsi-ib04 pt2pt]# ifconfig
eth0 Link encap:Ethernet HWaddr 20:0B:C7:26:3F:C3
          inet addr:192.168.71.4 Bcast:192.168.71.255 Mask:255.255.255.0
          inet6 addr: fe80::220b:c7ff:fe26:3fc3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:834635 errors:0 dropped:0 overruns:0 frame:0
          TX packets:339853 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:681908607 (650.3 MiB) TX bytes:103031295 (98.2 MiB)
The 10G eth0 is not an RDMA-capable NIC.


a. Using the openib BTL explicitly:
[root_at_bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
# OSU MPI Latency Test v4.3
# Size Latency (us)
0 5.20
1 5.36
2 5.31
4 5.34
8 5.46
16 5.35
32 5.44
64 5.48
128 6.74
256 6.87
512 7.05
1024 7.52
2048 8.38
4096 10.36
8192 14.18
16384 23.69
32768 31.91
65536 38.89
131072 47.76
262144 80.42
524288 137.52
1048576 251.81
2097152 485.23
4194304 948.08

b. With no explicit RDMA setting:
[root_at_bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node osu_latency
# OSU MPI Latency Test v4.3
# Size Latency (us)
0 5.23
1 5.28
2 5.21
4 5.33
8 5.33
16 5.36
32 5.33
64 5.41
128 6.74
256 6.98
512 7.11
1024 7.47
2048 8.46
4096 10.38
8192 14.30
16384 21.20
32768 31.21
65536 39.85
131072 47.70
262144 80.24
524288 137.59
1048576 251.62
2097152 485.14
4194304 945.80
[root_at_bb-nsi-ib04 pt2pt]#

I found that the bandwidth reported by the osu_bw benchmark matches the 40G RDMA HCA, so I suspect the traffic always goes over the 40G RDMA link and the attempt to steer it onto the TCP/IP link does not take effect.
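One way to check which link actually carries the benchmark traffic is to watch the per-port counters on both interfaces while the test runs; a sketch (the device name mlx4_0 and port number are assumptions for a Mellanox setup):

```shell
# RDMA traffic: the RoCE port counters should increase during the run
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data

# TCP traffic: the 10G Ethernet interface counters should increase instead
cat /sys/class/net/eth0/statistics/rx_bytes
```

If the osu_bw number is near 40G line rate, the bytes should show up on the RoCE port counters rather than on eth0.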

I will consult the FAQ for details; any further suggestions are welcome.

Thanks
--Yanfei
From: devel [mailto:devel-bounces_at_[hidden]] on behalf of Ralph Castain
Sent: March 27, 2014 18:05
To: Open MPI Developers
Subject: Re: [OMPI devel] doubt on latency result with OpenMPI library

Try adding "--map-by node" to your command line to ensure the procs really are running on separate nodes.

On Thu, Mar 27, 2014 at 1:40 AM, Wang,Yanfei(SYS) <wangyanfei01_at_[hidden]> wrote:
Hi,

HW test topology:
IB03 40G port --- 40G Ethernet switch --- IB04 40G port: configured as the RoCE link (VLAN and RoCE enabled), IPs 192.168.72.3/24 -- 192.168.72.4/24
IB03 10G port --- 10G Ethernet switch --- IB04 10G port: configured as a normal TCP/IP Ethernet link (server management interface), IPs 192.168.71.3/24 -- 192.168.71.4/24

Mpi configuration:
MPI Hosts file:
[root_at_bb-nsi-ib04 pt2pt]# cat hosts
ib03 slots=1
ib04 slots=1
DNS hosts
[root_at_bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.71.3 ib03
192.168.71.4 ib04
[root_at_bb-nsi-ib04 pt2pt]#
This configuration creates two nodes for the MPI latency evaluation.
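A quick way to confirm the hostfile really places one rank on each node is to launch a trivial command first, e.g.:

```shell
# Should print two different hostnames (ib03 and ib04)
mpirun --hostfile hosts -np 2 --map-by node hostname
```

If both lines show the same host, the two ranks are local and osu_latency measures the shared-memory path, which would explain an unexpectedly low latency.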

Benchmark:
osu-micro-benchmarks-4.3

Results:

a. Traffic directed over the 10G TCP/IP port using the following /etc/hosts file:

root_at_bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.71.3 ib03
192.168.71.4 ib04
The average osu_latency result is 4.5 us; see the log below:
[root_at_bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 osu_latency
# OSU MPI Latency Test v4.3
# Size Latency (us)
0 4.56
1 4.90
2 4.90
4 4.60
8 4.71
16 4.72
32 5.40
64 4.77
128 6.74
256 7.01
512 7.14
1024 7.63
2048 8.22
4096 10.39
8192 14.26
16384 20.80
32768 31.97
65536 37.75
131072 47.28
262144 80.40
524288 137.65
1048576 250.17
2097152 484.71
4194304 946.01


b. Traffic directed over the RoCE link using the following /etc/hosts and mpirun --mca btl openib,self,sm …
[root_at_bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.72.3 ib03
192.168.72.4 ib04
Result:
[root_at_bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
# OSU MPI Latency Test v4.3
# Size Latency (us)
0 4.83
1 5.17
2 5.12
4 5.25
8 5.38
16 5.40
32 5.19
64 5.04
128 6.74
256 7.04
512 7.34
1024 7.91
2048 8.17
4096 10.39
8192 14.22
16384 22.05
32768 31.68
65536 37.57
131072 48.25
262144 79.98
524288 137.66
1048576 251.38
2097152 485.66
4194304 947.81
[root_at_bb-nsi-ib04 pt2pt]#

Question:

1. Why do both cases show a similar latency of ~5 us? That seems too small to believe. In our test environment it takes more than 50 us to handle a TCP SYN and return the SYN-ACK, and an x86 server takes more than 20 us on average to do IP forwarding (measured with a professional hardware tester), so is this latency reasonable?

2. Normally, a switch introduces more than 1.5 us of switching time! Using Accelio, an open-source RDMA library released by Mellanox, a simple ping-pong test takes at least 4 us of round-trip latency. So the 5 us MPI latency above (for both TCP/IP and RoCE) is rather unbelievable…

3. The fact that the TCP/IP transport and the RoCE RDMA transport show the same latency is puzzling.


Before I dig deeper into what happens inside the MPI benchmark, can you offer some suggestions? Does the mpirun command work correctly here?
There must be some mistake in this test; please correct me.

E.g., TCP SYN & SYN-ACK latency:
[attached screenshot: image001.png, packet capture of the TCP handshake latency]

Thanks
-Yanfei

_______________________________________________
devel mailing list
devel_at_[hidden]<mailto:devel_at_[hidden]>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14400.php



