
Open MPI User's Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2006-03-15 22:51:51


On Wed, 15 Mar 2006, Allan Menezes wrote:

> Dear Brian, I have the same setup as Mr. Chakrbarty with 16 nodes,
> Oscar 4.2.1 beta 4 and two Gigabit ethernet cards with two 16 and 24
> port switches one smart and the other managed. I use dhcp to get the IP
> addresses for one eth card (the IP addresses of these range from
> 192.168.1.1 ... 16) and use static IP addresses for the other NIC of
> 192.168.5.1 ... 16. The MTU of the first is 9000 for both the NICs and
> switch. For the second the MTU is 1500 for both the switch and the NIC
> cards as the switch cannot go beyond an MTU of 1500. Using the
> -mca btl tcp switch with the 192.168.1.1 .. 16 NICs included and the
> 192.168.5.1 ... 16 excluded by switches -mca btl_tcp_if_include
> eth1(MTU=9000) and -mca btl_tcp_if_exclude eth0 (MTU=1500) I get with
> HPL a performance of approx 28.3GigaFlops with both Open Mpi and Mpich2.
> But since, as you say above, if you include both gigabit cards with the
> switch -mca btl_tcp_if_include eth0,eth1 using Open MPI 1.1 (beta) or
> 1.01 the performance should increase, for the same N and NB in HPL I get
> a slight performance decrease instead of an increase, about 0.5 to 1
> gigaflop less. The hostfile is simply a1, a2 ... a16 using Oscar's DNS
> to resolve the domain name. Why is there a performance decrease?

As both network devices are driven by the same BTL (Byte Transfer Layer,
our internal name for a driver), they both get a similar priority. Let me
explain exactly how the fragmenting works. First, for small messages only
one of the devices is used. For messages above a certain size (usually
first fragment + max_frag_size) the rest of the data is split between the 2
devices depending on the device capabilities. What are the device
capabilities? Our algorithm is based on the latency and bandwidth. As it
is difficult to compute them directly from Open MPI, the user should
provide the correct values if the 2 NICs don't have similar performance.
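
To make the splitting rule above concrete, here is a minimal Python sketch.
It is an illustration only, not the Open MPI implementation: the function
name plan_transfer, the fragment sizes, and the choice to send the first
fragment on the lowest-latency device are all assumptions made for the
example.

```python
def plan_transfer(total_bytes, first_frag, max_frag, devices):
    """Return how many bytes each device would carry (illustrative only).

    devices maps a device name to (latency, bandwidth_weight).
    Assumption: small messages, and the first fragment of large ones,
    go to the lowest-latency device; the rest is split by weight.
    """
    fastest = min(devices, key=lambda name: devices[name][0])
    plan = {name: 0 for name in devices}

    # Small message: a single device carries everything.
    if total_bytes <= first_frag + max_frag:
        plan[fastest] = total_bytes
        return plan

    # Large message: first fragment on the lowest-latency device.
    plan[fastest] = first_frag
    rest = total_bytes - first_frag

    # Split the remainder in proportion to the bandwidth weights.
    names = list(devices)
    total_weight = sum(devices[name][1] for name in names)
    assigned = 0
    for name in names[:-1]:
        share = rest * devices[name][1] // total_weight
        plan[name] += share
        assigned += share
    # The last device takes whatever is left, so no byte is lost to rounding.
    plan[names[-1]] += rest - assigned
    return plan


# Hypothetical values: latency 30/40, bandwidth weights 30/70.
example_devices = {"eth0": (30, 30), "eth1": (40, 70)}
small = plan_transfer(1000, 65536, 65536, example_devices)
large = plan_transfer(1_000_000, 65536, 65536, example_devices)
```

With these made-up numbers the small message stays entirely on eth0, while
the large one puts most of its bytes on eth1, the device with the larger
bandwidth weight.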

Clearly, for best latency the faster of the 2 NICs should be used.
Therefore, you should tell Open MPI which one is the fastest. There is a
parameter for that called btl_tcp_latency_%device, where %device is the
name of your device. Similarly, you should indicate the bandwidth of each
NIC so that Open MPI can correctly split the messages across all the NICs
(the parameter name is btl_tcp_bandwidth_%device).

Now let's take an example: you have 2 devices, eth0 and eth1. First of all,
you have to measure the latency and bandwidth of each of them (using
NetPIPE). Once you have these 4 values, add them to your
$(HOME)/.openmpi/mca-params.conf file:

btl_tcp_latency_eth0=30
btl_tcp_latency_eth1=40

and

btl_tcp_bandwidth_eth0=30
btl_tcp_bandwidth_eth1=70

Now there is one trick. While the latency is an absolute value, the
bandwidth is relative (to the total bandwidth). Therefore, you have to
compute each network's percentage of the total bandwidth. If, let's say,
eth0 has a bandwidth of 280 Mbps and eth1 has a bandwidth of 580 Mbps, the
correct values for the bandwidth will be:

btl_tcp_bandwidth_eth0=32   [(280*100)/(280+580), rounded down]
btl_tcp_bandwidth_eth1=68   [100-32; (580*100)/(280+580) is about 67.4]
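
The arithmetic above is easy to script. The following Python sketch turns
measured bandwidths into the relative percentages and prints the lines for
mca-params.conf. The helper names bandwidth_shares and mca_lines are made
up for this example, and the latency values are just the placeholder
numbers used earlier; only the btl_tcp_latency_%device and
btl_tcp_bandwidth_%device parameter names come from the text.

```python
def bandwidth_shares(measured):
    """Convert measured bandwidths (any common unit) to integer percentages.

    Every device but the last is rounded down; the last one receives the
    remainder so that the shares always add up to 100.
    """
    names = list(measured)
    total = sum(measured.values())
    shares = {}
    used = 0
    for name in names[:-1]:
        shares[name] = measured[name] * 100 // total
        used += shares[name]
    shares[names[-1]] = 100 - used
    return shares


def mca_lines(latencies, measured):
    """Build the parameter lines for the latency and bandwidth settings."""
    lines = [f"btl_tcp_latency_{dev}={lat}" for dev, lat in latencies.items()]
    for dev, share in bandwidth_shares(measured).items():
        lines.append(f"btl_tcp_bandwidth_{dev}={share}")
    return lines


# Measured values from the example: 280 and 580 (same unit for both).
shares = bandwidth_shares({"eth0": 280, "eth1": 580})
lines = mca_lines({"eth0": 30, "eth1": 40}, {"eth0": 280, "eth1": 580})
print("\n".join(lines))
```

For the 280/580 example this gives 32 for eth0 and 68 for eth1, matching
the values worked out by hand.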

Now, once you have your 2 devices correctly configured, run NetPIPE again
and you will notice that the bandwidth increases. Of course, you have to
specify that you want to use both of them via "--mca btl_tcp_if_include
eth0,eth1".

  george.

"We must accept finite disappointment, but we must never lose infinite
hope."
                                  Martin Luther King