Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2007-02-08 14:00:14


On Feb 8, 2007, at 1:12 PM, Alex Tumanov wrote:

> George,
>
> Looks like I have some values already set for openib and gm bandwidth:
> # ompi_info --param all all |grep -i band
>     MCA btl: parameter "btl_gm_bandwidth" (current value: "250")
>     MCA btl: parameter "btl_mvapi_bandwidth" (current value: "800")
>              Approximate maximum bandwidth of interconnect
>     MCA btl: parameter "btl_openib_bandwidth" (current value: "800")
>              Approximate maximum bandwidth of interconnect

These are the default values, which are wrong most of the time. The
bandwidth is in Mbps, so the default value for GM is set to something
ridiculously low. What really matters is the ratio between the
bandwidths of the components you plan to use.
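
As an example, if your measurements show the cards peaking well above
the defaults, you can override the parameters shown in your ompi_info
output, either on the mpirun command line or in mca-params.conf (the
numbers below are purely illustrative, not measurements):

btl_openib_bandwidth = 7500
btl_gm_bandwidth = 1900

Only the ratio between the two values matters for how a large message
gets split; the absolute numbers are just a convenient way to express it.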

> whereas ompi_info reports no available parameters dealing with
> latency:
> # ompi_info --param all all |grep -i laten
> <no output>

Strange, the latency is supposed to be there too. Anyway, the latency
is only used to determine which interconnect is the fastest, in order
to use that one for small messages. (With the TCP configuration quoted
below, for example, eth0 has the smaller latency of the two interfaces,
so small messages would go over eth0.)

> Also, I'm not entirely sure what value to set the latency to,
> especially for tcp. It depends on so many factors and varies. Why does
> the latency value have an effect on message striping? I can see how
> knowing the bandwidth limitations of available interconnects would
> allow you to proportionally divide up the message among them, but
> latency? Especially for large message sizes the time should be
> dominated by the bandwidth limitations.

As I said, the bandwidth is in Mbps and the latency is in
microseconds. But what really matters for latency is the absolute
value, as we order the devices starting from the smallest latency. For
bandwidth, what really matters is the relative ratio: we sum all the
bandwidths and then divide each device's bandwidth by that total to
find out how much data we should send over each interconnect (that's
really close to what happens there).
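
To make that concrete with the default values from your ompi_info
output: btl_openib_bandwidth = 800 and btl_gm_bandwidth = 250 sum to
1050, so openib would carry roughly 800/1050, about 76%, of each large
message and GM the remaining 24% or so. Whether 800:250 actually
reflects your hardware is exactly what you want to verify.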

How do I compute my latencies and bandwidths? Well, I run NetPIPE over
one interconnect at a time and take the latency from the message size
of 1 and the bandwidth from the message sizes around 1MB. That should
give you quite accurate values to start from. Afterwards you can tweak
them, in order to adjust the ratio based on the latencies.
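
In your setup that measurement could look something like the following
sketch (the host names, $MPIHOME and the NPmpi path are taken from your
own commands quoted below; eth0 stands in for whatever your gig-e
interface is called; the bandwidth parameter names are the ones your
ompi_info output already shows):

# run NetPIPE over each interconnect in isolation
mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl openib,self ~/NPmpi
mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl gm,self ~/NPmpi
mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl tcp,self --mca btl_tcp_if_include eth0 ~/NPmpi

# then record the measured peaks in mca-params.conf, e.g.
btl_openib_bandwidth = <Mbps measured near the 1MB message sizes>
btl_gm_bandwidth = <Mbps measured near the 1MB message sizes>

If your build also exposes per-BTL latency parameters (like the
btl_tcp_latency_eth0/eth1 ones in my config), set those from the
1-byte numbers as well.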

> Finally, what are the units for bandwidth and latency mca parameters
> and how did you arrive at the values you set in your params file? Is
> there a description of the message striping algorithm somewhere (other
> than code :) )?

Read the previous paragraph. Unfortunately, except for the code and my
little explanation here, there is no documentation about how we do the
striping ...

   Thanks,
     george.

>
> Thanks,
> Alex.
>
>
> On 2/8/07, George Bosilca <bosilca_at_[hidden]> wrote:
>> In order to get any performance improvement from striping the
>> messages over multiple interconnects one has to specify the latency
>> and bandwidth for these interconnects, and to make sure that none of
>> them asks for exclusivity. I'm usually running over multiple TCP
>> interconnects and here is my mca-params.conf file:
>> btl_tcp_if_include = eth0,eth1
>> btl_tcp_max_rdma_size = 524288
>>
>> btl_tcp_latency_eth0 = 47
>> btl_tcp_bandwidth_eth0 = 587
>>
>> btl_tcp_latency_eth1 = 51
>> btl_tcp_bandwidth_eth1 = 233
>>
>> Something similar has to be done for openib and gm, in order to allow
>> us to stripe the messages correctly.
>>
>> Thanks,
>> george.
>>
>> On Feb 8, 2007, at 12:02 PM, Alex Tumanov wrote:
>>
>>> Hello Jeff. Thanks for pointing out NetPIPE to me. I've played around
>>> with it a little in the hope of seeing clear evidence of message
>>> striping in Open MPI. Unfortunately, what I saw is that the result of
>>> running NPmpi over several interconnects is identical to running it
>>> over the single fastest one :-( That was not the expected behavior,
>>> and I'm hoping that I'm doing something wrong. I'm using NetPIPE_3.6.2
>>> over OMPI 1.1.4. NetPIPE was compiled by making sure Open MPI's mpicc
>>> can be found and simply running 'make mpi' under the NetPIPE_3.6.2
>>> directory.
>>>
>>> I experimented with 3 interconnects: openib, gm, and gig-e.
>>> Specifically, I found that the times (and, correspondingly, the
>>> bandwidth) reported for openib+gm are pretty much identical to the
>>> times reported for just openib. Here are the commands I used to
>>> initiate the benchmark:
>>>
>>> # mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl openib,gm,self ~/NPmpi > ~/testdir/ompi/netpipe/ompi_netpipe_openib+gm.log 2>&1
>>> # mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl openib,self ~/NPmpi > ompi_netpipe_openib.log 2>&1
>>>
>>> Similarly, for tcp+gm the reported times were identical to just
>>> running the benchmark over gm alone. The commands were:
>>> # mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl tcp,gm,self --mca btl_tcp_if_exclude lo,ib0,ib1 ~/NPmpi
>>> # mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl gm,self ~/NPmpi
>>>
>>> Orthogonally, I've also observed that trying to use any combination
>>> of interconnects that includes openib (except using it exclusively)
>>> fails as soon as the benchmark reaches trials with 1.5MB message
>>> sizes. In fact, the CPU load remains at 100% on the head node, but no
>>> further output is sent to the log file or the screen (see the tails
>>> below). This behavior is fairly consistent and may be of interest to
>>> the Open MPI development community. If anybody has tried using openib
>>> in combination with other interconnects, please let me know what
>>> issues you've encountered and what tips and tricks you could share in
>>> this regard.
>>>
>>> Many thanks. Keep up the good work!
>>>
>>> Sincerely,
>>> Alex.
>>>
>>> Tails (the log file name reflects the combination of interconnects,
>>> in command-line order):
>>> # tail ompi_netpipe_gm+openib.log
>>> 101: 786432 bytes 38 times --> 3582.46 Mbps in 1674.83 usec
>>> 102: 786435 bytes 39 times --> 3474.50 Mbps in 1726.87 usec
>>> 103: 1048573 bytes 19 times --> 3592.47 Mbps in 2226.87 usec
>>> 104: 1048576 bytes 22 times --> 3515.15 Mbps in 2275.86 usec
>>> 105: 1048579 bytes 21 times --> 3480.22 Mbps in 2298.71 usec
>>> 106: 1572861 bytes 21 times --> 4174.76 Mbps in 2874.41 usec
>>> 107: 1572864 bytes 23 times --> mpirun: killing job...
>>>
>>> # tail ompi_netpipe_openib+gm.log
>>> 100: 786429 bytes 45 times --> 3477.98 Mbps in 1725.13 usec
>>> 101: 786432 bytes 38 times --> 3578.94 Mbps in 1676.47 usec
>>> 102: 786435 bytes 39 times --> 3480.66 Mbps in 1723.82 usec
>>> 103: 1048573 bytes 19 times --> 3594.26 Mbps in 2225.76 usec
>>> 104: 1048576 bytes 22 times --> 3517.46 Mbps in 2274.37 usec
>>> 105: 1048579 bytes 21 times --> 3482.13 Mbps in 2297.45 usec
>>> 106: 1572861 bytes 21 times --> mpirun: killing job...
>>>
>>> # tail ompi_netpipe_openib+tcp+gm.log
>>> 100: 786429 bytes 45 times --> 3481.45 Mbps in 1723.41 usec
>>> 101: 786432 bytes 38 times --> 3575.83 Mbps in 1677.93 usec
>>> 102: 786435 bytes 39 times --> 3479.05 Mbps in 1724.61 usec
>>> 103: 1048573 bytes 19 times --> 3589.68 Mbps in 2228.61 usec
>>> 104: 1048576 bytes 22 times --> 3517.96 Mbps in 2274.05 usec
>>> 105: 1048579 bytes 21 times --> 3484.12 Mbps in 2296.14 usec
>>> 106: 1572861 bytes 21 times --> mpirun: killing job...
>>>
>>> # tail -5 ompi_netpipe_openib.log
>>> 119: 6291456 bytes 5 times --> 4036.63 Mbps in 11891.10 usec
>>> 120: 6291459 bytes 5 times --> 4005.81 Mbps in 11982.61 usec
>>> 121: 8388605 bytes 3 times --> 4033.78 Mbps in 15866.00 usec
>>> 122: 8388608 bytes 3 times --> 4025.50 Mbps in 15898.66 usec
>>> 123: 8388611 bytes 3 times --> 4017.58 Mbps in 15929.98 usec

"Half of what I say is meaningless; but I say it so that the other
half may reach you"
                                   Kahlil Gibran