Open MPI User's Mailing List Archives

From: Galen Shipman (gshipman_at_[hidden])
Date: 2007-02-15 22:38:52

> Good point, this may be affecting overall performance for openib+gm.
> But I didn't see any performance improvement for gm+tcp over just
> using gm (and there's definitely no memory bandwidth limitation
> there).

I wouldn't expect you to see any benefit with GM+TCP. The overhead
costs of TCP are so high that you may end up struggling to keep up
with GM while spending too much time servicing TCP.

> Please correct me if I'm wrong, but it appears that message
> striping was implemented primarily having ethernet interfaces in mind.

This is not correct, striping was designed in a network agnostic way.
It is not optimal but it certainly was not designed primarily for
Ethernet.

> It doesn't seem to have much
> impact when combining more "serious" interconnects. If anybody has
> tried this before and has evidence to the contrary, I'd love to hear
> it.

I guess I'm not sure what defines a "serious" interconnect. If you
mean interconnects with high bandwidth and low latency, then I would
agree: the measured bandwidth will expose a bottleneck elsewhere in
the system, such as memory.

>> So the "solution" for micro-benchmarks is to register the memory and
>> leave it registered. Probably the best way to do this is to use
>> MPI_ALLOC_MEM when allocating memory, this allows us to register the
>> memory with all the available NICs.
> Unfortunately, when it comes to using industry-standard benchmarking,
> it's undesirable to modify the source.

No argument here, just pointing out that the high cost of memory
registration is part of the equation.
You may also try -mca mpi_leave_pinned 1 if you haven't already.
I will be the first to admit however that this is entirely
artificial, but then again, some would argue that so is NetPipe.
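
For anyone who would rather not pass the flag on every mpirun command line, the same parameter can be set in a parameter file. A minimal sketch, assuming the per-user file $HOME/.openmpi/mca-params.conf and the same "name = value" format George uses in the quoted text further down:

```
# $HOME/.openmpi/mca-params.conf
# Leave user buffers registered (pinned) after first use, so repeated
# sends from the same buffer skip the registration cost.
mpi_leave_pinned = 1
```

This should be equivalent to passing --mca mpi_leave_pinned 1 to mpirun. Like the command-line form, it only helps code that reuses the same buffers, which is exactly what a micro-benchmark does.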

>> I would also say that this is a very uncommon mode of operation, our
>> architecture allows it, but certainly isn't optimized for this case.
> I suspect, the issue also may be of purely business nature. The
> developers of BTL modules for advanced interconnects are most likely
> the employees of corresponding companies, which probably do not have
> any vested interest in making their interconnects synergistically
> coexist with the ones of their competitors or with interconnects the
> companies are dropping support for.

This is actually not the case. No interconnect company has (to date)
created a BTL, although many are now contributing, some to a very
large extent.
I can assure you that this is in no way an issue of "competitive
advantage" by intentionally not playing nicely together.
Rather, the real issue is one of time and monkeys, heterogeneous
multi-nic is not currently at the top of the list!

- Galen

> Many thanks,
> Alex.
>> On Feb 12, 2007, at 6:48 PM, Alex Tumanov wrote:
>>> Anyone else who may provide some feedback/comments on this issue?
>>> How typical/widespread is the use of multiple interconnects in the HPC
>>> community? Judging from the feedback I'm getting in this thread, it
>>> appears that this is fairly uncommon?
>>> Thanks for your attention to this thread.
>>> Alex.
>>> On 2/8/07, Alex Tumanov <atumanov_at_[hidden]> wrote:
>>>> Thanks for your insight George.
>>>>> Strange, the latency is supposed to be there too. Anyway, the latency
>>>>> is only used to determine which one is faster, in order to use it for
>>>>> small messages.
>>>> I searched the code base for mca parameter registering and did indeed
>>>> discover that latency setting is possible for tcp and tcp alone:
>>>> ----------------------------------------------------------------------
>>>> [OMPISRCDIR]# grep -r param_register * |egrep -i "latency|bandwidth"
>>>> ompi/mca/btl/openib/btl_openib_component.c: mca_btl_openib_param_register_int("bandwidth", "Approximate maximum bandwidth of interconnect",
>>>> ompi/mca/btl/tcp/btl_tcp_component.c: btl->super.btl_bandwidth = mca_btl_tcp_param_register_int(param, 0);
>>>> ompi/mca/btl/tcp/btl_tcp_component.c: btl->super.btl_latency = mca_btl_tcp_param_register_int(param, 0);
>>>> ompi/mca/btl/gm/btl_gm_component.c: mca_btl_gm_param_register_int("bandwidth", 250);
>>>> ompi/mca/btl/mvapi/btl_mvapi_component.c: mca_btl_mvapi_param_register_int("bandwidth", "Approximate maximum bandwidth of interconnect",
>>>> ----------------------------------------------------------------------
>>>> For all others, btl_latency appears to be set to zero when the btl
>>>> module gets constructed. Would zero latency prevent message
>>>> striping?
>>>> An interesting side-issue that surfaces as a result of this little
>>>> investigation is the inconsistency between the ompi_info output and
>>>> the actual mca param availability for tcp_latency:
>>>> [OMPISRCDIR]# ompi_info --param all all |egrep -i "latency|bandwidth"
>>>> MCA btl: parameter "btl_gm_bandwidth" (current value: "250")
>>>> MCA btl: parameter "btl_mvapi_bandwidth" (current value: "800")
>>>>          Approximate maximum bandwidth of interconnect
>>>> MCA btl: parameter "btl_openib_bandwidth" (current value: "800")
>>>>          Approximate maximum bandwidth of interconnect
>>>> You also mentioned the exclusivity factor. I looked through the code
>>>> for that, and it appears that interconnect btl module developers are
>>>> setting exclusivity to various different integer values. In one place,
>>>> the comment suggests that exclusivity is what gets used to prioritize
>>>> interconnects... So a) I'm not sure what to set exclusivity to, and b)
>>>> it's unclear whether it's latency or exclusivity that determines the
>>>> order. According to btl.h and you, it's the latency; according to the
>>>> following, exclusivity has something to do with it as well:
>>>> btl/mx/btl_mx_component.c :
>>>>     mca_base_param_reg_int( (mca_base_component_t*)&mca_btl_mx_component,
>>>>         "exclusivity",
>>>>         "Priority compared with the others devices (used only when several devices are available",
>>>>         false, false, 50,
>>>>         (int*) &mca_btl_mx_module.super.btl_exclusivity );
>>>> What should exclusivity be set to in order to allow using multiple
>>>> interconnects?
>>>> Finally,
>>>>> For bandwidth, what
>>>>> really matters is the relative ratio. We sum all bandwidths and then
>>>>> we divide by the device bandwidth to find out how much data we should
>>>>> send over each interconnect (that's really close to what happens
>>>>> there).
>>>> That's precisely how I would've done it and makes perfect sense. Since
>>>> it's the relative ratio that matters and not the absolute value, why
>>>> then did my openib+gm test fail to deliver better bandwidth performance
>>>> than just openib? I had bandwidth values set for both of those btls.
>>>> The expected behavior in my case would be to send roughly 1/4
>>>> (250/1050) across gm and 3/4 (800/1050) across openib? My hunch is
>>>> that there's something else preventing message striping other than
>>>> incorrect absolute values for the bandwidth here...
>>>> Thanks a lot for your feedback on this one. It gave me good pointers
>>>> to follow. Please do let me know if you can think of anything else
>>>> that I need to check.
>>>> Sincerely,
>>>> Alex.
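
The proportional split discussed above (250/1050 across gm, 800/1050 across openib) can be sketched in a few lines of C. This is only an illustration of the arithmetic, not the scheduling code Open MPI actually uses. The function name stripe_split and the choice to give the rounding remainder to the first entry are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not Open MPI code): split `total` bytes across `n`
 * interconnects in proportion to their configured bandwidth weights.
 * Returns 0 on success, -1 if there is nothing to weight by.
 */
int stripe_split(const unsigned *bw, size_t n, size_t total, size_t *out)
{
    unsigned long long sum = 0;
    size_t assigned = 0;
    size_t i;

    assert(bw != NULL && out != NULL);

    /* Only the ratio bw[i] / sum matters, not the absolute values. */
    for (i = 0; i < n; i++)
        sum += bw[i];
    if (n == 0 || sum == 0)
        return -1;

    for (i = 0; i < n; i++) {
        /* share_i = total * bw_i / sum, rounded down */
        out[i] = (size_t)((unsigned long long)total * bw[i] / sum);
        assigned += out[i];
    }

    /* Assumption: bytes lost to rounding go to the first entry. */
    out[0] += total - assigned;
    return 0;
}
```

Called with bw = {800, 250} (the openib and gm values reported by ompi_info in this thread) and a 1 MiB message, the sketch assigns 798916 bytes to openib and 249660 bytes to gm, which is the roughly 3/4 versus 1/4 split Alex expected.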
>>>>>> On 2/8/07, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>>>> In order to get any performance improvement from striping the
>>>>>>> messages over multiple interconnects one has to specify the latency
>>>>>>> and bandwidth for these interconnects, and to make sure that none of
>>>>>>> them asks for exclusivity. I'm usually running over multiple TCP
>>>>>>> interconnects and here is my mca-params.conf file:
>>>>>>> btl_tcp_if_include = eth0,eth1
>>>>>>> btl_tcp_max_rdma_size = 524288
>>>>>>> btl_tcp_latency_eth0 = 47
>>>>>>> btl_tcp_bandwidth_eth0 = 587
>>>>>>> btl_tcp_latency_eth1 = 51
>>>>>>> btl_tcp_bandwidth_eth1 = 233
>>>>>>> Something similar has to be done for openib and gm, in order to allow
>>>>>>> us to stripe the messages correctly.
>>>>>>> Thanks,
>>>>>>> george.
>>>>>>> On Feb 8, 2007, at 12:02 PM, Alex Tumanov wrote:
>>>>>>>> Hello Jeff. Thanks for pointing out NetPipe to me. I've played around
>>>>>>>> with it a little in the hope of seeing clear evidence/effect of
>>>>>>>> message striping in OpenMPI. Unfortunately, what I saw is that the
>>>>>>>> result of running NPmpi over several interconnects is identical to
>>>>>>>> running it over the single fastest one :-( That was not the expected
>>>>>>>> behavior, and I'm hoping that I'm doing something wrong. I'm using
>>>>>>>> NetPIPE_3.6.2 over OMPI 1.1.4. NetPipe was compiled by making sure
>>>>>>>> Open MPI's mpicc can be found and simply running 'make mpi' under the
>>>>>>>> NetPIPE_3.6.2 directory.
>>>>>>>> I experimented with 3 interconnects: openib, gm, and gig-e.
>>>>>>>> Specifically, I found that the times (and, correspondingly, bandwidth)
>>>>>>>> reported for openib+gm are pretty much identical to the times reported
>>>>>>>> for just openib. Here are the commands I used to initiate the
>>>>>>>> benchmark:
>>>>>>>> # mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl openib,gm,self ~/NPmpi > ~/testdir/ompi/netpipe/ompi_netpipe_openib+gm.log 2>&1
>>>>>>>> # mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl openib,self ~/NPmpi > ompi_netpipe_openib.log 2>&1
>>>>>>>> Similarly, for tcp+gm the reported times were identical to just
>>>>>>>> running the benchmark over gm alone. The commands were:
>>>>>>>> # mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl tcp,gm,self --mca btl_tcp_if_exclude lo,ib0,ib1 ~/NPmpi
>>>>>>>> # mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl gm,self ~/NPmpi
>>>>>>>> Orthogonally, I've also observed that trying to use any combination of
>>>>>>>> interconnects that includes openib (except using it exclusively) fails
>>>>>>>> as soon as the benchmark reaches trials with 1.5MB message sizes. In
>>>>>>>> fact the CPU load remained at 100% on the headnode, but no further
>>>>>>>> output was sent to the log file or the screen (see the tails below).
>>>>>>>> This behavior is fairly consistent and may be of interest to the Open
>>>>>>>> MPI development community. If anybody has tried using openib in
>>>>>>>> combination with other interconnects please let me know what issues
>>>>>>>> you've encountered and what tips and tricks you could share in this
>>>>>>>> regard.
>>>>>>>> Many thanks. Keep up the good work!
>>>>>>>> Sincerely,
>>>>>>>> Alex.