Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Multi-rail on openib
From: NiftyOMPI Tom Mitchell (niftyompi_at_[hidden])
Date: 2009-06-08 17:34:18


On 6/8/09, Sylvain Jeaugey <sylvain.jeaugey_at_[hidden]> wrote:
> Hi Tom,
>
> Yes, there is a goal in mind, and definitely not performance: we are
> working on device failover, i.e. when a network adapter or switch fails,
> use the remaining one. We don't intend to improve performance with
> multi-rail (which as you said, will not happen unless you have a DDR card
> with PCI Exp 8x Gen2 and a very nice routing - and money to pay for the
> doubled network :)).

??? Dual rail does double the number of switch ports.
If you want to address switch failure, each rail must connect to
a different switch. If you do not want isolated fabrics,
you must have some additional ports on all switches to
connect the two fabrics, and enough of them to maintain sufficient
bandwidth and connectivity when a switch fails. Thus, you are doubling
the fabric unless I am missing something. Is your second set
of switches so minimally connected that the second tree can
be installed with a small switch count?

What are the odds, when port 1 fails, that port 2 is going to
be live? Cable/connector errors would be the most likely
case where port 2 would be live. In general, if port 1 fails
I would expect port 2 to have issues too.
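
On the bandwidth remark quoted above, a back-of-the-envelope check may
help. The sketch below only compares nominal data rates after 8b/10b
line encoding; it ignores PCIe and IB protocol overhead, so real
throughput will be lower. The function names are made up for this
illustration.

```python
# Nominal data rates in Gbit/s after 8b/10b line encoding.
# PCIe Gen1/Gen2 and InfiniBand SDR/DDR all use 8b/10b encoding.
# Protocol overhead is NOT modelled, so these are upper bounds.

def pcie_data_rate(lanes, gt_per_s):
    """Data rate of one direction of a PCIe link, in Gbit/s."""
    return lanes * gt_per_s * 8 / 10

def ib_4x_data_rate(signal_gbit_per_lane, ports=1):
    """Data rate of `ports` 4x InfiniBand links, in Gbit/s."""
    return 4 * signal_gbit_per_lane * ports * 8 / 10

dual_ddr = ib_4x_data_rate(5.0, ports=2)  # two 4x DDR ports
gen1_x8 = pcie_data_rate(8, 2.5)          # PCIe Gen1 x8
gen2_x8 = pcie_data_rate(8, 5.0)          # PCIe Gen2 x8

print(f"dual 4x DDR ports: {dual_ddr:.0f} Gbit/s")
print(f"PCIe Gen1 x8:      {gen1_x8:.0f} Gbit/s")
print(f"PCIe Gen2 x8:      {gen2_x8:.0f} Gbit/s")
```

By this rough measure a Gen1 x8 slot (16 Gbit/s) can feed only one 4x
DDR port, while Gen2 x8 (32 Gbit/s) just matches two of them before
any protocol overhead is taken out.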

>
> The goal here is to use port 1 of each card as a primary way of
> communication with a fat tree and port 2 as a failover solution with a
> very light network, just to avoid aborting the MPI app or at least reach a
> checkpoint.

Most of the IB protocols used by MPI target a LID. There is no
existing notification path I know of that can replace LID-xyz with
LID-123. The subnet manager might be able to do this, but that raises
security issues.
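
To make the LID problem concrete, here is a toy model, not Open MPI or
verbs code. It assumes the sender learns both LIDs of each peer at
wire-up time, precisely because nothing in the fabric will tell it
later which LID to substitute. The class and method names and the LID
values are all hypothetical.

```python
class Endpoint:
    """Toy sender that addresses each peer by a destination LID."""

    def __init__(self):
        # rank -> {"lids": (primary, failover), "active": index into lids}
        self.peers = {}

    def add_peer(self, rank, primary_lid, failover_lid):
        # Both LIDs must be known up front; there is no later notification.
        self.peers[rank] = {"lids": (primary_lid, failover_lid), "active": 0}

    def dest_lid(self, rank):
        peer = self.peers[rank]
        return peer["lids"][peer["active"]]

    def port_failed(self, rank):
        """Switch the peer to its failover LID and return that LID."""
        peer = self.peers[rank]
        if peer["active"] == 1:
            raise RuntimeError(f"rank {rank}: no remaining port")
        peer["active"] = 1
        return self.dest_lid(rank)


ep = Endpoint()
ep.add_peer(rank=0, primary_lid=10, failover_lid=20)  # made-up LIDs
print(ep.dest_lid(0))     # primary LID while port 1 is healthy
print(ep.port_failed(0))  # failover LID after port 1 is declared dead
```

The model leaves out the hard parts: detecting the failure, and
re-establishing connection state on the second port.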

Interesting problem.....

> Don't worry, another team is working on opensm, so that routing stays
> optimal.

Could be fun.... but I would hope that this will not be an incompatible fork.

> Thanks for your warnings however, it's true that a lot of people see these
> "double port IB cards" as "doubled performance".
>
> Sylvain
>
> On Fri, 5 Jun 2009, Nifty Tom Mitchell wrote:
>
>> On Fri, Jun 05, 2009 at 09:52:39AM -0400, Jeff Squyres wrote:
>>>
>>> See this FAQ entry for a description:
>>>
>>> http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup
>>>
>>> Right now, there's no way to force a particular connection pattern on
>>> the openib btl at run-time. The startup sequence has gotten
>>> sufficiently complicated / muddied over the years that it would be quite
>>> difficult to do so. Pasha is in the middle of revamping parts of the
>>> openib startup (see http://bitbucket.org/pasha/ompi-ofacm/); it *may* be
>>> desirable to fully clean up the full openib btl startup sequence when
>>> he's all finished.
>>>
>>>
>>> On Jun 5, 2009, at 9:48 AM, Mouhamed Gueye wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am working on multi-rail IB and I was wondering how connections are
>>>> established between ports. I have two hosts, each with 2 ports on the
>>>> same IB card, connected to the same switch.
>>>>
>>
>> Is there a goal in mind?
>>
>> In general multi-rail cards run into bandwidth and congestion issues
>> with the host bus. If your card's system side interface cannot support
>> the bandwidth of twin IB links then it is possible that bandwidth would
>> be reduced by the interaction.
>>
>> If the host bus and memory system is fast enough then
>> work with the vendor.
>>
>> In addition to system bandwidth the subnet manager may need to be enhanced
>> to be multi-port card aware. Since IB fabric routes are static it is
>> possible to route or use pairs of links in an identical enough way that
>> there is little bandwidth gain when multiple switches are involved.
>>
>> Your two-host case may be simple enough....to explore
>> and/or generate illuminating or misleading results.
>> It is a good place to start.
>>
>> Start with a look at opensm and the fabric then watch how Open MPI
>> or your applications use the resulting LIDs. If you are using IB directly
>> and not MPI then the list of protocol choices grows dramatically but still
>> centers on LIDs as assigned by the subnet manager (see opensm).
>>
>> How many CPU cores (ranks) are you working with?
>>
>> Do be specific about the IB hardware and associated firmware....
>> there are multiple choices out there and the vendor may be able to
>> help.......
>>
>> --
>> T o m M i t c h e l l
>> Found me a new hat, now what?
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>

-- 
        NiftyOMPI
        T o m   M i t c h e l l