Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Device failover on ob1
From: Mouhamed Gueye (mouhamed.gueye_at_[hidden])
Date: 2009-08-03 09:39:35


Hi list,

I'll try to answer the main concerns raised so far.

We chose to work on ob1 for two main reasons:
- First, we started by trying to fix dr but were quite disappointed by
its performance compared with ob1. We therefore moved our work to ob1,
to provide failover while keeping good performance.
- Second, we wanted to avoid forking ob1 as much as possible, so as to
stay up-to-date with the code base. Besides, the failover layer is so
thin (compared with the code base) that it would not make sense to fork
the base into a new pml.

We were aware, however, that ob1 cannot accept any change with a
non-zero performance impact, which is why the added code is configured
out by default. Our target is long-running jobs that can afford a very
small performance loss but cannot afford to abort after several hours or
days of computation because of a single port failure. The goal of this
prototype is to provide a proof of concept for discussion, as we know
other people are working on this subject.

As stated in the previous mail, the idea is to store every sent btl
descriptor until it is marked as delivered. For that, we rely on
completion callbacks, and the explicit assumption is that a call to the
completion function means the message was delivered to the remote card.
The underlying btl is the one that ensures message delivery. This is
currently the case for the openib btl, but any other btl may be able to
do so. With that assumption, we do not need any pml-level
acknowledgment protocol (no extra messages).
No timer is needed for retransmission, as it is triggered by btl failure.
Today, only the error callback scenario is implemented. We should also
handle the return codes of the btl send method. To deal with message
duplication, the protocol maintains a message id that makes it possible
to track received messages (hence the larger header), so a duplicate of
an already received message is discarded rather than processed again.
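
To make the scheme above concrete, here is a toy Python model of it. It
is only a sketch and not code from the patch: the Btl, Sender and
Receiver classes and all their methods are hypothetical names invented
for the illustration, and a "btl" is reduced to a list of in-flight
messages. It shows descriptors being kept until completion,
retransmission on the next btl after an error callback, and duplicates
being dropped thanks to the message id.

```python
class Btl:
    """Hypothetical stand-in for a btl module (not the Open MPI API)."""

    def __init__(self, name):
        self.name = name
        self.in_flight = []  # (msg_id, payload) handed to the device

    def send(self, msg):
        self.in_flight.append(msg)


class Receiver:
    """Tracks received message ids so duplicates are not processed."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return False  # duplicate: drop it
        self.seen.add(msg_id)
        self.delivered.append(payload)
        return True


class Sender:
    """Stores each sent message until its completion callback fires."""

    def __init__(self, btls):
        self.btls = list(btls)
        self.pending = {}  # msg_id -> payload, kept until completion
        self.next_id = 0

    def send(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = payload
        self._post(msg_id)
        return msg_id

    def _post(self, msg_id):
        if not self.btls:
            raise RuntimeError("no btl left: abort the job")
        self.btls[0].send((msg_id, self.pending[msg_id]))

    def on_completion(self, msg_id):
        # Completion is taken as proof of delivery: forget the message.
        self.pending.pop(msg_id, None)

    def on_error(self, failed_btl):
        # Error callback: drop the failed btl, resend what is still stored.
        self.btls.remove(failed_btl)
        for msg_id in sorted(self.pending):
            self._post(msg_id)


def demo():
    ib0, ib1 = Btl("ib0"), Btl("ib1")
    sender, receiver = Sender([ib0, ib1]), Receiver()

    first = sender.send("a")
    receiver.receive(*ib0.in_flight.pop(0))
    sender.on_completion(first)

    second = sender.send("b")
    receiver.receive(*ib0.in_flight.pop(0))  # data arrived ...
    sender.on_error(ib0)  # ... but the port failed before completion

    duplicate_accepted = receiver.receive(*ib1.in_flight.pop(0))
    sender.on_completion(second)
    return receiver.delivered, duplicate_accepted, sender.pending
```

In demo(), message "b" reaches the receiver but the port fails before
the completion callback, so the sender retransmits it on the second btl
and the receiver discards the second copy based on its id. When the
btl list is empty, the model raises an error, which stands for aborting
the job.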

Concerning the openib btl, on a multi-port system, the connection scheme
is supposed to be, for example, (host 1-port 0) <==> (host 2-port 0) and
(host 1-port 1) <==> (host 2-port 1). This association is set up at btl
endpoint initialization, but when the connection is established at the
first send attempt, the port association information is not processed.
This results in a crossed connection scheme: (host 1-port 0) <==>
(host 2-port 1) and (host 1-port 1) <==> (host 2-port 0). So, instead of
having two separate rings or paths, we have one big ring that does not
allow failover. We had to fix this to enable failover in both multi-path
(same network) and multi-rail (two separate networks) with openib.
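
One way to picture the difference between the two schemes is the toy
Python model below. It is not openib code: pair_ports and
surviving_connections are hypothetical helpers, and the model assumes
(purely for illustration) that port i of every host is attached to
rail i and that a rail failure disables every port attached to it.

```python
def pair_ports(num_ports, crossed=False):
    """Return (local_port, remote_port) pairs between two hosts."""
    if crossed:
        # Crossed scheme: local port p is wired to remote port p + 1.
        return [(p, (p + 1) % num_ports) for p in range(num_ports)]
    # Intended scheme: local port p is wired to remote port p.
    return [(p, p) for p in range(num_ports)]


def surviving_connections(pairs, failed_rail):
    """Connections that use no port on the failed rail.

    Assumption: port i of each host is attached to rail i.
    """
    return [(l, r) for (l, r) in pairs if failed_rail not in (l, r)]


straight = pair_ports(2)              # [(0, 0), (1, 1)]
crossed = pair_ports(2, crossed=True)  # [(0, 1), (1, 0)]
```

Under these assumptions, losing rail 0 leaves the (1, 1) connection
usable with the intended pairing, whereas with the crossed pairing both
connections depend on a port of the failed rail and nothing is left to
fail over to.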

Brian, so far we are only able to switch from a failing btl to a
healthy one. When there is no btl left, we abort the job. The next step
is to be able to re-establish the connection when the network is back.

Mouhamed
Graham, Richard L. wrote:
> What is the impact on sm, which is by far the most sensitive to latency? This really belongs in a place other than ob1. Ob1 is supposed to provide the lowest latency possible, and other pmls are supposed to be used for heavier weight protocols.
>
> On the technical side, how do you distinguish between a lost acknowledgement and an undelivered message? You really don't want to try and deliver data into user space twice, as once a receive is complete, who knows what the user has done with that buffer? A general treatment needs to be able to handle false negatives, and attempts to deliver the data more than once.
>
> How are you detecting missing acknowledgements? Are you using some sort of timer?
>
> Rich
>
> On 7/31/09 5:49 AM, "Mouhamed Gueye" <mouhamed.gueye_at_[hidden]> wrote:
>
> Hi list,
>
> Here is an update on our work concerning device failover.
>
> As many of you suggested, we moved our work to ob1 rather than dr,
> and we now have a working prototype on top of ob1. The approach is to
> store btl descriptors sent to peers and delete them when we receive
> proof of delivery. So far, we rely on completion callback functions,
> assuming that the message is delivered when the completion function is
> called, which is the case for openib. When a btl module fails, it is
> removed from the endpoint's btl list and the next one is used to
> retransmit the stored descriptors. No extra message is transmitted; the
> only overhead is additional header fields. It has mainly been tested
> with two IB modules, in both multi-rail (two separate networks) and
> multi-path (a single large network) configurations.
>
> You can grab and test the patch here (applies on top of the trunk):
> http://bitbucket.org/gueyem/ob1-failover/
>
> To compile with failover support, just pass --enable-device-failover
> to configure. You can then run a benchmark, disconnect a port, and
> watch the failover operate.
>
> A small latency increase (~ 2%) is induced by the failover layer when
> no failover occurs. To speed up the failover process on openib, you
> can try lowering the btl_openib_ib_timeout openib parameter, for
> example to 15 instead of 20 (the default value).
>
> Mouhamed
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel