Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Device failover on ob1
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-02 20:22:54


Okay - here's a thought. Why not do what the original message asked?
Check out their changes and look at what they did.

Then we can have the discussion about how intrusive it is. Otherwise,
all we're doing is debating what they -might- have done, or what
someone thinks they -should- have done, etc.

Look at it first, and see how big or small a change is involved.
That's all they asked us to do - certainly seemed a reasonable request.

Just my $0.002

On Aug 2, 2009, at 4:49 PM, Graham, Richard L. wrote:

> The point here is very different, and is not being made because of
> objections to fail-over support. Previous work took precisely this sort
> of approach, and in that particular case the desire to support
> reliability, but be able to compile out this support, still had a
> negative performance impact.
>
> This is why I am asking about precisely what assumptions are being
> made. If the assumption is that ompi can handle the failover with local
> information only, the impact on ompi is minimal, and the likelihood of
> needing to make undesirable changes to ob1 small. If ompi needs to deal
> with remote delivery - e.g. a send completed locally, but an ack did not
> arrive: is this because the remote side sent it and the connection
> failure kept it from arriving, or is it because the remote side did not
> send it at all, or maybe did not even get the data in the first place -
> the logic becomes more complex, and one may end up wanting to change the
> way ob1 handles data to accommodate this.... Said another way, there may
> not be as much commonality as was assumed.
>
> Rich
>
>
> On 8/2/09 6:19 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>
> The objections being cited are somewhat unfair - perhaps people do not
> understand the proposal being made? The developers have gone out of
> their way to ensure that all changes are configured out unless you
> specifically select to use that functionality. This has been our
> policy from day one - as long as the changes have zero impact unless
> the user specifically requests that it be used, then no harm is done.
>
> So I personally don't see any objection to bringing it into the code
> base. Latency is not impacted one bit -unless- someone deliberately
> configures the code to use this feature. In that case, they are
> deliberately accepting any impact in order to gain the benefits.
>
> Perhaps a bigger question needs to be addressed - namely, does the ob1
> code need to be refactored?
>
> Having been involved a little in the early discussion with bull when
> we debated over where to put this, I know the primary concern was that
> the code not suffer the same fate as the dr module. We have since run
> into a similar issue with the checksum module, so I know where they
> are coming from.
>
> The problem is that the code base is adjusted to support changes in
> ob1, which is still being debugged. On the order of 95% of the code in
> ob1 is required to be common across all the pml modules, so the rest
> of us have to (a) watch carefully all the commits to see if someone
> touches ob1, and then (b) manually mirror the change in our modules.
>
> This is not a supportable model over the long-term, which is why dr
> has died, and checksum is considering integrating into ob1 using
> configure #if's to avoid impacting non-checksum users. Likewise,
> device failover has been treated similarly here - i.e., configure out
> the added code unless someone wants it.
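
A minimal sketch of the "configure it out" pattern described above. The macro name SKETCH_ENABLE_DEVICE_FAILOVER and the function sketch_send are hypothetical, chosen for illustration only; they are not the names used in the patch or in ob1. The point is only that, when the macro is 0, the preprocessor removes the extra bookkeeping so the default build does not contain it.

```c
#include <assert.h>

/* Hypothetical switch; a configure option would normally define it. */
#ifndef SKETCH_ENABLE_DEVICE_FAILOVER
#define SKETCH_ENABLE_DEVICE_FAILOVER 0
#endif

/* Counts descriptors remembered for possible retransmission. */
static int stored_descriptors = 0;

/* Stand-in for a send path: returns the number of bytes "sent". */
static int sketch_send(int len)
{
    assert(len >= 0);
#if SKETCH_ENABLE_DEVICE_FAILOVER
    /* Only compiled in when failover support was requested. */
    stored_descriptors++;
#endif
    return len;
}
```

The cost is the one noted in the thread: every such block adds an #if to the shared source.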
>
> This -does- lead to messier source code with these #if's in it. If we
> can refactor the ob1 code so the common functionality resides in the
> base, then perhaps we can avoid this problem.
>
> Is it possible?
> Ralph
>
> On Aug 2, 2009, at 3:25 PM, Graham, Richard L. wrote:
>
>>
>>
>>
>> On 8/2/09 12:55 AM, "Brian Barrett" <brbarret_at_[hidden]> wrote:
>>
>> While I agree that performance impact (latency in this case) is
>> important, I disagree that this necessarily belongs somewhere other
>> than ob1. For example, a zero-performance impact solution would be to
>> provide two versions of all the interface functions, one with failover
>> turned on and one with it turned off, and select the appropriate
>> functions at initialization time. There are others, including careful
>> placement of decision logic, which are likely to result in near-zero
>> impact. I'm not attempting to prescribe a solution, but refuting the
>> claim that this can't be in ob1 - I think more data is needed before
>> such a claim is made.
>>
>>>> Just another way to set the function pointers.
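
The "two versions, selected at initialization" idea can be sketched roughly as follows. All names here (send_fn_t, pml_iface, pml_init, send_fast, send_failover) are hypothetical and do not correspond to the real ob1 interface; the sketch only shows that the branch on "failover enabled" is taken once at init, not on every send.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical send signature; not the real ob1 one. */
typedef int (*send_fn_t)(const void *buf, size_t len);

/* Counts how often the failover-only bookkeeping ran. */
static int failover_bookkeeping_calls = 0;

/* Plain version: no failover bookkeeping at all. */
static int send_fast(const void *buf, size_t len)
{
    (void)buf;
    return (int)len;
}

/* Failover version: same result, plus extra bookkeeping. */
static int send_failover(const void *buf, size_t len)
{
    (void)buf;
    failover_bookkeeping_calls++;  /* stand-in for storing the descriptor */
    return (int)len;
}

struct pml_iface {
    send_fn_t send;
};

/* Choose the implementation once, at initialization time. */
static void pml_init(struct pml_iface *pml, bool enable_failover)
{
    assert(pml != NULL);
    pml->send = enable_failover ? send_failover : send_fast;
}
```

Callers always go through pml->send, so the non-failover path never executes a failover check; whether the indirect call itself is free in practice would still need measuring.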
>>
>> Mouhamed - can the openib btl try to re-establish a connection between
>> two peers today (with your ob1 patches, obviously)? Would this allow
>> us to adapt to changing routes due to switch failures (assuming that
>> there are other physical routes around the failed switch, of course)?
>>
>>>> The big question is what are the assumptions that are being made
>>>> for this mode of failure recovery. If the assumption is that
>>>> local completion implies remote delivery, the problem is simple to
>>>> solve. If not, heavier weight protocols need to be used to cover
>>>> the range of ways failure may manifest itself.
>>
>> Rich
>>
>> Thanks,
>>
>> Brian
>>
>> On Aug 1, 2009, at 6:21 PM, Graham, Richard L. wrote:
>>
>>> What is the impact on sm, which is by far the most sensitive to
>>> latency? This really belongs in a place other than ob1. Ob1 is
>>> supposed to provide the lowest latency possible, and other pml's are
>>> supposed to be used for heavier weight protocols.
>>>
>>> On the technical side, how do you distinguish between a lost
>>> acknowledgement and an undelivered message? You really don't want
>>> to try and deliver data into user space twice, as once a receive is
>>> complete, who knows what the user has done with that buffer? A
>>> general treatment needs to be able to handle false negatives, and
>>> attempts to deliver the data more than once.
>>>
>>> How are you detecting missing acknowledgements ? Are you using some
>>> sort of timer ?
>>>
>>> Rich
>>>
>>> On 7/31/09 5:49 AM, "Mouhamed Gueye" <mouhamed.gueye_at_[hidden]> wrote:
>>>
>>> Hi list,
>>>
>>> Here is an update on our work concerning device failover.
>>>
>>> As many of you suggested, we reoriented our work toward ob1 rather
>>> than dr, and we now have a working prototype on top of ob1. The
>>> approach is to store btl descriptors sent to peers and delete them
>>> when we receive proof of delivery. So far, we rely on completion
>>> callback functions, assuming that the message is delivered when the
>>> completion function is called, which is the case for openib. When a
>>> btl module fails, it is removed from the endpoint's btl list and the
>>> next one is used to retransmit stored descriptors. No extra message is
>>> transmitted; the scheme only adds fields to the header. It has been
>>> tested mainly with two IB modules, in both multi-rail (two separate
>>> networks) and multi-path (a single large network) configurations.
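
The store-then-retransmit approach described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: the endpoint layout, the fixed-size arrays, and the names ep_send, ep_complete and ep_btl_failed are all hypothetical, descriptors and btls are reduced to integer ids, and the completion callback is taken as proof of delivery, as the message says is the case for openib.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PENDING 8
#define MAX_BTLS 4

struct endpoint {
    int btls[MAX_BTLS];        /* usable transports, in preference order */
    int nbtls;
    int pending[MAX_PENDING];  /* descriptors sent but not yet completed */
    int npending;
    int wire_sends;            /* transmissions so far, for illustration */
};

/* Send a descriptor on the first btl and remember it until completion. */
static bool ep_send(struct endpoint *ep, int desc)
{
    if (ep->nbtls == 0 || ep->npending == MAX_PENDING)
        return false;
    ep->pending[ep->npending++] = desc;
    ep->wire_sends++;
    return true;
}

/* Completion callback: treated as proof of delivery, so forget desc. */
static void ep_complete(struct endpoint *ep, int desc)
{
    for (int i = 0; i < ep->npending; i++) {
        if (ep->pending[i] == desc) {
            /* Shift the tail down so send order is preserved. */
            for (int j = i + 1; j < ep->npending; j++)
                ep->pending[j - 1] = ep->pending[j];
            ep->npending--;
            return;
        }
    }
}

/* A btl failed: drop it from the list and retransmit everything still
 * pending on the next one. Returns the number of descriptors resent,
 * or -1 if no btl is left. */
static int ep_btl_failed(struct endpoint *ep, int btl)
{
    int kept = 0;
    for (int i = 0; i < ep->nbtls; i++)
        if (ep->btls[i] != btl)
            ep->btls[kept++] = ep->btls[i];
    ep->nbtls = kept;
    if (ep->nbtls == 0)
        return -1;
    ep->wire_sends += ep->npending;  /* resend on ep->btls[0] */
    return ep->npending;
}
```

Note that the pending list must be maintained on every send even when no failure ever occurs, which is one plausible source of the steady-state overhead discussed in this thread.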
>>>
>>> You can grab and test the patch here (applies on top of the trunk) :
>>> http://bitbucket.org/gueyem/ob1-failover/
>>>
>>> To compile with failover support, just pass --enable-device-failover
>>> to configure. You can then run a benchmark, disconnect a port and
>>> see the failover operate.
>>>
>>> A small latency increase (~2%) is introduced by the failover layer
>>> when no failover occurs. To accelerate the failover process on
>>> openib, you can try lowering the btl_openib_ib_timeout openib
>>> parameter, for example to 15 instead of 20 (the default value).
>>>
>>> Mouhamed
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> --
>> Brian Barrett
>> Open MPI developer
>> http://www.open-mpi.org/
>>