Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Device failover in dr pml (fwd)
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-16 09:12:30


Sounds fine, though note that we don't want ob1 itself to do this as
it inevitably adds overhead that translates into latency. Instead, we
want that functionality to be in a separate component for those people
who want to use it.

We did talk on a telecon earlier this week about the need to refactor
the PML so that all these various PML components don't have to keep
tracking what is done in ob1 - bit of a pain. Nothing has been done
yet, but hopefully at some point we'll address this issue.

On Apr 16, 2009, at 2:33 AM, Sylvain Jeaugey wrote:

> Well, if reviving means making device failover work, then yes, in a
> way we revived it ;)
>
> We are currently mostly experimenting to figure out how to get
> device failover working. No big fixes for now, which is why we are
> posting here before going further.
>
> From what I understand, Rolf's work seems very close to what we
> want to do, and we'd better work with him on making ob1 able to do
> device failover rather than trying to work on dr.
>
> This sounds good to me: there is no reason why ob1 couldn't
> invalidate a device (e.g. if we send a signal). However, replaying
> lost sends still seems to be needed if we want to be able to handle
> a network failure. Clearly, ob1 doesn't support this yet.
>
> Thanks a lot for your advice, we will continue to think about it
> and come back to you.
>
> Sylvain
>
> On Wed, 15 Apr 2009, Ralph Castain wrote:
>
>> Last anyone knew, the dr pml was dead - way out of date and
>> unmaintained. I gather that you folks have revived it and sync'd it
>> back up to the current ob1 module?
>>
>> I don't think anyone really cares what is done with the dr module
>> itself. There are others working on failover modules, and there is
>> a new separate checksum module that just aborts if it detects an
>> error.
>>
>> So I would guess you are welcome to do whatever you want to it. I
>> suspect the others working on failover may speak up here too.
>>
>> On Apr 15, 2009, at 6:47 AM, Mouhamed Gueye wrote:
>>
>>> Hi all,
>>>
>>> We are currently working on the dr pml component and specifically
>>> on device failover. The failover mechanism seems to work fine
>>> across different components, but if we want to do it across
>>> different modules of the same component - say 2 InfiniBand rails -
>>> the code seems to be broken.
>>>
>>> Actually, when the first openib module fails, the progress
>>> function of the openib component is deregistered and progress is
>>> no longer made on any openib module. We managed to circumvent this
>>> by keeping the progress function registered as long as an openib
>>> module might be using it, and it seems to work fine.
>>>
>>> So I have a few questions:
>>>
>>> 1. Is there already work in progress to support multi-module
>>> failover on the dr pml?
>>>
>>> 2. Do you think this is the correct way to handle multi-module
>>> failover?
>>>
>>> Also, the fact that the "dr" component includes many things like
>>> checksumming bothers us a bit (we'd like to keep the performance
>>> overhead as low as possible when including device failover). So,
>>>
>>> 3. Do you plan to fork this component to a "df (device failover)
>>> only" one? (we would like to, but maybe this is not the right way
>>> to go)
>>>
>>> That's all for now,
>>> Mouhamed
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
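
The workaround described in the original message (keep the openib
component's progress function for as long as an openib module might
still be using it) amounts to tying registration to the set of live
modules rather than to any single module. Below is a minimal sketch of
that logic, written in Python as a language-neutral model and not as
Open MPI code. ProgressEngine, Component, and all method names are
hypothetical; they are assumptions made for illustration and do not
correspond to the actual openib BTL or progress-engine API.

```python
class ProgressEngine:
    """Hypothetical stand-in for a library-wide progress loop."""

    def __init__(self):
        self.callbacks = []

    def register(self, fn):
        self.callbacks.append(fn)

    def unregister(self, fn):
        self.callbacks.remove(fn)

    def poll(self):
        # Run every registered callback once; return total events handled.
        return sum(fn() for fn in self.callbacks)


class Component:
    """Hypothetical component owning several modules (e.g. one per rail)."""

    def __init__(self, engine):
        self.engine = engine
        self.live_modules = set()

    def progress(self):
        # Stand-in for real polling: one "event" per module still alive.
        return len(self.live_modules)

    def add_module(self, name):
        # The first live module registers the shared progress function.
        if not self.live_modules:
            self.engine.register(self.progress)
        self.live_modules.add(name)

    def fail_module(self, name):
        # Ignore modules that are unknown or already failed.
        if name not in self.live_modules:
            return
        self.live_modules.remove(name)
        # Only the last live module takes the progress function with it.
        if not self.live_modules:
            self.engine.unregister(self.progress)


engine = ProgressEngine()
openib = Component(engine)
openib.add_module("rail0")
openib.add_module("rail1")

openib.fail_module("rail0")
print(engine.poll())  # 1: rail1 is still being progressed

openib.fail_module("rail1")
print(engine.poll())  # 0: nothing left to progress
```

The point of the sketch is only the ordering: a failing module removes
itself from the live set first, and the shared callback is dropped
only when that set becomes empty. Whether this is the right place for
such bookkeeping in the real code is exactly the open question raised
in the thread.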