Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Device failover in dr pml (fwd)
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-16 09:12:30

Sounds fine, though note that we don't want ob1 itself to do this as
it inevitably adds overhead that translates into latency. Instead, we
want that functionality to be in a separate component for those people
who want to use it.

We did talk on a telecon earlier this week about the need to refactor
the PML so that all these various PML components don't have to keep
tracking what is done in ob1 - bit of a pain. Nothing has been done
yet, but hopefully at some point we'll address this issue.

On Apr 16, 2009, at 2:33 AM, Sylvain Jeaugey wrote:

> Well, if reviving means making device failover work, then yes, in a
> way we revived it ;)
> We are currently mostly running experiments to figure out how to get
> device failover working. No big fixes for now, and that's why we are
> posting here before going further.
> From what I understand, Rolf's work seems very close to what we
> want to do
> and we'd better work with him on making ob1 able to do device
> failover rather than trying to work on dr.
> This sounds good to me: there is no reason why ob1 couldn't
> invalidate a device (e.g. if we send a signal). However, replaying
> lost sends still seems to be needed if we want to be able to handle
> a network failure. Clearly, ob1 doesn't support this yet.
> Thanks a lot for your advice, we will continue to think about it
> and come back to you.
> Sylvain
> On Wed, 15 Apr 2009, Ralph Castain wrote:
>> Last anyone knew, the dr pml was dead - way out of date and
>> unmaintained. I gather that you folks have revived it and sync'd it
>> back up to the current ob1 module?
>> I don't think anyone really cares what is done with the dr module
>> itself. There are others working on failover modules, and there is
>> a new separate checksum module that just aborts if it detects an
>> error.
>> So I would guess you are welcome to do whatever you want to it. I
>> suspect the others working on failover may speak up here too.
>> On Apr 15, 2009, at 6:47 AM, Mouhamed Gueye wrote:
>>> Hi all,
>>> We are currently working on the dr pml component and specifically
>>> on device failover. The failover mechanism seems to work fine on
>>> different components, but if we want to do it on different modules
>>> of the same component - say 2 Infiniband rails - the code seems to
>>> be broken.
>>> Actually, when the first openib module fails, the progress
>>> function of the openib component is deregistered and progress is
>>> no longer made on any openib module. We managed to circumvent this
>>> by keeping the progress function as long as an openib module might
>>> be using it and it seems to work fine.
>>> So I have a few questions:
>>> 1. Is there already work in progress to support multi-module
>>> failover on the dr pml?
>>> 2. Do you think this is the correct way to handle multi-module
>>> failover?
>>> Also, the fact that the "dr" component includes many things like
>>> checksumming bothers us a bit (we'd like to lower performance
>>> overhead as far as possible when including device failover). So,
>>> 3. Do you plan to fork this component to a "df (device failover)
>>> only" one? (we would like to, but maybe this is not the right way
>>> to go)
>>> That's all for now,
>>> Mouhamed
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
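
The workaround Mouhamed describes in the original message (keeping the openib progress function registered as long as any openib module might still be using it) amounts to reference counting the progress callback per component. The following is only a minimal sketch of that idea, not the actual dr or openib code: every name in it (`progress_register`, `progress_unregister`, `struct component`, `component_module_up`, `component_module_failed`, `rail_progress_stub`) is hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; none of these names
 * are Open MPI APIs. */
typedef int (*progress_fn_t)(void);

#define MAX_PROGRESS_FNS 8

static progress_fn_t registered[MAX_PROGRESS_FNS];
static size_t num_registered = 0;

/* Add fn to the progress table; returns 0 on success, -1 if full. */
int progress_register(progress_fn_t fn)
{
    if (num_registered == MAX_PROGRESS_FNS) {
        return -1;
    }
    registered[num_registered++] = fn;
    return 0;
}

/* Remove fn from the progress table; returns 0 on success, -1 if absent. */
int progress_unregister(progress_fn_t fn)
{
    for (size_t i = 0; i < num_registered; i++) {
        if (registered[i] == fn) {
            /* Fill the hole with the last entry and shrink the table. */
            registered[i] = registered[--num_registered];
            return 0;
        }
    }
    return -1;
}

/* Returns 1 if fn is currently in the progress table, 0 otherwise. */
int progress_is_registered(progress_fn_t fn)
{
    for (size_t i = 0; i < num_registered; i++) {
        if (registered[i] == fn) {
            return 1;
        }
    }
    return 0;
}

/* One component (e.g. one network type) shared by several modules (rails). */
struct component {
    progress_fn_t progress;  /* shared progress callback */
    int active_modules;      /* modules that may still need progress */
};

/* A module came up: register the callback only for the first one. */
void component_module_up(struct component *c)
{
    if (c->active_modules++ == 0) {
        progress_register(c->progress);
    }
}

/* A module failed: deregister only when no module is left. */
void component_module_failed(struct component *c)
{
    assert(c->active_modules > 0);
    if (--c->active_modules == 0) {
        progress_unregister(c->progress);
    }
}

/* Placeholder progress callback standing in for a real one. */
int rail_progress_stub(void)
{
    return 0;
}
```

With two rails brought up through `component_module_up`, a single call to `component_module_failed` leaves the callback registered, so the surviving rail keeps making progress; only the second failure removes it. This says nothing about replaying lost sends, which Sylvain notes is still needed to handle a real network failure.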