Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Device failover in dr pml
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-15 08:57:17


Last anyone knew, the dr pml was dead - way out of date and
unmaintained. I gather that you folks have revived it and sync'd it
back up to the current ob1 module?

I don't think anyone really cares what is done with the dr module
itself. There are others working on failover modules, and there is a
new separate checksum module that just aborts if it detects an error.

So I would guess you are welcome to do whatever you want to it. I
suspect the others working on failover may speak up here too.

On Apr 15, 2009, at 6:47 AM, Mouhamed Gueye wrote:

> Hi all,
>
> We are currently working on the dr pml component and specifically on
> device failover. The failover mechanism seems to work fine on
> different components, but if we want to do it on different modules
> of the same component - say 2 Infiniband rails - the code seems to
> be broken.
>
> Actually, when the first openib module fails, the progress function
> of the openib component is deregistered and progress is no longer
> made on any openib module. We managed to circumvent this by keeping
> the progress function as long as an openib module might be using it
> and it seems to work fine.
>
> So I have a few questions:
>
> 1. Is there already work in progress to support multi-module
> failover on the dr pml?
> 2. Do you think this is the correct way to handle multi-module
> failover?
>
> Also, the fact that the "dr" component includes many things like
> checksumming bothers us a bit (we'd like to lower performance
> overhead as far as possible when including device failover). So,
>
> 3. Do you plan to fork this component to a "df (device failover)
> only" one? (we would like to, but maybe this is not the right way
> to go)
>
> That's all for now,
> Mouhamed
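
The workaround described in the quoted message (keep the component's progress function registered for as long as any of its modules might still use it) can be pictured as a per-component count of live modules. The following is only a minimal sketch of that idea, not the actual dr pml or openib code: every name in it (progress_register, progress_unregister, struct component, module_start, module_fail, demo_progress) is a hypothetical stand-in, and the small registry is a toy replacement for whatever progress engine the real code uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins throughout: none of these names are Open MPI APIs. */
typedef int (*progress_fn_t)(void);

#define MAX_PROGRESS_FNS 8

/* Toy progress engine: a flat table of registered callbacks. */
static progress_fn_t registered[MAX_PROGRESS_FNS];
static size_t num_registered = 0;

int progress_register(progress_fn_t fn)
{
    if (num_registered == MAX_PROGRESS_FNS) {
        return -1;
    }
    registered[num_registered++] = fn;
    return 0;
}

int progress_unregister(progress_fn_t fn)
{
    for (size_t i = 0; i < num_registered; i++) {
        if (registered[i] == fn) {
            /* Fill the hole with the last entry; order does not matter here. */
            registered[i] = registered[--num_registered];
            return 0;
        }
    }
    return -1;
}

/* One component (one device type) that may own several modules (rails). */
struct component {
    progress_fn_t progress;   /* shared by all modules of the component */
    int active_modules;       /* modules that have not failed yet */
};

/* Register the shared progress function only when the first module starts. */
int module_start(struct component *c)
{
    if (c->active_modules == 0 && progress_register(c->progress) != 0) {
        return -1;
    }
    c->active_modules++;
    return 0;
}

/* Deregister only when the last remaining module of the component fails. */
void module_fail(struct component *c)
{
    assert(c->active_modules > 0);
    if (--c->active_modules == 0) {
        progress_unregister(c->progress);
    }
}

/* Call every registered progress function and sum the events they report. */
int progress_all(void)
{
    int events = 0;
    for (size_t i = 0; i < num_registered; i++) {
        events += registered[i]();
    }
    return events;
}

/* Placeholder progress function that always reports one event. */
int demo_progress(void)
{
    return 1;
}
```

With two modules started on one component, a single call to module_fail leaves demo_progress registered, so progress_all still drives the surviving rail; only the second failure removes the callback. Whether a plain counter is sufficient in the real code depends on how module failure is detected and serialized, which this sketch does not attempt to model.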