Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Resilient ORTE
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-06-07 18:35:29


Thanks - that helps!

On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland <wbland_at_[hidden]> wrote:

> Definitely we are targeting ORTED failures here. If an ORTED fails, then
> any other ORTEDs connected to it will notice and report the failure. Of
> course, if the failure is in an application, then the ORTED on that node will
> be the only one to detect it.
>
> Also, if an ORTED is lost, all of the applications running underneath it
> are also lost because we have no way to communicate with them anymore.
>
> On Tuesday, June 7, 2011 at 3:14 PM, Ralph Castain wrote:
>
> Quick question: could you please clarify this statement:
>
> ...because more than one ORTED could (and often will) detect the failure.
>
>
> I don't understand how this can be true, except for detecting an ORTED
> failure. Only one orted can detect an MPI process failure, unless you have
> now involved orted's in MPI communications (and I don't believe you did). If
> the HNP directs another orted to restart that proc, and then that
> incarnation fails, then the epoch number -should- increment again, shouldn't
> it?
>
> So are you concerned (re having the HNP mark a proc down multiple times)
> about orted failure detection? In that case, I agree that you can have
> multiple failure detections - we dealt with it differently in orcm, but I
> have no issue with doing it another way. Just helps to know what problem you
> are trying to solve.
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
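
For readers following the thread, the duplicate-detection concern can be illustrated with a small sketch. It is not ORTE code: `proc_state_t`, `report_failure`, `restart_proc`, and the fixed-size table are hypothetical names made up for this example. It also assumes that each failure report carries the epoch the reporting daemon last knew for the process. Under those assumptions, the HNP marks a process down only once per incarnation, even when several ORTEDs report the same failure, and the epoch increments again if the restarted incarnation fails.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 8

/* Hypothetical per-process record kept by the HNP (not an actual ORTE type). */
typedef struct {
    unsigned epoch;  /* incarnation counter, bumped on each accepted failure */
    bool     alive;  /* false between an accepted failure and a restart */
} proc_state_t;

static proc_state_t procs[MAX_PROCS];

/* Reset the table: every process starts alive at epoch 0. */
static void init_procs(void)
{
    for (int i = 0; i < MAX_PROCS; i++) {
        procs[i].epoch = 0;
        procs[i].alive = true;
    }
}

/*
 * Handle a failure report for `proc`, where `seen_epoch` is the epoch the
 * reporting daemon believed the process had (an assumption of this sketch).
 * Returns true if the report is accepted (first report for this incarnation),
 * false if it is a duplicate, stale, or out of range.
 */
static bool report_failure(int proc, unsigned seen_epoch)
{
    if (proc < 0 || proc >= MAX_PROCS) {
        return false;
    }
    proc_state_t *p = &procs[proc];
    if (!p->alive || seen_epoch != p->epoch) {
        return false;  /* already marked down, or report is about an older incarnation */
    }
    p->alive = false;
    p->epoch++;        /* the next incarnation gets a new epoch */
    return true;
}

/* Mark a process as restarted; its epoch was already advanced on failure. */
static void restart_proc(int proc)
{
    assert(proc >= 0 && proc < MAX_PROCS);
    procs[proc].alive = true;
}
```

With this scheme, two daemons reporting process 3 at epoch 0 cause one state change: the first call returns true and the second returns false. After `restart_proc(3)`, a report at epoch 1 is accepted again, while a late report still carrying epoch 0 is ignored.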