Subject: Re: [OMPI devel] RFC: Resilient ORTE
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-06-07 15:08:45

Ah - thanks! That really helped clarify things. Much appreciated.

Will look at the patch in this light...

On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland <wbland_at_[hidden]> wrote:

> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the value sounds similar, your explanations are
> beginning to sound very different from what we are doing and/or had
> envisioned.
> I'm not sure how you can talk about an epoch being too high or too low,
> unless you are envisioning an overall system where procs try to maintain
> some global notion of the value - which sounds like a race condition begging
> to cause problems.
> When we say epoch we mean a value that is stored locally. When a failure is
> detected, the detector notifies the HNP, which notifies everyone else. Thus
> everyone will _eventually_ receive the notification that the process has
> failed. It may take a while for you to receive the notification, but in the
> meantime you will behave normally. When you do receive the notification that
> the failure occurred, you update your local copy of the epoch.
> This is similar to the definition of the "perfect" failure detector that
> Josh references. It doesn't matter if you don't find out about the failure
> immediately, as long as you find out about it eventually. If you aren't
> actually in the same jobid as the failed process, you might never find out
> about the failure because it does not apply to you.
> Are you then thinking that MPI processes are going to detect failure
> instead of local orteds?? Right now, no MPI process would ever report the
> failure of a peer - the orted detects the failure via SIGCHLD and reports
> it. What mechanism would the MPI procs use, and how would that be more
> reliable than SIGCHLD??
> Definitely not. ORTEDs are the processes that detect and report the
> failures. They can detect the failure of other ORTEDs or of applications.
> Basically anything to which they have a connection.
> So right now the HNP can -never- receive more than one failure report at a
> time for a process. The only issue we've been working on is that there are
> several pathways for reporting that error - e.g., if the orted detects that
> a process has failed and reports it, and then the orted itself fails, we can
> get multiple failure events back at the HNP before we respond to the first
> one. Not the same issue as having MPI procs reporting failures...
> This is where the epoch becomes necessary. When reporting a failure, you
> tell the HNP which process failed by name, including the epoch. Thus the HNP
> will not mark a process as having failed twice (thus incrementing the epoch
> twice and notifying everyone about the failure twice). The HNP might receive
> multiple notifications because more than one ORTED could (and often will)
> detect the failure. It is easier to have the HNP decide what is a failure
> and what is a duplicate than to have the ORTEDs reach consensus on the fact
> that a process has failed. There is much less overhead this way.
> I'm not sure what ORCM does in this respect, but I don't know of anything in
> ORTE that would track this data other than the process state, and that
> doesn't keep track of anything beyond one failure (which admittedly isn't an
> issue until we implement process recovery).
> We aren't having any problems with process recovery and process state -
> without tracking epochs. We only track "incarnations" so that we can pass
> that info down to the apps, which use it to guide their restart.
> Could you clarify why you are having a problem in this regard? It might help
> us better understand your proposed changes.
> I think we're talking about the same thing here. The only difference is
> that I'm not looking at the ORCM code so I don't have the "incarnations".
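
To make the bookkeeping Wesley describes concrete, here is a minimal C sketch of the local epoch update on receipt of a failure notice. All of the names (proc_entry_t, handle_failure_notice) are hypothetical illustrations for this thread, not the actual ORTE structures or API.

    #include <stdint.h>

    /* Hypothetical stand-in for an ORTE process name plus its
     * locally stored epoch; not the real orte_process_name_t. */
    typedef struct {
        uint32_t jobid;   /* job the process belongs to */
        uint32_t vpid;    /* rank within that job */
        uint32_t epoch;   /* local count of failures seen for this name */
    } proc_entry_t;

    /* Invoked when the HNP's failure notification eventually arrives.
     * Until it does, the local view is merely stale, which is fine:
     * correctness only requires that everyone finds out eventually. */
    void handle_failure_notice(proc_entry_t *local, uint32_t failed_epoch)
    {
        if (failed_epoch >= local->epoch) {
            /* Incarnation failed_epoch is dead; a restarted process
             * would come back as failed_epoch + 1. */
            local->epoch = failed_epoch + 1;
        }
        /* Otherwise this is a late or duplicate notice about an
         * incarnation we already know has failed; ignore it. */
    }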
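
The detection path itself - an orted learning of a local child's death through SIGCHLD - could look roughly like the following. report_proc_failure is a hypothetical placeholder for mapping the pid to a process name and forwarding the report to the HNP; a real daemon would defer that work out of the signal handler (e.g. via a self-pipe), since it is not async-signal-safe.

    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* Hypothetical: map pid -> process name + epoch and send the
     * failure report to the HNP. Declared here only for the sketch. */
    extern void report_proc_failure(pid_t pid, int status);

    static void sigchld_handler(int sig)
    {
        int status;
        pid_t pid;
        (void)sig;
        /* Reap every exited child; several may be pending at once. */
        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            report_proc_failure(pid, status);
        }
    }

    void install_child_monitor(void)
    {
        struct sigaction sa;
        sa.sa_handler = sigchld_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;
        sigaction(SIGCHLD, &sa, NULL);
    }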
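
On the HNP side, the deduplication falls out of naming the failed process by epoch: a second report of the same death carries an epoch the HNP has already advanced past. Again a hedged sketch with invented names, reusing proc_entry_t from the first fragment.

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if this is the first report of this failure and
     * the HNP should act on it (advance the epoch, notify all
     * daemons); false if it is a duplicate from another orted. */
    bool hnp_record_failure(proc_entry_t *known, uint32_t reported_epoch)
    {
        if (reported_epoch < known->epoch) {
            return false;                  /* already handled */
        }
        known->epoch = reported_epoch + 1; /* mark incarnation dead */
        /* ...broadcast the failure notice to every daemon here... */
        return true;
    }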