Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Resilient ORTE
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-06-07 14:39:36

On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca <bosilca_at_[hidden]> wrote:

> On Jun 7, 2011, at 12:14, Ralph Castain wrote:
>
> > But the epoch is process-unique - i.e., it is the number of times that
> > this specific process has been started, which differs per proc since we
> > don't restart all the procs every time one fails.
>
> Yes the epoch is per process, but it is distributed among all participants.
> The difficulty here is to make sure the global view of the processes
> converges toward a common value of the epoch for each process.

Is it actually necessary to have a global agreement on epoch?
Per my other note, perhaps we really need a primer on this epoch concept.

> > So if I look at the epoch of the proc sending me a message, I really
> > can't check it against my own value as the comparison is meaningless. All I
> > really can do is check to see if it changed from the last time I heard from
> > that proc, which would tell me that the proc has been restarted in the
> > interim.
>
> I fail to understand your statement here. However, comparing message epoch
> is critical to ensure the correct behavior. It ensures we do not react on
> old messages (that were floating in the system for some obscure reasons),
> and that we have the right contact information for a specific peer (on the
> correct epoch).

Again, maybe we need a better understanding of what you mean by epoch -
clearly, there is a misunderstanding of what you are proposing to do.
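
To make the two readings above concrete, here is a minimal sketch of a purely
receiver-local check: each process remembers the last epoch it saw from each
peer, treats a higher epoch as "that peer was restarted", and drops a message
carrying a lower epoch as stale. Everything here (epoch_table_t, epoch_check,
MAX_PEERS, the verdict names) is hypothetical and invented for illustration;
it is not ORTE code and not necessarily what the RFC proposes. It also assumes
an epoch only ever increases for a given process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PEERS 64 /* hypothetical fixed table size for this sketch */

/* Possible outcomes of checking an incoming message's epoch. */
typedef enum {
    MSG_ACCEPT,         /* first contact, or same epoch as last seen */
    MSG_PEER_RESTARTED, /* epoch increased: peer was restarted since last contact */
    MSG_STALE           /* epoch is older than one already seen: drop the message */
} msg_verdict_t;

/* Per-receiver view: the last epoch observed from each peer (hypothetical). */
typedef struct {
    bool     seen[MAX_PEERS];
    uint32_t last_epoch[MAX_PEERS];
} epoch_table_t;

static void epoch_table_init(epoch_table_t *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
}

/*
 * Compare the sender's epoch only against the value previously recorded
 * for that same sender, never against the receiver's own epoch.
 */
static msg_verdict_t epoch_check(epoch_table_t *t, uint32_t peer,
                                 uint32_t msg_epoch)
{
    assert(t != NULL && peer < MAX_PEERS);

    if (!t->seen[peer]) {
        /* First message from this peer: record its epoch as the baseline. */
        t->seen[peer] = true;
        t->last_epoch[peer] = msg_epoch;
        return MSG_ACCEPT;
    }
    if (msg_epoch < t->last_epoch[peer]) {
        /* Sent by an earlier incarnation of the peer. */
        return MSG_STALE;
    }
    if (msg_epoch > t->last_epoch[peer]) {
        /* Peer was restarted in the interim; remember the new epoch. */
        t->last_epoch[peer] = msg_epoch;
        return MSG_PEER_RESTARTED;
    }
    return MSG_ACCEPT;
}
```

Note that this sketch needs no agreement between processes: two receivers may
briefly hold different epochs for the same peer, which is exactly the
convergence question raised above.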

I'm leery of anything that requires a general consensus as it creates a lot
of race conditions - might work under certain circumstances, but we've been
burned by that approach too many times.

> george.