On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca <bosilca_at_[hidden]> wrote:
> On Jun 7, 2011, at 12:14 , Ralph Castain wrote:
> > But the epoch is process-unique - i.e., it is the number of times that
> > this specific process has been started, which differs per proc since we
> > don't restart all the procs every time one fails.
> Yes the epoch is per process, but it is distributed among all participants.
> The difficulty here is to make sure the global view of the processes
> converges toward a common value of the epoch for each process.
Sounds racy...is it actually necessary to have a global agreement on epoch?
Per my other note, perhaps we really need a primer on this epoch concept.
> > So if I look at the epoch of the proc sending me a message, I really
> > can't check it against my own value as the comparison is meaningless. All I
> > really can do is check to see if it changed from the last time I heard from
> > that proc, which would tell me that the proc has been restarted in the
> I fail to understand your statement here. However, comparing message epoch
> is critical to ensure the correct behavior. It ensures we do not react to
> old messages (that were floating in the system for some obscure reasons),
> and that we have the right contact information for a specific peer (on the
> correct epoch).
Again, maybe we need a better understanding of what you mean by epoch -
clearly, there is a misunderstanding of what you are proposing to do.
I'm leery of anything that requires a general consensus as it creates a lot
of race conditions - might work under certain circumstances, but we've been
burned by that approach too many times.
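
To make the two positions above concrete, here is a minimal Python sketch. It is
not taken from any actual code base: EpochTracker, classify, and the peer name
"rank3" are hypothetical names chosen for illustration. It assumes an epoch is a
per-process restart counter that only grows, and that each message carries the
sender's epoch. The receiver compares that value only against the last epoch it
recorded for the same peer, never against its own epoch, and treats a lower
value as a stale message.

```python
from dataclasses import dataclass, field


@dataclass
class EpochTracker:
    """Remembers the highest epoch seen so far for each peer (hypothetical helper)."""

    last_seen: dict = field(default_factory=dict)

    def classify(self, peer: str, epoch: int) -> str:
        """Compare a message's epoch with the last one recorded for that peer."""
        known = self.last_seen.get(peer)
        if known is None:
            # First message from this peer: nothing to compare against yet.
            self.last_seen[peer] = epoch
            return "first"
        if epoch < known:
            # Sent by an earlier incarnation of the peer: ignore it.
            return "stale"
        if epoch > known:
            # The peer has been restarted since we last heard from it.
            self.last_seen[peer] = epoch
            return "restarted"
        return "current"


tracker = EpochTracker()
for epoch in (0, 0, 1, 0):
    print(epoch, tracker.classify("rank3", epoch))
```

This only uses state local to the receiver, so it says nothing about whether all
processes agree on a peer's epoch, which is the open question in this thread.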