On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca
<bosilca@eecs.utk.edu> wrote:
On Jun 7, 2011, at 12:14 , Ralph Castain wrote:
> But the epoch is process-unique - i.e., it is the number of times that this specific process has been started, which differs per proc since we don't restart all the procs every time one fails.
Yes, the epoch is per process, but it is distributed among all participants. The difficulty here is to make sure the global view of the processes converges toward a common value of the epoch for each process.
Sounds racy...is it actually necessary to have a global agreement on epoch? Per my other note, perhaps we really need a primer on this epoch concept.
> So if I look at the epoch of the proc sending me a message, I really can't check it against my own value as the comparison is meaningless. All I really can do is check to see if it changed from the last time I heard from that proc, which would tell me that the proc has been restarted in the interim.
I fail to understand your statement here. However, comparing message epochs is critical to ensure correct behavior. It ensures that we do not react to old messages (that were floating in the system for some obscure reason), and that we have the right contact information for a specific peer (at the correct epoch).
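To make the comparison under discussion concrete, here is a minimal sketch in C of a purely local check: each process remembers the last epoch it saw from every peer and classifies an incoming message against that value. Everything in it (`epoch_table_t`, `epoch_check`, `MAX_PEERS`, the status names) is hypothetical and chosen for illustration; it is not taken from the actual code base, and it assumes epochs start at 0 and only ever increase per process.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PEERS 64  /* hypothetical fixed table size, for illustration only */

typedef enum {
    EPOCH_CURRENT,    /* same epoch as the last message from this peer */
    EPOCH_RESTARTED,  /* newer epoch: the peer was restarted in the interim */
    EPOCH_STALE,      /* older epoch: an old message still floating around */
    EPOCH_BAD_PEER    /* peer index outside the table */
} epoch_status_t;

typedef struct {
    uint32_t last_seen[MAX_PEERS];  /* last epoch observed, indexed by peer */
} epoch_table_t;

/* Assumption: every process starts at epoch 0. */
static void epoch_table_init(epoch_table_t *t)
{
    for (size_t i = 0; i < MAX_PEERS; i++) {
        t->last_seen[i] = 0;
    }
}

/* Classify a message from `peer` carrying `msg_epoch`, and record the
 * new epoch if the peer turns out to have been restarted. */
static epoch_status_t epoch_check(epoch_table_t *t, size_t peer,
                                  uint32_t msg_epoch)
{
    if (peer >= MAX_PEERS) {
        return EPOCH_BAD_PEER;
    }
    if (msg_epoch < t->last_seen[peer]) {
        return EPOCH_STALE;      /* caller would drop the message */
    }
    if (msg_epoch > t->last_seen[peer]) {
        t->last_seen[peer] = msg_epoch;
        return EPOCH_RESTARTED;  /* caller would refresh contact info */
    }
    return EPOCH_CURRENT;
}
```

Note that this sketch only compares a message against the receiver's own record for that sender. It says nothing about how, or whether, all processes come to agree on a common epoch value for a given peer, which is the open question in this thread.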
Again, maybe we need a better understanding of what you mean by epoch - clearly, there is a misunderstanding of what you are proposing to do.
I'm leery of anything that requires a general consensus as it creates a lot of race conditions - might work under certain circumstances, but we've been burned by that approach too many times.
george.