Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Resilient ORTE
From: George Bosilca (bosilca_at_[hidden])
Date: 2011-06-07 12:37:29

On Jun 7, 2011, at 12:14 , Ralph Castain wrote:

> But the epoch is process-unique - i.e., it is the number of times that this specific process has been started, which differs per proc since we don't restart all the procs every time one fails.

Yes, the epoch is per process, but it is distributed among all participants. The difficulty here is making sure that the participants' views converge toward a common epoch value for each process.

> So if I look at the epoch of the proc sending me a message, I really can't check it against my own value as the comparison is meaningless. All I really can do is check to see if it changed from the last time I heard from that proc, which would tell me that the proc has been restarted in the interim.

I fail to understand your statement here. However, comparing message epochs is critical to ensuring correct behavior. It ensures that we do not react to old messages (ones that were floating in the system for some obscure reason), and that we have the right contact information for a specific peer (at the correct epoch).
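
To make the check concrete, here is a minimal sketch in Python of what such a comparison could look like. It is not the ORTE code; the names `PeerTable`, `accept`, and the contact strings are hypothetical, and it assumes each receiver simply remembers the highest epoch it has seen for each peer.

```python
from dataclasses import dataclass, field


@dataclass
class PeerTable:
    """Hypothetical per-receiver view: highest epoch and contact info per peer."""
    epochs: dict = field(default_factory=dict)    # peer name -> highest epoch seen
    contacts: dict = field(default_factory=dict)  # peer name -> contact info for that epoch

    def accept(self, sender, msg_epoch, contact=None):
        """Return True if the message should be processed, False if it is stale."""
        known = self.epochs.get(sender, 0)
        if msg_epoch < known:
            # Message from an earlier incarnation of the peer: do not react to it.
            return False
        if msg_epoch > known:
            # The peer was restarted since we last heard from it; the contact
            # information we held belongs to the old incarnation, so drop it.
            self.epochs[sender] = msg_epoch
            self.contacts.pop(sender, None)
        if contact is not None:
            self.contacts[sender] = contact
        return True


# Usage: a late message from epoch 1 arrives after the peer restarted at epoch 2.
table = PeerTable()
table.accept("proc-1", 1, "tcp://node-a")
table.accept("proc-1", 2, "tcp://node-b")
stale_ok = table.accept("proc-1", 1, "tcp://node-a")  # rejected as stale
```

Under these assumptions, the old message is dropped and the stored contact information stays tied to the most recent epoch of the peer.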