Quick question: could you please clarify this statement:
...because more than one ORTED could (and often will) detect the failure.
I don't understand how this can be true, except for detecting an ORTED
failure. Only one orted can detect an MPI process failure, unless you have
now involved orteds in MPI communications (and I don't believe you did). If
the HNP directs another orted to restart that proc, and then that
incarnation fails, then the epoch number -should- increment again, shouldn't
it?
So are you concerned (re having the HNP mark a proc down multiple times)
about orted failure detection? In that case, I agree that you can have
multiple failure detections - we dealt with it differently in orcm, but I
have no issue with doing it another way. Just helps to know what problem you
are trying to solve.
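
To make the question concrete, here is a minimal sketch of how I read the
epoch idea: the HNP acts on a failure report only if it names the proc's
current epoch, so a second report of the same failure is dropped, while a
failure of the restarted incarnation (new epoch) is accepted. This is only an
illustration under my own assumptions. The names proc_state_t, init_procs,
report_failure and restart_proc and the fixed-size table are hypothetical and
are not taken from the ORTE code.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 16   /* hypothetical fixed table size for this sketch */

/* Hypothetical per-process record kept by the HNP (not an ORTE structure). */
typedef struct {
    unsigned epoch;    /* incarnation counter, bumped on every restart */
    bool     alive;    /* false once a failure has been recorded */
} proc_state_t;

static proc_state_t procs[MAX_PROCS];

/* Mark every process alive at epoch 0. */
void init_procs(void)
{
    for (int i = 0; i < MAX_PROCS; i++) {
        procs[i].epoch = 0;
        procs[i].alive = true;
    }
}

/*
 * Handle a failure report for `rank` that refers to `reported_epoch`.
 * Returns true if the HNP marks the process down, false if the report
 * is a duplicate (already marked down) or stale (older epoch).
 */
bool report_failure(int rank, unsigned reported_epoch)
{
    assert(rank >= 0 && rank < MAX_PROCS);
    proc_state_t *p = &procs[rank];

    if (reported_epoch != p->epoch || !p->alive) {
        return false;  /* someone already reported this incarnation */
    }
    p->alive = false;
    return true;
}

/* Restart `rank`: the new incarnation gets the next epoch number. */
unsigned restart_proc(int rank)
{
    assert(rank >= 0 && rank < MAX_PROCS);
    procs[rank].epoch++;
    procs[rank].alive = true;
    return procs[rank].epoch;
}
```

With this scheme two detectors reporting (rank 3, epoch 0) cause a single
mark-down, and after a restart a report for (rank 3, epoch 1) is treated as a
new failure.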