On Sep 8, 2007, at 2:33 PM, Aurelien Bouteiller wrote:
> I agree (b) is not a good idea. However I am not very pleased by (a)
> either. It totally prevents any process Fault Tolerant mechanism if we
> go that way. If we plan to add some failure detection mechanism to
> RTE and failure management (to avoid Finalize to hang), we should add
> the ability to plug-in FT specific error handlers. The default error
> handler should do exactly what is proposed by Ralph, but nowhere else
> (than in this handler) should the RTE code assume that the
> application is aborting when a failure occurs. If it is an FT
> application, it might simply recover instead of aborting.
(b) sounds fine to me.
If you genericize the concept, I think it's compatible with FT:
1. During MPI_INIT, one of the MPI processes can request a "notify"
   exit pattern for the job: a process must notify the RTE before it
   actually exits (i.e., some ORTE notification during MPI_FINALIZE).
   If a process exits before notifying the RTE, it's an error.
   1a. The default action upon error can be to kill the entire job.
   1b. If you desire pluggable error actions (e.g., not kill the
       entire job), I'm *assuming* that our plugin frameworks can handle
       that.
2. For an FT MPI job, I assume that the MPI processes would either
   not perform step 1 (i.e., the default action upon process exit is
   nothing -- just as if you had run "mpirun -np 4 hostname"), or you
   would select a specific action upon error/plugin for what to do when
   a process exits without first notifying the RTE.
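
To make the steps above a bit more concrete, here is a rough Python sketch of the bookkeeping I have in mind. Every name in it (Job, request_notify_pattern, notify_exit, process_exited, kill_job, ft_continue) is hypothetical; none of this is real ORTE or Open MPI code, and it only models the decision logic, not actual process management.

```python
from typing import Callable, Optional, Set

# Hypothetical model only: these are NOT ORTE or Open MPI APIs.


class Job:
    def __init__(self, nprocs: int) -> None:
        self.alive: Set[int] = set(range(nprocs))
        self.notified: Set[int] = set()
        self.notify_required = False
        self.on_error: Optional[Callable[["Job", int], None]] = None

    def request_notify_pattern(
        self, on_error: Optional[Callable[["Job", int], None]] = None
    ) -> None:
        """Step 1: requested by one MPI process during MPI_INIT."""
        self.notify_required = True
        # Step 1a: with no handler plugged in, fall back to killing the job.
        self.on_error = on_error or kill_job

    def notify_exit(self, rank: int) -> None:
        """What a process would do during MPI_FINALIZE."""
        self.notified.add(rank)

    def process_exited(self, rank: int) -> None:
        """Called when the runtime observes that a process has exited."""
        self.alive.discard(rank)
        if self.notify_required and rank not in self.notified:
            # Exit without prior notification is an error under step 1.
            self.on_error(self, rank)


def kill_job(job: Job, rank: int) -> None:
    """Step 1a: default error action, kill the entire job."""
    job.alive.clear()


def ft_continue(job: Job, rank: int) -> None:
    """Steps 1b/2: an FT-style action that leaves the survivors running."""
    pass
```

A job that never calls request_notify_pattern() behaves like the "mpirun -np 4 hostname" case: exits are simply recorded and nothing else happens. An FT job could instead pass something like ft_continue as the error action, so the only place that assumes "failure means abort" is the default handler.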