On Apr 14 2011, Ralph Castain wrote:
> I've run across an interesting issue for which I don't have a ready answer.
>
> If an MPI process aborts, we automatically abort the entire job.
>
> If an MPI process returns a non-zero exit status, indicating that there
> was something abnormal about its termination, we ignore it and let the
> job continue. We do print an error message out upon completion of the
> job, but we don't terminate the job upon receiving the non-zero status.
> Note that non-zero status is considered a "standard" method of indicating
> abnormal termination, though no meaning has been agreed upon for the
> specific value.
Not really. See below.
> Should we be allowing the job to continue in that circumstance? In the
> case I'm reviewing, the user's code indicates there is an error in the
> result. Since he has already called MPI_Finalize, he can't call
> MPI_Abort, and his system won't allow him to drop cores by calling
> "abort". So the exit status is his only way of indicating "abnormal
> termination".
>
> Obviously, in this case, he would prefer the job terminate as nothing
> useful is going to be accomplished - so no point in tying up the machine.
>
> Thoughts?
Blame Unix. Seriously. Many or most mainframes had the following
categories:

    Complete success - or, rather, a failure to detect an error :-)
    Partial success, with warnings of potential problems
    Failure that was diagnosed and partially cleaned-up
    Heap horrible failure - all bets are off

Unix has no such categorisation. The dividing line between a zero return
and other values can fall at any point in that list, and some programs
even use the values as flags. It's hopeless, and whatever you do will be
wrong for many people. I have no idea what Microsoft does, but I assume
that it has copied Unix, as that is its SOP. I recommend NOT rocking this
boat.

He might do better by calling abort after MPI_Finalize, but that's
pretty iffy - just like all other approaches. Improving this would need
a new function, or a new argument to MPI_Finalize.
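
To make the point concrete, here is a minimal sketch of what a launcher on a POSIX system actually gets to see when a child ends. It is not Open MPI code: run_child and describe are hypothetical helper names, and the children are plain Python one-liners standing in for MPI ranks. All the parent receives is a small integer or a terminating signal, so any mapping onto "success", "warning" or "failure" is a policy the launcher has to invent.

```python
import subprocess
import sys


def run_child(snippet):
    """Run a Python snippet in a child process and return its raw return code."""
    return subprocess.run([sys.executable, "-c", snippet]).returncode


def describe(returncode):
    """Describe a return code the way a hypothetical launcher might report it."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal as the negated signal number.
        return "killed by signal %d" % -returncode
    if returncode == 0:
        return "exit status 0"
    # Nothing in Unix says what a particular non-zero value means.
    return "exit status %d (meaning is program-specific)" % returncode
```

For example, describe(run_child("import sys; sys.exit(3)")) reports a plain non-zero status, while a child that runs "import os; os.abort()" would normally be reported as killed by a signal (SIGABRT on POSIX). Whether either of those should terminate the rest of the job is exactly the question the exit status alone cannot answer.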
Regards,
Nick Maclaren.