Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Exit status
From: N.M. Maclaren (nmm1_at_[hidden])
Date: 2011-04-14 04:02:48


On Apr 14 2011, Ralph Castain wrote:

> I've run across an interesting issue for which I don't have a ready answer.
>
> If an MPI process aborts, we automatically abort the entire job.
>
> If an MPI process returns a non-zero exit status, indicating that there
> was something abnormal about its termination, we ignore it and let the
> job continue. We do print an error message out upon completion of the
> job, but we don't terminate the job upon receiving the non-zero status.
> Note that non-zero status is considered a "standard" method of indicating
> abnormal termination, though no meaning has been agreed upon for the
> specific value.

Not really. See below.

> Should we be allowing the job to continue in that circumstance? In the
> case I'm reviewing, the user's code indicates there is an error in the
> result. Since he has already called MPI_Finalize, he can't call
> MPI_Abort, and his system won't allow him to drop cores by calling
> "abort". So the exit status is his only way of indicating "abnormal
> termination".
>
> Obviously, in this case, he would prefer the job terminate as nothing
> useful is going to be accomplished - so no point in tying up the machine.
>
> Thoughts?

Blame Unix. Seriously. Many, perhaps most, mainframe systems distinguished
the following categories of termination:

    Complete success - or, rather, a failure to detect an error :-)
    Partial success, with warnings of potential problems
    Failure that was diagnosed and partially cleaned up
    Heap horrible failure - all bets are off

Unix has no such categorisation. The dividing line between a zero return
and a non-zero one can fall anywhere in that list, and some programs even
use the return value as a set of flags. It's hopeless, and whatever you do
will be wrong for many people. I have no idea what Microsoft does, but I
assume that it has copied Unix, as that is its SOP. I recommend NOT rocking
this boat.

He might do better by calling abort after MPI_Finalize, but that's
pretty iffy - just like all other approaches. Improving on this would
need a new function, or a new argument to MPI_Finalize.
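
To make the distinction concrete, here is a minimal POSIX sketch of what
any parent process (a job launcher, say) can observe when a child ends.
It is not Open MPI's code and uses no MPI calls; the names child_end_t,
run_and_classify and the three child_* bodies are hypothetical, made up
for illustration. It assumes a Unix-like system with fork and waitpid.
The point is that "exited with non-zero status" and "killed by SIGABRT
from abort" arrive as different kinds of status, and that the core-file
limit does not change what the parent sees.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical categories, for illustration only. */
typedef enum {
    ENDED_ZERO,      /* normal exit, status 0 */
    ENDED_NONZERO,   /* normal exit, non-zero status */
    ENDED_BY_SIGNAL, /* killed by a signal, e.g. SIGABRT from abort() */
    ENDED_UNKNOWN    /* fork or waitpid failed */
} child_end_t;

/* Three example child bodies. */
void child_exit_ok(void)  { _exit(0); }
void child_exit_bad(void) { _exit(3); }

void child_abort(void)
{
    /* Set the core-file size limit to zero, then abort. The process is
     * still terminated by SIGABRT; only the core file is suppressed. */
    struct rlimit no_core = { 0, 0 };
    setrlimit(RLIMIT_CORE, &no_core);
    abort();
}

/* Run body() in a child process and report how that child ended.
 * On return, *detail holds the exit status or the signal number. */
child_end_t run_and_classify(void (*body)(void), int *detail)
{
    int status = 0;
    pid_t pid;

    assert(body != NULL && detail != NULL);

    pid = fork();
    if (pid < 0)
        return ENDED_UNKNOWN;
    if (pid == 0) {
        body();
        _exit(127); /* only reached if body() returns */
    }

    /* Retry if the wait is interrupted by a signal in the parent. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return ENDED_UNKNOWN;
    }

    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return *detail == 0 ? ENDED_ZERO : ENDED_NONZERO;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);
        return ENDED_BY_SIGNAL;
    }
    return ENDED_UNKNOWN;
}
```

Calling run_and_classify(child_exit_bad, &d) reports a normal exit and
sets d to 3, while run_and_classify(child_abort, &d) reports a signal
and sets d to SIGABRT. What the parent then does with a non-zero status
is exactly the policy question above, and nothing in the status itself
says whether 3 means "warning" or "failure".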

Regards,
Nick Maclaren.