I've run across an interesting issue for which I don't have a ready answer.
If an MPI process aborts, we automatically abort the entire job.
If an MPI process instead returns a non-zero exit status, indicating that something was abnormal about its termination, we ignore it and let the job continue. We do print an error message when the job completes, but we don't terminate the job upon receiving the non-zero status. Note that a non-zero status is considered a "standard" way of indicating abnormal termination, though no meaning has been agreed upon for any specific value.
Should we be allowing the job to continue in that circumstance? In the case I'm reviewing, the user's code has determined that there is an error in the result. Since he has already called MPI_Finalize, he can't call MPI_Abort, and his system won't allow him to dump core by calling "abort". So the exit status is his only way of indicating "abnormal termination".
Obviously, in this case, he would prefer that the job terminate, since nothing useful is going to be accomplished and there is no point in tying up the machine.
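
To make the question concrete, here is a minimal Python sketch of the alternative policy, in which the launcher kills the remaining processes as soon as any one of them exits with a non-zero status. This is not how our launcher is implemented: `run_job` and `poll_interval` are hypothetical names, and plain subprocesses stand in for MPI ranks.

```python
import subprocess
import sys
import time


def run_job(commands, poll_interval=0.05):
    """Launch every command; kill the rest as soon as one exits non-zero.

    Returns a non-zero exit status if any process reported one,
    or 0 if every process exited cleanly.
    (Hypothetical helper, for illustration only.)
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            # poll() returns None while the process is still running.
            codes = [p.poll() for p in procs]
            failed = [c for c in codes if c not in (None, 0)]
            if failed:
                return failed[0]
            if all(c == 0 for c in codes):
                return 0
            time.sleep(poll_interval)
    finally:
        # Kill whatever is still running, then reap every child.
        for p in procs:
            if p.poll() is None:
                p.kill()
        for p in procs:
            p.wait()
```

Called as `run_job([cmd_a, cmd_b])` with one argument list per process, it returns as soon as one process fails rather than waiting for the others to finish on their own, which is the behavior the user is asking for.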