I ran into something this week that I think may require consideration by the MPI Forum. Specifically, Rolf found a problem in their MTT runs where the tests expect mpirun to return a non-zero exit status because one or more application processes did so, even though all application procs terminated normally.
I jury-rigged a simple algo that has mpirun return the exit status of the lowest rank that returned non-zero in the case where the job terminated normally. In the abnormal case, we still return the exit code of the first process to abnormally terminate (i.e., the process that is first reported to the HNP, which may not be the first process that actually aborted).
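To make the rule above concrete, here is a minimal sketch in Python. It is only an illustration of the selection logic, not the actual mpirun code; the names ProcResult and job_exit_status are made up for this example, and the input list is assumed to be in the order the results were reported to the HNP.

```python
from typing import NamedTuple, Sequence


class ProcResult(NamedTuple):
    """Hypothetical record of how one application process ended."""
    rank: int
    exit_code: int
    aborted: bool = False  # True if the process terminated abnormally


def job_exit_status(results: Sequence[ProcResult]) -> int:
    """Pick a single exit status for the whole job.

    `results` is assumed to be in the order the results were reported.
    """
    # Abnormal termination wins: use the first one that was reported,
    # which is not necessarily the first process that actually aborted.
    for r in results:
        if r.aborted:
            return r.exit_code

    # Everyone terminated normally: use the lowest rank with a non-zero code.
    nonzero = [r for r in results if r.exit_code != 0]
    if not nonzero:
        return 0
    return min(nonzero, key=lambda r: r.rank).exit_code
```

Swapping min for max in the last line would give the "highest rank" variant instead.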
However, it raises the question - what is the actual behavior supposed to be in the case where all procs terminate normally, but some may return (possibly different) non-zero codes?
I asked a few MPI users, and got a different answer from every one of them. The only consistent response I got was that the MPI standard doesn't say what should happen (can someone confirm that?).
Here is a sampling of the responses:
1. return the exit status of the lowest rank that returned non-zero (which I implemented for now to silence Rolf's MTT problem)
2. return the exit status of the highest rank that returned non-zero
3. print out a histogram of exit statuses
- ranks 0-9: 0
- ranks 10-21,110: 1
- ranks 22-35,40-51: 2
4. print out ALL the exit statuses
5. ignore it - mpirun's exit code should only reflect OMPI internals. It is the app developer's responsibility to properly deal with non-zero exit conditions (e.g., by calling MPI_Abort).
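For the histogram idea (option 3), a rough sketch of how the output could be built is below. Again, this is Python for illustration only; compress_ranks and exit_histogram are hypothetical names, and the input is assumed to be a simple rank-to-exit-status mapping.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def compress_ranks(ranks: Iterable[int]) -> str:
    """Turn a set of ranks into a compact string such as '10-21,110'."""
    spans = []  # list of [first, last] runs of consecutive ranks
    for r in sorted(ranks):
        if spans and r == spans[-1][1] + 1:
            spans[-1][1] = r          # extend the current run
        else:
            spans.append([r, r])      # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in spans)


def exit_histogram(exit_codes: Dict[int, int]) -> List[str]:
    """Group ranks by exit status and format one line per status."""
    by_code = defaultdict(list)
    for rank, code in exit_codes.items():
        by_code[code].append(rank)
    return [f"ranks {compress_ranks(rs)}: {code}"
            for code, rs in sorted(by_code.items())]


# Example input reproducing the histogram shown in option 3.
codes = {r: 0 for r in range(0, 10)}
codes.update({r: 1 for r in list(range(10, 22)) + [110]})
codes.update({r: 2 for r in list(range(22, 36)) + list(range(40, 52))})
for line in exit_histogram(codes):
    print(line)
```

Printing every exit status (option 4) would just be the same mapping dumped without the grouping step.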
When I circled back around with these alternatives, I got the expected answer: everyone felt that all of them were good, and wanted a cmd line option to select the behavior for their job. They also noted that --xml should cause any of them to produce output in a defined XML format.
As I told Rolf, I honestly don't care what we do in this case. All I ask for is a clearly defined behavior so I don't get yanked in multiple directions, constantly circling around from one solution to the next.
So if the MPI standard doesn't specify this behavior, could someone involved in the Forum -please- get it to address this??
In the interim, what do -we- think it should do?