Did you have the param set? I found some missing code in the orted errmgr that contributed to the hang, but unless you had set the param in your test, there is no way the job would abort, no matter how many procs exit with non-zero status.

I'm guessing you have that param set in your test because we earlier defined the default as "no abort". I'm content to leave it there, but wanted to ensure your tests ran clean.

On Apr 13, 2012, at 4:32 PM, TERRY DONTJE wrote:

I could see that if fewer than N processes exit with a non-zero exit code, ORTE may choose not to abort the job. However, if all N processes have exited or aborted, I expect everything to clean up and mpirun to exit. It does not do that at the moment, which I think is what is causing most of the hangs in the MTT trunk runs; those hangs did not occur prior to this week.
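
To make the expectation above concrete, here is a minimal sketch of the decision as a toy model. It is not ORTE's errmgr code: the function name `job_action`, its arguments, and the returned strings are all hypothetical, and the boolean flag only stands in for the orte_abort_non_zero_exit param.

```python
from typing import Mapping


def job_action(exit_codes: Mapping[int, int], nprocs: int,
               abort_non_zero_exit: bool) -> str:
    """Toy model of the expected job-level decision (hypothetical, not ORTE code).

    exit_codes maps the rank of each process that has already terminated
    to its exit status; ranks that are still running are absent.
    """
    # With the flag set, a single non-zero exit status aborts the whole job.
    if abort_non_zero_exit and any(code != 0 for code in exit_codes.values()):
        return "abort"
    # Once all N processes have terminated, the job must clean up and
    # mpirun must exit, whatever the individual exit statuses were.
    if len(exit_codes) == nprocs:
        return "finish"
    # Otherwise some processes are still running and the job continues.
    return "continue"
```

In this model, `job_action({0: 1, 1: 1}, 2, False)` yields "finish" rather than waiting forever, which is the case that appears to hang in the MTT trunk runs.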

--td

On 4/13/2012 5:18 PM, Ralph Castain wrote:
This has come up again because some of the MTT tests depend on a specific behavior when a process exits with a non-zero status - in this case, they expect ORTE to abort the job. At some point, the default had been switched to NOT abort the job if a process exited with a non-zero status.

So I'll throw this out to the community: if any process exits with a non-zero status, should ORTE abort the job?

I don't personally care, but we ought to decide on something. In the meantime, I will set the default so we DO abort, thus allowing the MTT runs to complete correctly.

FWIW: the MCA param orte_abort_non_zero_exit can always be set to control this behavior.

Ralph


_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje@oracle.com


