Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Non-zero exit status
From: TERRY DONTJE (terry.dontje_at_[hidden])
Date: 2012-04-14 07:47:34


On 4/13/2012 6:40 PM, Ralph Castain wrote:
> Did you have the param set? I found some missing code in the orted
> errmgr that contributed to it, but unless you had set the param in
> your test, there is no way it would abort no matter how many procs
> exit with non-zero status.
>
Is it a bug that mpirun sticks around after all procs have gone? If not,
then what is the use of leaving mpirun hanging around?
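
The expectation discussed in this thread can be sketched roughly as follows. This is only an illustrative model, not ORTE code: the function `expected_action`, the `Action` enum, and the `abort_on_non_zero` flag are hypothetical names standing in for the errmgr logic and the orte_abort_non_zero_exit param.

```python
from enum import Enum


class Action(Enum):
    KEEP_RUNNING = "keep running"
    ABORT_JOB = "abort job"
    CLEAN_UP_AND_EXIT = "clean up and exit"


def expected_action(exit_codes, num_procs, abort_on_non_zero):
    """Hypothetical model of what mpirun is expected to do.

    exit_codes: exit statuses of the procs that have terminated so far.
    num_procs: total number of procs in the job (N).
    abort_on_non_zero: stand-in for the orte_abort_non_zero_exit param.
    """
    # Once all N procs are gone, there is nothing left to wait for,
    # regardless of how the param is set.
    if len(exit_codes) >= num_procs:
        return Action.CLEAN_UP_AND_EXIT
    # Fewer than N procs have terminated: the param decides whether a
    # non-zero status aborts the rest of the job.
    if abort_on_non_zero and any(code != 0 for code in exit_codes):
        return Action.ABORT_JOB
    return Action.KEEP_RUNNING
```

In this model the param only matters while some procs are still alive; a hang after all N procs have terminated would not be explained by the param's value alone.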
> I'm guessing you have that param set in your test due to our earlier
> defining the default to "no abort". I'm content to leave it there, but
> wanted to ensure your tests ran clean.

I don't believe we are setting the env-var, which is why I think we have
a regression. It also seems very suspicious to me that both Oracle and
IU are seeing the same condition in MTT. I'll look into this more on
Monday.
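
One way to check whether the param is coming in through the environment is sketched below. It relies on the Open MPI convention that an MCA param can be supplied as an environment variable named OMPI_MCA_<param>; the helper names `mca_env_name` and `mca_param_from_env` are made up for illustration. A result of None only means the param is not set in the environment: it could still be set on the mpirun command line or in a param file.

```python
import os

# Param name taken from this thread.
PARAM = "orte_abort_non_zero_exit"


def mca_env_name(param):
    """Return the environment variable name Open MPI reads for an MCA param."""
    return "OMPI_MCA_" + param


def mca_param_from_env(param, environ=None):
    """Return the param's value from the environment, or None if it is not set there.

    environ defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if environ is None else environ
    return env.get(mca_env_name(param))


if __name__ == "__main__":
    value = mca_param_from_env(PARAM)
    if value is None:
        print(mca_env_name(PARAM) + " is not set in this environment")
    else:
        print(mca_env_name(PARAM) + "=" + value)
```

Running this inside the MTT test environment would show whether the env-var is present there.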

--td
>
> On Apr 13, 2012, at 4:32 PM, TERRY DONTJE wrote:
>
>> I could see that, if fewer than N processes exit with a non-zero exit
>> code, ORTE may choose not to abort the job. However, if all N
>> processes have exited or aborted, I expect everything to clean up and
>> mpirun to exit. It does not do that at the moment, which I think is
>> what is causing most of the hangs in the MTT trunk runs; those hangs
>> did not occur prior to this week.
>>
>> --td
>>
>> On 4/13/2012 5:18 PM, Ralph Castain wrote:
>>> This has come up again because some of the MTT tests depend on a specific behavior when a process exits with a non-zero status - in this case, they expect ORTE to abort the job. At some point, the default had been switched to NOT abort the job if a process exited with a non-zero status.
>>>
>>> So I'll throw this out to the community: if any process exits with a non-zero status, should ORTE abort the job?
>>>
>>> I don't personally care, but we ought to decide on something. In the meantime, I will set the default so we DO abort, thus allowing the MTT runs to complete correctly.
>>>
>>> FWIW: the MCA param orte_abort_non_zero_exit can always be set to control this behavior.
>>>
>>> Ralph
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]