Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun exit status
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-20 08:55:04


On Mar 20, 2009, at 4:21 AM, Cristian KLEIN wrote:

> Jeff Squyres wrote:
>> I believe that this was just fixed in OMPI v1.3.1 -- could you try
>> upgrading?
>
> Yup, the issue is well solved. :)
>
> I would just like to add one thing. Isn't the current solution a little
> bit error-prone? I mean, instead of having to check before each call to
> ORTE_UPDATE_EXIT_STATUS whether the low 8 bits are indeed non-zero,
> wouldn't it be wiser to have ORTE_UPDATE_EXIT_STATUS do the check?

Because many times we set the exit status with a value that doesn't
come from a process termination, but rather from some internal error
return. In those cases, you can't use the usual OS-specific macros to
test for abnormal termination, so you cannot put the test in the
ORTE_UPDATE_EXIT_STATUS code.
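
To illustrate the distinction, here is a minimal sketch in C. It is not ORTE code: both helper names are hypothetical, the 128 + signal convention is borrowed from common shell behavior, and the decoding of 65280 assumes the conventional Linux wait-status layout. A raw wait status such as the 65280 reported below can be decoded with the `<sys/wait.h>` macros, whereas an internal error return is just an integer in no particular encoding, so applying the same macros to it would give meaningless results.

```c
#include <sys/wait.h>

/* Hypothetical helper (not part of ORTE): convert a raw status obtained
 * from wait()/waitpid() into a value that is safe to pass to exit(). */
static int exit_code_from_wait_status(int wstatus)
{
    if (WIFEXITED(wstatus)) {
        /* Normal termination: the child's own exit code, in 0..255. */
        return WEXITSTATUS(wstatus);
    }
    if (WIFSIGNALED(wstatus)) {
        /* Killed by a signal: 128 + signal number is a shell-style
         * convention assumed here, not something ORTE prescribes. */
        return 128 + WTERMSIG(wstatus);
    }
    /* Stopped or otherwise unrecognized: report a generic failure. */
    return 1;
}

/* Hypothetical helper: the value a parent sees when `status` is handed
 * to exit() unchanged, since only the low 8 bits are kept. */
static int as_seen_by_parent(int status)
{
    return status & 0xFF;
}
```

With these helpers, passing 65280 (0xFF00) straight to `exit()` is reported to the parent as 0, while decoding it first yields a non-zero code. The decoding step is only valid at call sites that know the value came from `waitpid()`, which is why such a check sits more naturally with the caller than inside a general-purpose macro.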

>
>
>>
>> On Mar 19, 2009, at 10:58 AM, Cristian KLEIN wrote:
>>
>>> Hello everybody,
>>>
>>> I've been using OpenMPI 1.3's mpirun in Makefiles and observed that the
>>> exit status is not always the one I expect. For example, using an
>>> incorrect machinefile makes mpirun return 0, whereas a non-zero value
>>> would be expected:
>>>
>>> --- cut here ---
>>> masternode:~/grid/myTests/hellompi$ env | grep OMPI
>>> OMPI_MCA_plm_rsh_agent=ssh
>>> OMPI_MCA_btl_tcp_if_exclude=lo,myri0
>>> OMPI_MCA_btl=self,tcp
>>>
>>> masternode:~/grid/myTests/hellompi$ mpirun.openmpi -machinefile hostfile ./hellompi.openmpi; echo $?
>>> ssh: incorrecthost2.example.com: Name or service not known
>>> ssh: incorrecthost1.example.com: Name or service not known
>>> [snip]
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>>
>>> mpirun: clean termination accomplished
>>>
>>> 0
>>> --- end here ---
>>>
>>> The problem comes from the fact that the exit status of a process is
>>> ANDed with 0xFF (only the low 8 bits are kept) and OpenMPI does not take
>>> this into consideration. In my example, the exit status generated was
>>> 65280 (0xFF00), which has the lower 8 bits zero.
>>>
>>> To solve this problem I suggest the attached patch. If the exit status
>>> would become zero, it replaces it with 111, where 111 is obviously a
>>> randomly chosen non-zero number. :D
>>> --- orte/runtime/orte_globals.h.orig 2009-01-09 18:55:22.000000000 +0100
>>> +++ orte/runtime/orte_globals.h 2009-03-19 15:44:06.822708734 +0100
>>> @@ -109,11 +109,14 @@
>>>  #define ORTE_UPDATE_EXIT_STATUS(newstatus) \
>>>      do { \
>>>          if (0 == orte_exit_status && 0 != newstatus) { \
>>> +            if ((newstatus & 0377) == 0) \
>>> +                orte_exit_status = 111; \
>>> +            else \
>>> +                orte_exit_status = newstatus; \
>>>              OPAL_OUTPUT_VERBOSE((1, orte_debug_output, \
>>>                                   "%s:%s(%d) updating exit status to %d", \
>>>                                   ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), \
>>> -                                 __FILE__, __LINE__, newstatus)); \
>>> -            orte_exit_status = newstatus; \
>>> +                                 __FILE__, __LINE__, orte_exit_status)); \
>>>          } \
>>>      } while(0);
>>>
>>> <ATT5772424.txt>
>>
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users