Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun exit status
From: Cristian KLEIN (cristiklein_at_[hidden])
Date: 2009-03-20 06:21:00


Jeff Squyres wrote:
> I believe that this was just fixed in OMPI v1.3.1 -- could you try
> upgrading?

Yup, the issue is indeed solved. :)

I would just like to add one thing. Isn't the current solution a little
bit error-prone? I mean, instead of having to check before each call to
ORTE_UPDATE_EXIT_STATUS whether the low 8 bits are indeed non-zero,
wouldn't it be wiser to have ORTE_UPDATE_EXIT_STATUS do the check itself?
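
A minimal sketch of that idea, under stated assumptions: `sketch_exit_status`
stands in for the real `orte_exit_status`, and every `sketch_*` / `SKETCH_*`
name is hypothetical and not part of ORTE. The fallback value 111 is the
arbitrary one from the patch quoted below. This is not how v1.3.1 implements
the fix; it only illustrates moving the low-8-bits check into the update
itself, so callers cannot forget it.

```c
/* Stand-in for the global that mpirun eventually returns.
 * The real orte_exit_status lives in ORTE; this copy is illustrative only. */
static int sketch_exit_status = 0;

/* Hypothetical fallback, used when a non-zero status has all-zero low 8 bits.
 * 111 is the arbitrary value taken from the quoted patch. */
#define SKETCH_FALLBACK_STATUS 111

/* Return a status whose low 8 bits are non-zero whenever status != 0. */
static int sketch_normalize_status(int status)
{
    if (status != 0 && (status & 0xFF) == 0) {
        return SKETCH_FALLBACK_STATUS;
    }
    return status;
}

/* Same "first non-zero status wins" rule as the original macro, but the
 * check on the low 8 bits happens here instead of at every call site. */
#define SKETCH_UPDATE_EXIT_STATUS(newstatus)                           \
    do {                                                               \
        if (0 == sketch_exit_status && 0 != (newstatus)) {             \
            sketch_exit_status = sketch_normalize_status(newstatus);   \
        }                                                              \
    } while (0)
```

With this shape, `SKETCH_UPDATE_EXIT_STATUS(65280)` would store 111 rather
than a value that the shell sees as 0, and a later call with another status
leaves the stored value untouched.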

>
> On Mar 19, 2009, at 10:58 AM, Cristian KLEIN wrote:
>
>> Hello everybody,
>>
>> I've been using OpenMPI 1.3's mpirun in Makefiles and observed that the
>> exit status is not always the one I expect. For example, using an
>> incorrect machinefile makes mpirun return 0, whereas a non-zero value
>> would be expected:
>>
>> --- cut here ---
>> masternode:~/grid/myTests/hellompi$ env | grep OMPI
>> OMPI_MCA_plm_rsh_agent=ssh
>> OMPI_MCA_btl_tcp_if_exclude=lo,myri0
>> OMPI_MCA_btl=self,tcp
>>
>> masternode:~/grid/myTests/hellompi$ mpirun.openmpi -machinefile hostfile ./hellompi.openmpi; echo $?
>> ssh: incorrecthost2.example.com: Name or service not known
>> ssh: incorrecthost1.example.com: Name or service not known
>> [snip]
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>>
>> mpirun: clean termination accomplished
>>
>> 0
>> --- end here ---
>>
>> The problem comes from the fact that the exit status of a process is ANDed
>> with 0xFF, so only its low 8 bits reach the parent, and Open MPI does not
>> take this into consideration. In my example, the exit status generated was
>> 65280 (0xFF00), whose lower 8 bits are zero.
>>
>> To solve this problem I suggest the attached patch. If the low 8 bits of
>> the exit status would be zero, it replaces the status with 111, where 111
>> is simply an arbitrarily chosen non-zero number. :D
>> --- orte/runtime/orte_globals.h.orig 2009-01-09 18:55:22.000000000 +0100
>> +++ orte/runtime/orte_globals.h 2009-03-19 15:44:06.822708734 +0100
>> @@ -109,11 +109,14 @@
>>  #define ORTE_UPDATE_EXIT_STATUS(newstatus)                                \
>>      do {                                                                  \
>>          if (0 == orte_exit_status && 0 != newstatus) {                    \
>> +            if ((newstatus & 0377) == 0)                                  \
>> +                orte_exit_status = 111;                                   \
>> +            else                                                          \
>> +                orte_exit_status = newstatus;                             \
>>              OPAL_OUTPUT_VERBOSE((1, orte_debug_output,                    \
>>                                   "%s:%s(%d) updating exit status to %d",  \
>>                                   ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),      \
>> -                                 __FILE__, __LINE__, newstatus));         \
>> -            orte_exit_status = newstatus;                                 \
>> +                                 __FILE__, __LINE__, orte_exit_status));  \
>>          }                                                                 \
>>      } while(0);
>>
>> <ATT5772424.txt>
>
>