Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun exit status
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-19 21:22:18


I believe that this was just fixed in OMPI v1.3.1 -- could you try
upgrading?

On Mar 19, 2009, at 10:58 AM, Cristian KLEIN wrote:

> Hello everybody,
>
> I've been using OpenMPI 1.3's mpirun in Makefiles and observed that the
> exit status is not always the one I expect. For example, using an
> incorrect machinefile makes mpirun return 0, whereas a non-zero value
> would be expected:
>
> --- cut here ---
> masternode:~/grid/myTests/hellompi$ env | grep OMPI
> OMPI_MCA_plm_rsh_agent=ssh
> OMPI_MCA_btl_tcp_if_exclude=lo,myri0
> OMPI_MCA_btl=self,tcp
>
> masternode:~/grid/myTests/hellompi$ mpirun.openmpi -machinefile hostfile ./hellompi.openmpi; echo $?
> ssh: incorrecthost2.example.com: Name or service not known
> ssh: incorrecthost1.example.com: Name or service not known
> [snip]
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> 0
> --- end here ---
>
> The problem comes from the fact that the exit status of a process is
> ANDed with 0xFF (only the low 8 bits are kept) and OpenMPI does not
> take this into consideration. In my example, the exit status generated
> was 65280 (0xFF00), whose lower 8 bits are zero.
>
> To solve this problem I suggest the attached patch. If the exitstatus
> would become zero, it replaces it with 111, where 111 is obviously a
> randomly chosen non-zero number. :D
>
> --- orte/runtime/orte_globals.h.orig 2009-01-09 18:55:22.000000000 +0100
> +++ orte/runtime/orte_globals.h 2009-03-19 15:44:06.822708734 +0100
> @@ -109,11 +109,14 @@
>  #define ORTE_UPDATE_EXIT_STATUS(newstatus) \
>      do { \
>          if (0 == orte_exit_status && 0 != newstatus) { \
> +            if ((newstatus & 0377) == 0) \
> +                orte_exit_status = 111; \
> +            else \
> +                orte_exit_status = newstatus; \
>              OPAL_OUTPUT_VERBOSE((1, orte_debug_output, \
>                                   "%s:%s(%d) updating exit status to %d", \
>                                   ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), \
> -                                 __FILE__, __LINE__, newstatus)); \
> -            orte_exit_status = newstatus; \
> +                                 __FILE__, __LINE__, orte_exit_status)); \
>          } \
>      } while(0);
>
> <ATT5772424.txt>
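
The truncation described in the quoted message can be sketched in a few lines of C. This is only an illustration, not code from Open MPI: the helper names visible_exit_status and guarded_exit_status are hypothetical, and the fallback value 111 is simply the one chosen in the quoted patch. The value 65280 is 0xFF00, which on Linux is what a raw wait status looks like for a child that exited with code 255, although the message does not say where the value came from.

```c
/* Sketch only: helper names are hypothetical, not part of Open MPI. */

/* The value a parent (e.g. the shell's $?) sees after exit(status):
 * only the low 8 bits of the status survive. */
static int visible_exit_status(int status)
{
    return status & 0xFF;
}

/* Mirrors the guard in the quoted patch: if a non-zero status would be
 * truncated to zero, substitute an arbitrary non-zero value (111, as in
 * the patch) so that the failure stays visible to the caller. */
static int guarded_exit_status(int status)
{
    if (status != 0 && (status & 0xFF) == 0) {
        return 111;
    }
    return status;
}
```

With these definitions, visible_exit_status(65280) evaluates to 0, which matches the "0" printed by echo $? in the quoted session, whereas visible_exit_status(guarded_exit_status(65280)) evaluates to 111.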

-- 
Jeff Squyres
Cisco Systems