Open MPI Development Mailing List Archives

Subject: [OMPI devel] Intermittent hangs when exiting with error
From: Rolf vandeVaart (rvandevaart_at_[hidden])
Date: 2014-05-29 11:11:07


Ralph:
I am seeing cases where mpirun seems to hang when one of the application processes exits with a non-zero status. For example, the Intel test MPI_Cart_get_c exits that way if there are not enough processes to run the test. In most cases mpirun returns fine with the error code, but sometimes it just hangs. I first noticed this in my MTT runs. It seems (though this is not conclusive) that I see it when both the usnic and openib BTLs are built, even though I am only using openib (I have no usnic hardware).

Is anyone else seeing something like this? I see it on both 1.8 and trunk, but the output below is from trunk.
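Since the hang is intermittent, one way to put a number on it is to rerun the same command many times under a timeout and count how many runs never return. The following is only a rough sketch of that idea, not part of MTT or Open MPI; the helper names `run_once` and `count_hangs` and the timeout values are made up for illustration.

```python
import subprocess
import sys


def run_once(cmd, timeout_s):
    """Run cmd once.

    Returns ("exit", returncode) if the command finishes within timeout_s
    seconds, or ("hang", None) if it does not. A run classified as a hang
    is killed so the loop can continue.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return ("exit", proc.wait(timeout=timeout_s))
    except subprocess.TimeoutExpired:
        # Only the launched process is killed here; any remote daemons it
        # started are NOT cleaned up by this sketch.
        proc.kill()
        proc.wait()
        return ("hang", None)


def count_hangs(cmd, runs, timeout_s):
    """Run cmd `runs` times and return (number_of_hangs, per_run_results)."""
    results = [run_once(cmd, timeout_s) for _ in range(runs)]
    hangs = sum(1 for kind, _ in results if kind == "hang")
    return hangs, results
```

To use it, pass the mpirun command line shown in the transcripts below as a list of arguments, for example `count_hangs(["mpirun", "--mca", "btl", "self,sm,usnic,openib", ...], runs=50, timeout_s=60)`, where the run count and timeout are arbitrary choices. To inspect a hung mpirun with pstack instead of killing it, drop the `proc.kill()` call.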

PASS:
[rvandevaart_at_drossetti-ivy0 src]$ mpirun --mca btl self,sm,usnic,openib --host drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 --mca btl_openib_warn_default_gid_prefix 0 MPI_Cart_get_c
MPITEST skip (1): WARNING -- nodes = 4 Need 6 nodes to run test
MPITEST info (0): Starting MPI_Cart_get test
MPITEST skip (0): WARNING -- nodes = 4 Need 6 nodes to run test
MPITEST skip (3): WARNING -- nodes = 4 Need 6 nodes to run test
MPITEST skip (2): WARNING -- nodes = 4 Need 6 nodes to run test
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[45854,1],1]
  Exit code: 77
--------------------------------------------------------------------------

FAIL:
[rvandevaart_at_drossetti-ivy0 src]$ mpirun --mca btl self,sm,usnic,openib --host drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 --mca btl_openib_warn_default_gid_prefix 0 MPI_Cart_get_c
MPITEST skip (1): WARNING -- nodes = 4 Need 6 nodes to run test
MPITEST info (0): Starting MPI_Cart_get test
MPITEST skip (0): WARNING -- nodes = 4 Need 6 nodes to run test
MPITEST skip (3): WARNING -- nodes = 4 Need 6 nodes to run test
MPITEST skip (2): WARNING -- nodes = 4 Need 6 nodes to run test
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[...now we are hung...]

LOCAL mpirun:
[rvandevaart_at_drossetti-ivy0 64-mtt-nocuda]$ pstack 27705
Thread 2 (Thread 0x7fe0c8c47700 (LWP 27706)):
#0 0x00007fe0ca578533 in select () from /lib64/libc.so.6
#1 0x00007fe0c8c5591e in listen_thread () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/openmpi/mca_oob_tcp.so
#2 0x00007fe0ca831851 in start_thread () from /lib64/libpthread.so.0
#3 0x00007fe0ca57f94d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fe0cbcdd700 (LWP 27705)):
#0 0x00007fe0ca576293 in poll () from /lib64/libc.so.6
#1 0x00007fe0cb589575 in poll_dispatch () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#2 0x00007fe0cb57df8c in opal_libevent2021_event_base_loop () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#3 0x0000000000405572 in orterun ()
#4 0x0000000000403904 in main ()
[rvandevaart_at_drossetti-ivy0 64-mtt-nocuda]$

REMOTE ORTED:
[rvandevaart_at_drossetti-ivy1 ~]$ pstack 10241
#0 0x00007fbdcba7c258 in poll () from /lib64/libc.so.6
#1 0x00007fbdcca8f575 in poll_dispatch () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#2 0x00007fbdcca83f8c in opal_libevent2021_event_base_loop () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#3 0x00007fbdccd572cc in orte_daemon () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-rte.so.0
#4 0x000000000040094a in main ()
[rvandevaart_at_drossetti-ivy1 ~]$
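In both dumps above the innermost frame of every thread is a blocking wait (select or poll). When comparing many such dumps across hung runs, it can help to reduce each one to just the innermost function per thread. The sketch below does that for text in the pstack format shown here; the function name `top_frames` is hypothetical, and treating a dump with no thread header as "thread 1" is an assumption made for convenience.

```python
import re

# Matches either a thread header ("Thread 2 (") or a stack frame
# ("#0 0x00007f... in poll (").
_TOKEN_RE = re.compile(
    r"Thread\s+(\d+)\s+\(|#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s*\("
)


def top_frames(pstack_text):
    """Return {thread_number: innermost_function_name} for a pstack dump.

    Dumps with no "Thread N" header (single-threaded processes) are
    reported under thread number 1.
    """
    result = {}
    current_thread = 1
    for match in _TOKEN_RE.finditer(pstack_text):
        if match.group(1) is not None:
            # New thread header: later frames belong to this thread.
            current_thread = int(match.group(1))
        elif match.group(2) == "0":
            # Frame #0 is the innermost frame, i.e. where the thread is blocked.
            result[current_thread] = match.group(3)
    return result
```

Applied to the two dumps above, this would report select and poll for the two mpirun threads and poll for the orted, which matches mpirun and the remote daemon both sitting idle in their event loops.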
