Open MPI Development Mailing List Archives


From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-07-04 17:05:04


I have been noticing this as well for a while (at least 2 months),
along with stale session directories. I filed a bug yesterday, #177:
   https://svn.open-mpi.org/trac/ompi/ticket/177
I'll add this stack trace to it. I want to take a closer look
tomorrow to see what's really going on here.

When I left it yesterday, I found that if you CTRL-C the running
mpirun and the orteds hang, then sending another signal to mpirun
sometimes makes mpirun die from SIGPIPE. This is a race condition
caused by the orteds leaving, but we should be masking that signal,
or otherwise handling it, rather than dying.

So I think there is more than one race in this code, and it will need
some serious looking at.
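
One classic source of races in SIGCHLD-driven wait code is that
several child exits can be coalesced into a single signal delivery.
The sketch below shows the usual defensive pattern in plain POSIX C.
It is only an illustration and makes no claim about what orte_wait.c
actually does or where the bug is: reap_all_children(), killed_by()
and child_cb_t are hypothetical names. Assuming the status=9 in frame
#4 of the quoted trace is a raw wait status, it decodes to
"terminated by signal 9", which matches the reported SIGKILL.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback type: invoked once per reaped child. */
typedef void (*child_cb_t)(pid_t pid, int status);

/* Reap every child that has already exited, without blocking.
 * Looping matters because one SIGCHLD may stand for several exits.
 * Returns the number of children reaped in this call. */
static int reap_all_children(child_cb_t cb)
{
    int reaped = 0;
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            if (cb != NULL) {
                cb(pid, status);
            }
            reaped++;
            continue;
        }
        if (pid < 0 && errno == EINTR) {
            continue;   /* interrupted by a signal: retry */
        }
        break;          /* 0: children still running; -1: none left */
    }
    return reaped;
}

/* Return the signal that terminated the child, or 0 if it was not
 * terminated by a signal (for example, it exited normally). */
static int killed_by(int status)
{
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling reap_all_children() from the event-loop callback that handles
SIGCHLD, rather than waiting for exactly one pid per signal, avoids
missing a child whose exit was folded into an earlier delivery.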

--Josh

On Jul 4, 2006, at 12:38 PM, George Bosilca wrote:

> Starting a few days ago, I noticed that more and more orteds are
> left over after my runs. Usually, if the job runs to completion, they
> disappear. But if I kill the job or it segfaults, they don't. I
> attached to one of them and got the following stack:
>
> #0 0x9001f7a8 in select ()
> #1 0x00375d34 in select_dispatch (arg=0x39ec6c, tv=0xbfffe664) at ../../../ompi-trunk/opal/event/select.c:202
> #2 0x00373b70 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:485
> #3 0x00237ee0 in orte_iof_base_flush () at ../../../../ompi-trunk/orte/mca/iof/base/iof_base_flush.c:111
> #4 0x004cbb38 in orte_pls_fork_wait_proc (pid=9045, status=9, cbdata=0x50c250) at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:175
> #5 0x002111f0 in do_waitall (options=0) at ../../ompi-trunk/orte/runtime/orte_wait.c:500
> #6 0x00210ac8 in orte_wait_signal_callback (fd=20, event=8, arg=0x26f3f8) at ../../ompi-trunk/orte/runtime/orte_wait.c:366
> #7 0x003737f8 in opal_event_process_active () at ../../../ompi-trunk/opal/event/event.c:428
> #8 0x00373ce8 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:513
> #9 0x00368714 in opal_progress () at ../../ompi-trunk/opal/runtime/opal_progress.c:259
> #10 0x004cdf48 in opal_condition_wait (c=0x4cf0f0, m=0x4cf0b0) at ../../../../../ompi-trunk/opal/threads/condition.h:81
> #11 0x004cde60 in orte_pls_fork_finalize () at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:764
> #12 0x002417d0 in orte_pls_base_finalize () at ../../../../ompi-trunk/orte/mca/pls/base/pls_base_close.c:42
> #13 0x000ddf58 in orte_rmgr_urm_finalize () at ../../../../../ompi-trunk/orte/mca/rmgr/urm/rmgr_urm.c:521
> #14 0x00254ec0 in orte_rmgr_base_close () at ../../../../ompi-trunk/orte/mca/rmgr/base/rmgr_base_close.c:39
> #15 0x0020e574 in orte_system_finalize () at ../../ompi-trunk/orte/runtime/orte_system_finalize.c:65
> #16 0x0020899c in orte_finalize () at ../../ompi-trunk/orte/runtime/orte_finalize.c:42
> #17 0x00002ac8 in main (argc=19, argv=0xbffff17c) at ../../../../ompi-trunk/orte/tools/orted/orted.c:377
>
> Somehow, it is waiting for pid 9045. But this was one of the children,
> and it got the SIGKILL signal (I checked with strace). I wonder whether
> we have a race condition somewhere in the wait_signal code.
>
> Hope that helps,
> george.
>

----
Josh Hursey
jjhursey_at_[hidden]
http://www.open-mpi.org/