Open MPI User's Mailing List Archives

From: Jeremy Buisson (jbuisson_at_[hidden])
Date: 2007-01-30 03:40:38


Dear Open MPI users list,

From time to time, I hit a mutex deadlock in Open MPI 1.1.2. The stack trace
is at the end of this mail. The deadlock seems to be caused by lines 118 and
119 of ompi/mca/btl/tcp/btl_tcp.c, in the function mca_btl_tcp_add_procs:
            OBJ_RELEASE(tcp_endpoint);
            OPAL_THREAD_UNLOCK(&tcp_proc->proc_lock);
(I have not checked whether the line numbers have changed since 1.1.2.)
Releasing tcp_endpoint runs its destructor, which calls
mca_btl_tcp_proc_remove. That function tries to acquire the mutex
tcp_proc->proc_lock, which the same thread already holds
(OPAL_THREAD_LOCK(&tcp_proc->proc_lock) at line 103 of
ompi/mca/btl/tcp/btl_tcp.c). Swapping the two lines above (i.e. releasing the
mutex before destroying tcp_endpoint) seems to be sufficient to fix the
deadlock. Alternatively, should the changes made to the
mca_btl_tcp_proc_insert function be reverted, rather than releasing the mutex
before tcp_endpoint?
As far as I can tell, the problem is still present in trunk revision 13359.
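
To illustrate the lock ordering described above, here is a minimal sketch in
plain pthreads. It is not Open MPI code: proc_t, endpoint_destruct,
release_then_unlock and unlock_then_release are hypothetical stand-ins for
tcp_proc, the endpoint destructor and the two orderings of lines 118 and 119.
It assumes an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), for which a
relock by the owning thread returns EDEADLK, the error whose message is
"Resource deadlock avoided" in the trace below.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for tcp_proc: a lock plus the state it protects. */
typedef struct {
    pthread_mutex_t lock;
    int endpoint_count;
} proc_t;

/* Create the lock as an error-checking mutex, so that a relock by the
 * owning thread is reported as EDEADLK instead of hanging. */
static void proc_init(proc_t *p)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&p->lock, &attr);
    assert(rc == 0);
    (void)rc;
    pthread_mutexattr_destroy(&attr);
    p->endpoint_count = 1;
}

/* Plays the role of the endpoint destructor: it takes the proc lock itself
 * in order to remove the endpoint. Returns 0 or the pthread error code. */
static int endpoint_destruct(proc_t *p)
{
    int rc = pthread_mutex_lock(&p->lock);
    if (rc != 0) {
        return rc; /* EDEADLK when the caller already holds the lock */
    }
    p->endpoint_count--;
    pthread_mutex_unlock(&p->lock);
    return 0;
}

/* Reported order: destroy the endpoint while still holding the lock. */
int release_then_unlock(void)
{
    proc_t p;
    proc_init(&p);
    pthread_mutex_lock(&p.lock);
    int rc = endpoint_destruct(&p); /* relocks a mutex we already own */
    pthread_mutex_unlock(&p.lock);
    pthread_mutex_destroy(&p.lock);
    return rc;
}

/* Proposed order: drop the lock first, then destroy the endpoint. */
int unlock_then_release(void)
{
    proc_t p;
    proc_init(&p);
    pthread_mutex_lock(&p.lock);
    pthread_mutex_unlock(&p.lock);
    int rc = endpoint_destruct(&p); /* lock is free, so this succeeds */
    pthread_mutex_destroy(&p.lock);
    return rc;
}
```

Called from a small main function, release_then_unlock() returns EDEADLK and
unlock_then_release() returns 0. With a default (non error-checking) mutex,
the first ordering could simply hang instead of reporting an error.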

Second point: is there any reason why MPI_Comm_spawn can only start the new
process(es) on hosts listed in either the --host option or the hostfile? Or
did I miss something?

Best regards,
Jeremy

------------------------------------------------------------------------------
Stack trace as dumped by Open MPI (the gdb version follows):
opal_mutex_lock(): Resource deadlock avoided
Signal:6 info.si_errno:0(Success) si_code:-6()
[0] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libopal.so.0 [0x8addeb]
[1] func:/lib/tls/libpthread.so.0 [0x176e40]
[2] func:/lib/tls/libc.so.6(abort+0x1d5) [0xa294e5]
[3] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so [0x65f8a3]
[4] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0x2a) [0x65fff0]
[5] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so [0x65cb24]
[6] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so [0x659465]
[7] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x10f) [0x65927b]
[8] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x1bb) [0x628023]
[9] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xd6) [0x61650b]
[10] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libmpi.so.0(ompi_comm_get_rport+0x1f8) [0xb82303]
[11] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libmpi.so.0(ompi_comm_connect_accept+0xbb) [0xb81b43]
[12] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libmpi.so.0(PMPI_Comm_spawn+0x3de) [0xbb671a]
[13] func:/home1/jbuisson/target/bin/mpi-spawner(__gxx_personality_v0+0x3d2) [0x804bb8e]
[14] func:/home1/jbuisson/target/bin/mpi-spawner [0x804bdff]
[15] func:/home1/jbuisson/target/bin/mpi-spawner [0x804bfd4]
[16] func:/lib/tls/libc.so.6(__libc_start_main+0xda) [0xa1578a]
[17] func:/home1/jbuisson/target/bin/mpi-spawner(__gxx_personality_v0+0x75) [0x804b831]
*** End of error message ***

Same stack, dumped by gdb:
#0 0x00176357 in __pause_nocancel () from /lib/tls/libpthread.so.0
#1 0x008ade9b in opal_show_stackframe (signo=6, info=0xbfff9290, p=0xbfff9310) at stacktrace.c:306
#2 <signal handler called>
#3 0x00a27cdf in raise () from /lib/tls/libc.so.6
#4 0x00a294e5 in abort () from /lib/tls/libc.so.6
#5 0x0065f8a3 in opal_mutex_lock (m=0x8ff8250) at ../../../../opal/threads/mutex_unix.h:104
#6 0x0065fff0 in mca_btl_tcp_proc_remove (btl_proc=0x8ff8220, btl_endpoint=0x900eba0) at btl_tcp_proc.c:296
#7 0x0065cb24 in mca_btl_tcp_endpoint_destruct (endpoint=0x900eba0) at btl_tcp_endpoint.c:99
#8 0x00659465 in opal_obj_run_destructors (object=0x900eba0) at ../../../../opal/class/opal_object.h:405
#9 0x0065927b in mca_btl_tcp_add_procs (btl=0x8e57c30, nprocs=1, ompi_procs=0x8ff7ac8, peers=0x8ff7ad8, reachable=0xbfff98e4) at btl_tcp.c:118
#10 0x00628023 in mca_bml_r2_add_procs (nprocs=1, procs=0x8ff7ac8, bml_endpoints=0x8ff60b8, reachable=0xbfff98e4) at bml_r2.c:231
#11 0x0061650b in mca_pml_ob1_add_procs (procs=0xbfff9930, nprocs=1) at pml_ob1.c:133
#12 0x00b82303 in ompi_comm_get_rport (port=0x0, send_first=0, proc=0x8e51c70, tag=2000) at communicator/comm_dyn.c:305
#13 0x00b81b43 in ompi_comm_connect_accept (comm=0x8ff8ce0, root=0, port=0x0, send_first=0, newcomm=0xbfff9a38, tag=2000) at communicator/comm_dyn.c:85
#14 0x00bb671a in PMPI_Comm_spawn (command=0x8ff88f0 "/home1/jbuisson/target/bin/sample-npb-ft-pp", argv=0xbfff9b40, maxprocs=1, info=0x8ff73e0, root=0, comm=0x8ff8ce0, intercomm=0xbfff9aa4, array_of_errcodes=0x0) at pcomm_spawn.c:110