
Open MPI User's Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2007-01-30 12:57:00


Jeremy,

You're right. Thanks for pointing it out. I'll make the change in the trunk.

   george.
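
To make the problem concrete, here is a minimal sketch of the two orderings discussed in the quoted mail below, paraphrasing the lines Jeremy cites from mca_btl_tcp_add_procs() in btl_tcp.c. Only the macros, functions and variables he quotes come from the source; the surrounding structure is illustrative rather than the actual 1.1.2 code.

    /* Deadlocking order (btl_tcp.c:118-119 as quoted below): proc_lock is
     * still held when the endpoint's destructor runs, and
     * mca_btl_tcp_proc_remove() tries to take the same non-recursive lock
     * again, so opal_mutex_lock() aborts with "Resource deadlock avoided". */
    OPAL_THREAD_LOCK(&tcp_proc->proc_lock);    /* taken earlier, around line 103 */
    /* ... endpoint setup fails ... */
    OBJ_RELEASE(tcp_endpoint);                 /* destructor calls mca_btl_tcp_proc_remove(),
                                                  which locks proc_lock again -> deadlock */
    OPAL_THREAD_UNLOCK(&tcp_proc->proc_lock);

    /* Reordered as suggested: drop the lock first, then release the
     * endpoint so its destructor can take proc_lock itself. */
    OPAL_THREAD_UNLOCK(&tcp_proc->proc_lock);
    OBJ_RELEASE(tcp_endpoint);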

On Jan 30, 2007, at 3:40 AM, Jeremy Buisson wrote:

> Dear Open MPI users list,
>
> From time to time, I experience a mutex deadlock in Open MPI 1.1.2.
> The stack trace is available at the end of this mail. The deadlock
> seems to be caused by lines 118 & 119 of ompi/mca/btl/tcp/btl_tcp.c,
> in function mca_btl_tcp_add_procs:
>     OBJ_RELEASE(tcp_endpoint);
>     OPAL_THREAD_UNLOCK(&tcp_proc->proc_lock);
> (Of course, I did not check whether the line numbers have changed
> since 1.1.2.) Releasing tcp_endpoint causes a call to
> mca_btl_tcp_proc_remove, which attempts to acquire the mutex
> tcp_proc->proc_lock that is already held by the thread
> (OPAL_THREAD_LOCK(&tcp_proc->proc_lock) at line 103 of the same
> file). Swapping the two lines above (i.e. releasing the mutex before
> destructing tcp_endpoint) seems to be sufficient to fix the
> deadlock. Perhaps the changes made in the mca_btl_tcp_proc_insert
> function should be reverted instead of releasing the mutex before
> destroying tcp_endpoint?
> As far as I can tell, the problem is still present in trunk
> revision 13359.
>
> Second point: is there any reason why MPI_Comm_spawn is restricted
> to executing the new process(es) only on hosts listed either in the
> --host option or in the hostfile? Or did I miss something?
>
> Best regards,
> Jeremy
>
> ------------------------------------------------------------------------------
> Stack trace as dumped by Open MPI (the gdb version follows):
> opal_mutex_lock(): Resource deadlock avoided
> Signal:6 info.si_errno:0(Success) si_code:-6()
> [0] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libopal.so.0 [0x8addeb]
> [1] func:/lib/tls/libpthread.so.0 [0x176e40]
> [2] func:/lib/tls/libc.so.6(abort+0x1d5) [0xa294e5]
> [3] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so [0x65f8a3]
> [4] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0x2a) [0x65fff0]
> [5] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so [0x65cb24]
> [6] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so [0x659465]
> [7] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x10f) [0x65927b]
> [8] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x1bb) [0x628023]
> [9] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xd6) [0x61650b]
> [10] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libmpi.so.0(ompi_comm_get_rport+0x1f8) [0xb82303]
> [11] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libmpi.so.0(ompi_comm_connect_accept+0xbb) [0xb81b43]
> [12] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libmpi.so.0(PMPI_Comm_spawn+0x3de) [0xbb671a]
> [13] func:/home1/jbuisson/target/bin/mpi-spawner(__gxx_personality_v0+0x3d2) [0x804bb8e]
> [14] func:/home1/jbuisson/target/bin/mpi-spawner [0x804bdff]
> [15] func:/home1/jbuisson/target/bin/mpi-spawner [0x804bfd4]
> [16] func:/lib/tls/libc.so.6(__libc_start_main+0xda) [0xa1578a]
> [17] func:/home1/jbuisson/target/bin/mpi-spawner(__gxx_personality_v0+0x75) [0x804b831]
> *** End of error message ***
>
>
> Same stack, dumped by gdb:
> #0 0x00176357 in __pause_nocancel () from /lib/tls/libpthread.so.0
> #1 0x008ade9b in opal_show_stackframe (signo=6, info=0xbfff9290, p=0xbfff9310) at stacktrace.c:306
> #2 <signal handler called>
> #3 0x00a27cdf in raise () from /lib/tls/libc.so.6
> #4 0x00a294e5 in abort () from /lib/tls/libc.so.6
> #5 0x0065f8a3 in opal_mutex_lock (m=0x8ff8250) at ../../../../opal/threads/mutex_unix.h:104
> #6 0x0065fff0 in mca_btl_tcp_proc_remove (btl_proc=0x8ff8220, btl_endpoint=0x900eba0) at btl_tcp_proc.c:296
> #7 0x0065cb24 in mca_btl_tcp_endpoint_destruct (endpoint=0x900eba0) at btl_tcp_endpoint.c:99
> #8 0x00659465 in opal_obj_run_destructors (object=0x900eba0) at ../../../../opal/class/opal_object.h:405
> #9 0x0065927b in mca_btl_tcp_add_procs (btl=0x8e57c30, nprocs=1, ompi_procs=0x8ff7ac8, peers=0x8ff7ad8, reachable=0xbfff98e4) at btl_tcp.c:118
> #10 0x00628023 in mca_bml_r2_add_procs (nprocs=1, procs=0x8ff7ac8, bml_endpoints=0x8ff60b8, reachable=0xbfff98e4) at bml_r2.c:231
> #11 0x0061650b in mca_pml_ob1_add_procs (procs=0xbfff9930, nprocs=1) at pml_ob1.c:133
> #12 0x00b82303 in ompi_comm_get_rport (port=0x0, send_first=0, proc=0x8e51c70, tag=2000) at communicator/comm_dyn.c:305
> #13 0x00b81b43 in ompi_comm_connect_accept (comm=0x8ff8ce0, root=0, port=0x0, send_first=0, newcomm=0xbfff9a38, tag=2000) at communicator/comm_dyn.c:85
> #14 0x00bb671a in PMPI_Comm_spawn (command=0x8ff88f0 "/home1/jbuisson/target/bin/sample-npb-ft-pp", argv=0xbfff9b40, maxprocs=1, info=0x8ff73e0, root=0, comm=0x8ff8ce0, intercomm=0xbfff9aa4, array_of_errcodes=0x0) at pcomm_spawn.c:110
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
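
Regarding the second question in the quoted mail: the MPI-2 reserved info key "host" is the standard way to ask MPI_Comm_spawn for a particular node. The sketch below only shows how such a request is expressed; the host name "node42" and the "./worker" command are placeholders, and whether Open MPI 1.1.x will actually launch on a host that is not already listed in the hostfile or given via --host is precisely the restriction being asked about.

    /* Sketch: requesting a target host for MPI_Comm_spawn via the
     * MPI-2 reserved "host" info key. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm intercomm;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "node42");   /* placeholder host name */

        /* Spawn one copy of the (placeholder) worker binary on the
         * requested host; errors are ignored for brevity. */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }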