
From: herve PETIT Perso (hpetit_at_[hidden])
Date: 2007-04-05 16:47:16


Some additional details about this thread,

I have read the answer you provided in the "MPI_Comm_Spawn" thread posted by rozzen.vincent.
I have reproduced the same behavior on my Debian Sarge installation, i.e.:
1) MPI_Comm_spawn fails after 31 spawns ("--disable-threads" is set)
2) MPI applications lock up when "--enable-threads" is set
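
Both tests are basically a loop around MPI_Comm_spawn. A simplified sketch of the parent side (not the exact program, names are placeholders):
-----------------------------------------------------
/* Simplified sketch of the parent test (not the exact program):
 * repeatedly spawn one child and disconnect from it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i, provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    for (i = 0; i < 100; i++) {
        MPI_Comm child;
        /* "ExeToSpawned" is a placeholder for the child binary */
        MPI_Comm_spawn("ExeToSpawned", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        printf("spawn %d done\n", i);
        MPI_Comm_disconnect(&child);
    }

    MPI_Finalize();
    return 0;
}
-----------------------------------------------------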

* For issue 1)
The Open MPI 1.2 release solves the problem, so it does not seem to be a system limitation; in any case, it is now behind us.

* For issue 2)
I have been in contact with Rozenn. After a short discussion with her, I ran a new test with Open MPI 1.2 (stable version) built with "--enable-debug".

The runtime output below is fairly explicit about the deadlock situation.
-----------------------------------------------------
main*******************************
main : Start MPI*
opal_mutex_lock(): Resource deadlock avoided
[host10:20607] *** Process received signal ***
[host10:20607] Signal: Aborted (6)
[host10:20607] Signal code: (-6)
[host10:20607] [ 0] [0xffffe440]
[host10:20607] [ 1] /lib/tls/libc.so.6(abort+0x1d2) [0x4029cfa2]
[host10:20607] [ 2] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061d25]
[host10:20607] [ 3] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x4006030e]
[host10:20607] [ 4] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061e23]
[host10:20607] [ 5] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40060175]
[host10:20607] [ 6] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061da3]
[host10:20607] [ 7] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40062315]
[host10:20607] [ 8] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_proc_unpack+0x15a) [0x40061392]
[host10:20607] [ 9] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_comm_connect_accept+0x45c) [0x4004dd62]
[host10:20607] [10] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(PMPI_Comm_spawn+0x346) [0x400949a8]
[host10:20607] [11] spawn(main+0xe2) [0x80489a6]
[host10:20607] [12] /lib/tls/libc.so.6(__libc_start_main+0xf4) [0x40288974]
[host10:20607] [13] spawn [0x8048821]
[host10:20607] *** End of error message ***
[host10:20602] [0,0,0]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
------------------------------------------------------------------------------------
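
By the way, the "opal_mutex_lock(): Resource deadlock avoided" line is the message glibc produces when an error-checking pthread mutex is locked a second time by the thread that already owns it (which the "--enable-debug" build apparently checks for). A tiny standalone example, nothing to do with Open MPI itself, just to show where the wording comes from:
-----------------------------------------------------
/* Standalone illustration (my own code, not Open MPI): locking an
 * error-checking mutex twice from the same thread returns EDEADLK,
 * whose strerror() text is "Resource deadlock avoided". */
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutex_t m;
    pthread_mutexattr_t attr;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);
    rc = pthread_mutex_lock(&m);   /* second lock by the owner thread */
    printf("second lock: %s\n", strerror(rc));
    return 0;
}
-----------------------------------------------------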

So it seems that the lock is in the spawn code.
I have also discovered that the spawned program itself is stuck in the spawn mechanism.
Below is a gdb backtrace from the spawned program.

------------------------------------------------------------------------------------------
#0 0x4019c436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0x40199893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
#2 0xbffff4b8 in ?? ()
#3 0xbffff4b8 in ?? ()
#4 0x00000000 in ?? ()
#5 0x400a663c in __JCR_LIST__ () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#6 0x400a663c in __JCR_LIST__ () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#7 0x401347a4 in opal_condition_t_class () from /usr/local/Mpi/CURRENT_MPI/lib/libopen-pal.so.0
#8 0xbffff4e8 in ?? ()
#9 0x400554a8 in ompi_proc_construct () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#10 0x400554a8 in ompi_proc_construct () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#11 0x40056946 in ompi_proc_find_and_add () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#12 0x4005609e in ompi_proc_unpack () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#13 0x400481cd in ompi_comm_connect_accept () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#14 0x40049b2a in ompi_comm_dyn_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#15 0x40058e6d in ompi_mpi_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#16 0x4007e122 in PMPI_Init_thread () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#17 0x08048a3b in main (argc=1, argv=0xbffff844) at ExeToSpawned6.c:31
-----------------------------------------------------------------------------------------------
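
For completeness, the spawned program does nothing unusual; it is blocked before it even returns from MPI_Init_thread. Roughly (a simplified sketch, not the actual ExeToSpawned6.c):
-----------------------------------------------------
/* Simplified sketch of the child side (not the actual ExeToSpawned6.c).
 * According to the backtrace, the process never returns from
 * MPI_Init_thread: it is stuck while Open MPI connects back to the
 * parent (ompi_comm_dyn_init / ompi_comm_connect_accept). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;
    MPI_Comm parent;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);   /* hangs here */

    MPI_Comm_get_parent(&parent);
    printf("child up, provided = %d\n", provided);

    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}
-----------------------------------------------------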

Hopefully this helps you investigate.

Herve