Open MPI User's Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2007-04-05 17:24:20


Thanks Herve - and Rozenn too.

I can't speak to the thread lock issue as it appears to be occurring in the
MPI side of the code.

As to the spawn limit, I honestly never checked the 1.1.x code family as we
aren't planning any repairs to it anyway. My observations were based on the
1.2 family. We have done our own fairly extensive testing and found that
there are system-imposed limits that do cause problems, but the levels at
which they occur are *very* system dependent: they depend on kernel
configuration parameters that vary across releases, on how your system
admin configured things, and so on. They are therefore impossible to predict.
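
As a rough illustration of how such limits can be inspected on a given
machine, here is a minimal C sketch that reads two per-process limits with
getrlimit(). Which limits actually constrain spawning is not stated above;
RLIMIT_NOFILE and RLIMIT_NPROC are assumptions chosen for illustration, and
the helper names soft_limit and print_spawn_related_limits are hypothetical.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: return the soft limit for `resource`.
 * Returns -1 if getrlimit() fails and -2 if the limit is unlimited. */
long long soft_limit(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -2;
    return (long long)rl.rlim_cur;
}

/* Print one limit in a readable form. */
static void print_limit(const char *name, long long value)
{
    if (value == -1)
        printf("%s: could not be read\n", name);
    else if (value == -2)
        printf("%s: unlimited\n", name);
    else
        printf("%s: %lld\n", name, value);
}

/* Assumption: open descriptors and process count are the limits of
 * interest. The text above does not say which limits are involved. */
void print_spawn_related_limits(void)
{
    print_limit("RLIMIT_NOFILE (open descriptors)", soft_limit(RLIMIT_NOFILE));
#ifdef RLIMIT_NPROC
    print_limit("RLIMIT_NPROC (processes per user)", soft_limit(RLIMIT_NPROC));
#endif
}
```

Calling print_spawn_related_limits() from main() on two different hosts is
one quick way to see how much these values differ between installations.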

What we are going to do is modify the code so we can at least detect these
situations, alert you to them, and gracefully exit when we encounter them.
Hopefully, we'll have those fixes out soon.

Thanks again
Ralph

On 4/5/07 2:47 PM, "herve PETIT Perso" <hpetit_at_[hidden]> wrote:

> Some clarification on this thread:
>
> I have read the answer you provided for the thread "MPI_Comm_Spawn" posted by
> rozzen.vincent.
> I have reproduced the same behavior on my Debian sarge installation,
> i.e.:
> 1) MPI_Comm_spawn fails after 31 spawns ("--disable-threads" is set)
> 2) MPI applications lock up when "--enable-threads" is set
>
> * For issue 1)
> The Open MPI 1.2 release solves the problem, so it does not seem to be a
> system limitation. In any case, it is now behind us.
>
> * For issue 2)
> I have been in contact with Rozenn. After a short talk with her, I ran
> a new test with an "--enable-debug" build of Open MPI 1.2 (stable version).
>
> The gdb log is fairly explicit about the deadlock situation.
> -----------------------------------------------------
> main*******************************
> main : Start MPI*
> opal_mutex_lock(): Resource deadlock avoided
> [host10:20607] *** Process received signal ***
> [host10:20607] Signal: Aborted (6)
> [host10:20607] Signal code: (-6)
> [host10:20607] [ 0] [0xffffe440]
> [host10:20607] [ 1] /lib/tls/libc.so.6(abort+0x1d2) [0x4029cfa2]
> [host10:20607] [ 2] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061d25]
> [host10:20607] [ 3] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x4006030e]
> [host10:20607] [ 4] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061e23]
> [host10:20607] [ 5] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40060175]
> [host10:20607] [ 6] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061da3]
> [host10:20607] [ 7] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40062315]
> [host10:20607] [ 8] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_proc_unpack+0x15a) [0x40061392]
> [host10:20607] [ 9] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_comm_connect_accept+0x45c) [0x4004dd62]
> [host10:20607] [10] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(PMPI_Comm_spawn+0x346) [0x400949a8]
> [host10:20607] [11] spawn(main+0xe2) [0x80489a6]
> [host10:20607] [12] /lib/tls/libc.so.6(__libc_start_main+0xf4) [0x40288974]
> [host10:20607] [13] spawn [0x8048821]
> [host10:20607] *** End of error message ***
> [host10:20602] [0,0,0]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> -----------------------------------------------------
>
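
For readers of the log above: "Resource deadlock avoided" is the usual glibc
error text for EDEADLK, which POSIX error-checking mutexes return when a
thread tries to lock a mutex it already holds. The sketch below reproduces
that error code in isolation. It is only an illustration under the
assumption that the debug build reports an error of this kind; it does not
use or reproduce Open MPI's own locking code, and the helper name
relock_errorcheck_mutex is hypothetical.

```c
#define _XOPEN_SOURCE 700  /* expose PTHREAD_MUTEX_ERRORCHECK under strict modes */
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: lock an error-checking mutex twice from the same
 * thread and return the result of the second attempt.
 * Returns -1 if the mutex could not be set up. */
int relock_errorcheck_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    if (pthread_mutex_init(&mutex, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    if (pthread_mutex_lock(&mutex) != 0) {
        pthread_mutex_destroy(&mutex);
        return -1;
    }

    /* Second lock by the owning thread: an error-checking mutex reports
     * EDEADLK here instead of blocking forever. */
    rc = pthread_mutex_lock(&mutex);

    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return rc;
}
```

With a default (non-error-checking) mutex the second lock would typically
just block, which would look like a hang rather than an abort.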
>
> So it seems that the lock is in the spawn code.
> I have also discovered that the spawned program is locked in the spawn
> mechanism as well.
> Below is a gdb backtrace from the spawned program.
>
>
> -----------------------------------------------------
> #0 0x4019c436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1 0x40199893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2 0xbffff4b8 in ?? ()
> #3 0xbffff4b8 in ?? ()
> #4 0x00000000 in ?? ()
> #5 0x400a663c in __JCR_LIST__ () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #6 0x400a663c in __JCR_LIST__ () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #7 0x401347a4 in opal_condition_t_class () from /usr/local/Mpi/CURRENT_MPI/lib/libopen-pal.so.0
> #8 0xbffff4e8 in ?? ()
> #9 0x400554a8 in ompi_proc_construct () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #10 0x400554a8 in ompi_proc_construct () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #11 0x40056946 in ompi_proc_find_and_add () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #12 0x4005609e in ompi_proc_unpack () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #13 0x400481cd in ompi_comm_connect_accept () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #14 0x40049b2a in ompi_comm_dyn_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #15 0x40058e6d in ompi_mpi_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #16 0x4007e122 in PMPI_Init_thread () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #17 0x08048a3b in main (argc=1, argv=0xbffff844) at ExeToSpawned6.c:31
> -----------------------------------------------------
>
> Hopefully this helps you investigate.
>
>
>
> Herve