Open MPI Development Mailing List Archives

From: Rainer Keller (Keller_at_[hidden])
Date: 2006-01-19 12:23:51


Hi George,
On Thursday 19 January 2006 17:22, George Bosilca wrote:
> I was hopping my patch solve the problem completely ... look like
> it's not the case :( How exactly you get the dead-lock in the
> mpi_test_suite ? Which configure options ? Only --enable-progress-
> threads ?
This happens both with --enable-progress-threads alone and with an
additional --enable-mpi-threads on top.
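That is, with a configure line along these lines (other options omitted
here, so this is only a sketch of the invocation):

./configure --enable-progress-threads --enable-mpi-threads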

Both hang in the same places:
Process 0:
#4 0x40315a56 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
#5 0x40222513 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#6 0x4007d7a2 in opal_condition_wait (c=0x4013c6c0, m=0x4013c720) at
condition.h:64
#7 0x4007d40b in ompi_request_wait_all (count=1, requests=0x80bc1c0,
statuses=0x0) at req_wait.c:159
#8 0x4073083f in ompi_coll_tuned_bcast_intra_basic_linear (buff=0x80c9c90,
count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at
coll_tuned_bcast.c:762
#9 0x4072b002 in ompi_coll_tuned_bcast_intra_dec_fixed (buff=0x80c9c90,
count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at
coll_tuned_decision_fixed.c:175
#10 0x40083dae in PMPI_Bcast (buffer=0x80c9c90, count=1000,
datatype=0x8061de8, root=0, comm=0x80627e0) at pbcast.c:88
#11 0x0804f2cf in tst_coll_bcast_run (env=0xbfffeac0) at tst_coll_bcast.c:74
#12 0x0804bf21 in tst_test_run_func (env=0xbfffeac0) at tst_tests.c:377
#13 0x0804a46a in main (argc=7, argv=0xbfffeb74) at mpi_test_suite.c:319

Process 1:
#4 0x40315a56 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
#5 0x40222513 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#6 0x406941e3 in opal_condition_wait (c=0x4013c6c0, m=0x4013c720) at
condition.h:64
#7 0x406939f2 in mca_pml_ob1_recv (addr=0x80c9c58, count=1000,
datatype=0x8061de8, src=0, tag=-17, comm=0x80627e0, status=0x0) at
pml_ob1_irecv.c:96
#8 0x407307a4 in ompi_coll_tuned_bcast_intra_basic_linear (buff=0x80c9c58,
count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at
coll_tuned_bcast.c:729
#9 0x4072b002 in ompi_coll_tuned_bcast_intra_dec_fixed (buff=0x80c9c58,
count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at
coll_tuned_decision_fixed.c:175
#10 0x40083dae in PMPI_Bcast (buffer=0x80c9c58, count=1000,
datatype=0x8061de8, root=0, comm=0x80627e0) at pbcast.c:88
#11 0x0804f2cf in tst_coll_bcast_run (env=0xbfffeac0) at tst_coll_bcast.c:74
#12 0x0804bf21 in tst_test_run_func (env=0xbfffeac0) at tst_tests.c:377
#13 0x0804a46a in main (argc=7, argv=0xbfffeb74) at mpi_test_suite.c:319
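
For reference, the Bcast that the test suite hangs in boils down to
something like the following sketch (arguments taken from the backtraces:
count=1000, MPI_INT, root 0 on MPI_COMM_WORLD); I have not checked whether
this minimal version deadlocks on its own outside the test suite:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int buf[1000];
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* root fills the buffer, all other ranks receive it via the Bcast */
    for (i = 0; i < 1000; i++)
        buf[i] = (rank == 0) ? i : -1;

    /* same count/datatype/root/comm as in tst_coll_bcast_run above */
    MPI_Bcast(buf, 1000, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: buf[999] = %d\n", rank, buf[999]);

    MPI_Finalize();
    return 0;
}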

And yes, when I run with the basic coll component, it also hangs ;-]

mpirun -np 2 --mca coll basic ./mpi_test_suite -r FULL -c MPI_COMM_WORLD -d MPI_INT

#4 0x40315a56 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
#5 0x40222513 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#6 0x406941e3 in opal_condition_wait (c=0x4013c6c0, m=0x4013c720) at
condition.h:64
#7 0x406939f2 in mca_pml_ob1_recv (addr=0x80c4ca0, count=1000,
datatype=0x8061de8, src=0, tag=-17, comm=0x80627e0, status=0x0) at
pml_ob1_irecv.c:96
#8 0x4070e402 in mca_coll_basic_bcast_lin_intra (buff=0x80c4ca0, count=1000,
datatype=0x8061de8, root=0, comm=0x80627e0) at coll_basic_bcast.c:57
#9 0x40083dae in PMPI_Bcast (buffer=0x80c4ca0, count=1000,
datatype=0x8061de8, root=0, comm=0x80627e0) at pbcast.c:88
#10 0x0804f2cf in tst_coll_bcast_run (env=0xbfffeab0) at tst_coll_bcast.c:74
#11 0x0804bf21 in tst_test_run_func (env=0xbfffeab0) at tst_tests.c:377
#12 0x0804a46a in main (argc=7, argv=0xbfffeb64) at mpi_test_suite.c:319

Now, for what it's worth, I ran with helgrind to check for possible
race conditions, and it spews out:
==20240== Possible data race writing variable at 0x1D84F46C
==20240== at 0x1DA8BE61: mca_oob_tcp_recv (oob_tcp_recv.c:129)
==20240== by 0x1D73A636: mca_oob_recv_packed (oob_base_recv.c:69)
==20240== by 0x1D73B2B0: mca_oob_xcast (oob_base_xcast.c:133)
==20240== by 0x1D511138: ompi_mpi_init (ompi_mpi_init.c:421)
==20240== Address 0x1D84F46C is 1020 bytes inside a block of size 3168
alloc'd by thread 1
==20240== at 0x1D4A80B4: malloc
(in /usr/lib/valgrind/vgpreload_helgrind.so)
==20240== by 0x1D7DF7BE: opal_free_list_grow (opal_free_list.c:94)
==20240== by 0x1D7DF754: opal_free_list_init (opal_free_list.c:79)
==20240== by 0x1DA815E3: mca_oob_tcp_component_init (oob_tcp.c:530)

So, this was my initial look into whether we may have races in
opal_free_list / ompi_free_list...
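
In case someone wants to reproduce the helgrind output: the run was done
roughly along these lines (the exact invocation may have differed slightly):

mpirun -np 2 valgrind --tool=helgrind ./mpi_test_suite -r FULL -c MPI_COMM_WORLD -d MPI_INT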

CU,
Rainer

-- 
---------------------------------------------------------------------
Dipl.-Inf. Rainer Keller       email: keller_at_[hidden]
  High Performance Computing     Tel: ++49 (0)711-685 5858
    Center Stuttgart (HLRS)        Fax: ++49 (0)711-685 5832
  POSTAL:Nobelstrasse 19             http://www.hlrs.de/people/keller
  ACTUAL:Allmandring 30, R. O.030      AIM:rusraink
  70550 Stuttgart