
Open MPI User's Mailing List Archives


Subject: [OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple -- part II
From: Eric Chamberland (Eric.Chamberland_at_[hidden])
Date: 2013-12-02 21:33:07


Hi,

I'm just opening a new "chapter" on the same subject. ;-)

We are using Open MPI 1.6.5 (compiled with --enable-thread-multiple) with
PETSc 3.4.3 (on the colosse supercomputer:
http://www.calculquebec.ca/en/resources/compute-servers/colosse). We
observed a deadlock between threads inside the openib btl.

We successfully bypassed the deadlock in two different ways:

#1- launching the code with "--mca btl ^openib"

#2- compiling Open MPI 1.6.5 *without* the "--enable-thread-multiple" option.
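For reference, the two workarounds look roughly like this (the application
name, process count, and install prefix below are placeholders, not our
actual values):

```shell
# Workaround #1: disable the openib BTL at launch time, so Open MPI
# falls back to its other transports (e.g. shared memory and TCP)
mpiexec --mca btl ^openib -np 8 ./our_application

# Workaround #2: rebuild Open MPI 1.6.5 without MPI_THREAD_MULTIPLE
# support, i.e. simply omit --enable-thread-multiple at configure time
./configure --prefix=/opt/openmpi-1.6.5
make all install
```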

When the code hangs, here are some backtraces (from different processes)
that we obtained:

#0 0x00007fb4a6a03795 in pthread_spin_lock () from /lib64/libpthread.so.0
#1 0x00007fb49db7ea7b in ?? () from /usr/lib64/libmlx4-rdmav2.so
#2 0x00007fb4a878d469 in ibv_poll_cq () at /usr/include/infiniband/verbs.h:884
#3 poll_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3563
#4 progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3694
#5 btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
#6 0x00007fb4a8973d32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
#7 0x00007fb4a87404f0 in opal_condition_wait (count=25695904, requests=0x100, statuses=0x7fff9b7f1320) at ../../openmpi-1.6.5/opal/threads/condition.h:92
#8 ompi_request_default_wait_all (count=25695904, requests=0x100, statuses=0x7fff9b7f1320) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263

#0 0x00007f731d1100b8 in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1 0x00007f731ee9b3b7 in opal_mutex_unlock () at ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:123
#2 progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3688
#3 btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
#4 0x00007f731f081d32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
#5 0x00007f731ee4e4f0 in opal_condition_wait (count=25649104, requests=0x0, statuses=0x1875fd0) at ../../openmpi-1.6.5/opal/threads/condition.h:92
#6 ompi_request_default_wait_all (count=25649104, requests=0x0, statuses=0x1875fd0) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
#7 0x00007f731eec2644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x1875fd0, rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc, op=0x1875fd0, comm=0x5e80, module=0xca4ec20) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
#8 0x00007f731eebe2ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1875fd0, rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc, op=0x1875fd0, comm=0x5e80, module=0x159d8330) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
#9 0x00007f731ee5cad9 in PMPI_Allreduce (sendbuf=0x1875fd0, recvbuf=0x0, count=25649104, datatype=0x7f72ce8f80fc, op=0x1875fd0, comm=0x5e80) at pallreduce.c:105

#0 opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:206
#1 0x00007f8e3d8844f0 in opal_condition_wait (count=0, requests=0x0, statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/opal/threads/condition.h:92
#2 ompi_request_default_wait_all (count=0, requests=0x0, statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
#3 0x00007f8e3d8f8644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x0, rbuf=0x0, count=1037994528, dtype=0x1, op=0x0, comm=0x60bb, module=0xcb86ce0) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
#4 0x00007f8e3d8f42ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x0, rbuf=0x0, count=1037994528, dtype=0x1, op=0x0, comm=0x60bb, module=0x171d59a0) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
#5 0x00007f8e3d892ad9 in PMPI_Allreduce (sendbuf=0x0, recvbuf=0x0, count=1037994528, datatype=0x1, op=0x0, comm=0x60bb) at pallreduce.c:105

#0 0x00007f7ef7d0b258 in pthread_mutex_lock@plt () from /software/MPI/openmpi/1.6.5_intel/lib/libmpi.so.1
#1 0x00007f7ef7d72377 in opal_mutex_lock () at ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:109
#2 progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3650
#3 btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
#4 0x00007f7ef7f58d32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
#5 0x00007f7ef7d254f0 in opal_condition_wait (count=25625488, requests=0x0, statuses=0x7f7ef8324208) at ../../openmpi-1.6.5/opal/threads/condition.h:92
#6 ompi_request_default_wait_all (count=25625488, requests=0x0, statuses=0x7f7ef8324208) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
#7 0x00007f7ef7d99644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x1870390, rbuf=0x0, count=-130924024, dtype=0x0, op=0x1874cb0, comm=0x60bc, module=0xca6a360) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
#8 0x00007f7ef7d952ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1870390, rbuf=0x0, count=-130924024, dtype=0x0, op=0x1874cb0, comm=0x60bc, module=0x14512a20) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
#9 0x00007f7ef7d33ad9 in PMPI_Allreduce (sendbuf=0x1870390, recvbuf=0x0, count=-130924024, datatype=0x0, op=0x1874cb0, comm=0x60bc) at pallreduce.c:105

#0 0x00007f1fe7bcd0b8 in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1 0x00007f1fe99583b7 in opal_mutex_unlock () at ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:123
#2 progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3688
#3 btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
#4 0x00007f1fe9b3ed32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
#5 0x00007f1fe990b4f0 in opal_condition_wait (count=25659568, requests=0x0, statuses=0x18788b0) at ../../openmpi-1.6.5/opal/threads/condition.h:92
#6 ompi_request_default_wait_all (count=25659568, requests=0x0, statuses=0x18788b0) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
#7 0x00007f1fe997f644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x18788b0, rbuf=0x0, count=25659568, dtype=0x7f1f9949727c, op=0x18788b0, comm=0x3db6, module=0xccda900) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
#8 0x00007f1fe997b2ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x18788b0, rbuf=0x0, count=25659568, dtype=0x7f1f9949727c, op=0x18788b0, comm=0x3db6, module=0x170dbf00) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
#9 0x00007f1fe9919ad9 in PMPI_Allreduce (sendbuf=0x18788b0, recvbuf=0x0, count=25659568, datatype=0x7f1f9949727c, op=0x18788b0, comm=0x3db6) at pallreduce.c:105

Attached is the output of "ompi_info -all".

Here is the command line:

"mpiexec -mca mpi_show_mca_params all -mca oob_tcp_peer_retries 1000
--output-filename PneuSurfaceLibre.out --timestamp-output
--report-bindings -mca orte_num_sockets 2 -mca orte_num_cores 4
--bind-to-socket -npersocket 1
our_housecode_executable_based_on_petsc_343 and_parameters..."
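
In case anyone wants to collect the same kind of information on their own
runs: a generic way to dump all thread stacks from an already-hung MPI
process (not necessarily the exact commands we used; the PID is a
placeholder to fill in) is:

```shell
# Find the PIDs of the hung MPI ranks on this node
pgrep -f our_housecode_executable

# Attach gdb non-interactively and print a backtrace of every thread,
# then detach without disturbing the process further
gdb -p <PID> -batch -ex "thread apply all bt"
```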

I hope this helps with debugging!

Thanks!

Eric