
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple -- part II
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-12-02 22:29:00


No surprise there - that's known behavior. As has been said, we hope to
extend the thread-multiple support in the 1.9 series.

On Mon, Dec 2, 2013 at 6:33 PM, Eric Chamberland <
Eric.Chamberland_at_[hidden]> wrote:

> Hi,
>
> I just opened a new "chapter" with the same subject. ;-)
>
> We are using Open MPI 1.6.5 (compiled with --enable-thread-multiple)
> with PETSc 3.4.3 (on the Colosse supercomputer:
> http://www.calculquebec.ca/en/resources/compute-servers/colosse). We
> observed a deadlock between threads inside the openib BTL.
>
> We successfully bypassed the deadlock in two different ways:
>
> #1- launching the code with "--mca btl ^openib"
>
> #2- compiling OpenMPI 1.6.5 *without* the "--enable-thread-multiple"
> option.
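For the archives, the two workarounds above correspond to invocations along these lines (the application name is a placeholder; the configure line is a sketch, not the reporter's exact command):

```shell
# Workaround #1: disable the openib BTL at run time
mpiexec --mca btl ^openib ./your_app

# Workaround #2: rebuild Open MPI 1.6.5 without thread-multiple support,
# i.e. simply omit --enable-thread-multiple when configuring
./configure --prefix=$HOME/openmpi-1.6.5
make all install
```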
>
> When the code hangs, here are some backtraces we captured (on different
> processes):
>
> #0 0x00007fb4a6a03795 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1 0x00007fb49db7ea7b in ?? () from /usr/lib64/libmlx4-rdmav2.so
> #2 0x00007fb4a878d469 in ibv_poll_cq () at /usr/include/infiniband/verbs.h:884
> #3 poll_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3563
> #4 progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3694
> #5 btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
> #6 0x00007fb4a8973d32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #7 0x00007fb4a87404f0 in opal_condition_wait (count=25695904, requests=0x100, statuses=0x7fff9b7f1320) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #8 ompi_request_default_wait_all (count=25695904, requests=0x100, statuses=0x7fff9b7f1320) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
>
>
>
>
> #0 0x00007f731d1100b8 in pthread_mutex_unlock () from /lib64/libpthread.so.0
> #1 0x00007f731ee9b3b7 in opal_mutex_unlock () at ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:123
> #2 progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3688
> #3 btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
> #4 0x00007f731f081d32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #5 0x00007f731ee4e4f0 in opal_condition_wait (count=25649104, requests=0x0, statuses=0x1875fd0) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #6 ompi_request_default_wait_all (count=25649104, requests=0x0, statuses=0x1875fd0) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
> #7 0x00007f731eec2644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x1875fd0, rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc, op=0x1875fd0, comm=0x5e80, module=0xca4ec20) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
> #8 0x00007f731eebe2ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1875fd0, rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc, op=0x1875fd0, comm=0x5e80, module=0x159d8330) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
> #9 0x00007f731ee5cad9 in PMPI_Allreduce (sendbuf=0x1875fd0, recvbuf=0x0, count=25649104, datatype=0x7f72ce8f80fc, op=0x1875fd0, comm=0x5e80) at pallreduce.c:105
>
>
>
> #0 opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:206
> #1 0x00007f8e3d8844f0 in opal_condition_wait (count=0, requests=0x0, statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #2 ompi_request_default_wait_all (count=0, requests=0x0, statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
> #3 0x00007f8e3d8f8644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x0, rbuf=0x0, count=1037994528, dtype=0x1, op=0x0, comm=0x60bb, module=0xcb86ce0) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
> #4 0x00007f8e3d8f42ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x0, rbuf=0x0, count=1037994528, dtype=0x1, op=0x0, comm=0x60bb, module=0x171d59a0) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
> #5 0x00007f8e3d892ad9 in PMPI_Allreduce (sendbuf=0x0, recvbuf=0x0, count=1037994528, datatype=0x1, op=0x0, comm=0x60bb) at pallreduce.c:105
>
>
>
> #0 0x00007f7ef7d0b258 in pthread_mutex_lock@plt () from /software/MPI/openmpi/1.6.5_intel/lib/libmpi.so.1
> #1 0x00007f7ef7d72377 in opal_mutex_lock () at ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:109
> #2 progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3650
> #3 btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
> #4 0x00007f7ef7f58d32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #5 0x00007f7ef7d254f0 in opal_condition_wait (count=25625488, requests=0x0, statuses=0x7f7ef8324208) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #6 ompi_request_default_wait_all (count=25625488, requests=0x0, statuses=0x7f7ef8324208) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
> #7 0x00007f7ef7d99644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x1870390, rbuf=0x0, count=-130924024, dtype=0x0, op=0x1874cb0, comm=0x60bc, module=0xca6a360) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
> #8 0x00007f7ef7d952ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1870390, rbuf=0x0, count=-130924024, dtype=0x0, op=0x1874cb0, comm=0x60bc, module=0x14512a20) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
> #9 0x00007f7ef7d33ad9 in PMPI_Allreduce (sendbuf=0x1870390, recvbuf=0x0, count=-130924024, datatype=0x0, op=0x1874cb0, comm=0x60bc) at pallreduce.c:105
>
>
>
>
> #0 0x00007f1fe7bcd0b8 in pthread_mutex_unlock () from /lib64/libpthread.so.0
> #1 0x00007f1fe99583b7 in opal_mutex_unlock () at ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:123
> #2 progress_one_device () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3688
> #3 btl_openib_component_progress () at ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
> #4 0x00007f1fe9b3ed32 in opal_progress () at ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #5 0x00007f1fe990b4f0 in opal_condition_wait (count=25659568, requests=0x0, statuses=0x18788b0) at ../../openmpi-1.6.5/opal/threads/condition.h:92
> #6 ompi_request_default_wait_all (count=25659568, requests=0x0, statuses=0x18788b0) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
> #7 0x00007f1fe997f644 in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x18788b0, rbuf=0x0, count=25659568, dtype=0x7f1f9949727c, op=0x18788b0, comm=0x3db6, module=0xccda900) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
> #8 0x00007f1fe997b2ec in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x18788b0, rbuf=0x0, count=25659568, dtype=0x7f1f9949727c, op=0x18788b0, comm=0x3db6, module=0x170dbf00) at ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
> #9 0x00007f1fe9919ad9 in PMPI_Allreduce (sendbuf=0x18788b0, recvbuf=0x0, count=25659568, datatype=0x7f1f9949727c, op=0x18788b0, comm=0x3db6) at pallreduce.c:105
>
> Attached is the "ompi_info -all" output.
>
> Here is the command line:
>
> "mpiexec -mca mpi_show_mca_params all -mca oob_tcp_peer_retries 1000 --output-filename PneuSurfaceLibre.out --timestamp-output --report-bindings -mca orte_num_sockets 2 -mca orte_num_cores 4 --bind-to-socket -npersocket 1 our_housecode_executable_based_on_petsc_343 and_parameters..."
>
> Hope this helps with debugging!
>
> Thanks!
>
> Eric
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>