Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI deadlocks and race conditions ?
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-05-14 20:06:07


François PELLEGRINI wrote:

>I sometimes run into deadlocks in OpenMPI (1.3.3a1r21206), when
>running my MPI+threaded PT-Scotch software.
>
So, are there multiple threads per process that perform message-passing
operations?

Other comments below.

>Luckily, the case
>is very small, with 4 procs only, so I have been able to investigate
>it a bit. It seems that matches between communications are not done
>properly on cloned communicators. In the end, I run into a case where
>an MPI_Waitall completes an MPI_Barrier on another proc. The bug is
>erratic but quite easy to reproduce, luckily too.
>
>To be sure, I ran my code under valgrind using helgrind, its
>race condition detection tool. It produced much output, most
>of which seems to be innocuous, yet I have some concerns about
>messages such as the following ones. The ==12**== were generated
>when running on 4 procs, while the ==83**== were generated
>when running on 2 procs:
>
>==8329== Possible data race during write of size 4 at 0x8882200
>==8329== at 0x508B315: sm_fifo_write (btl_sm.h:254)
>==8329== by 0x508B401: mca_btl_sm_send (btl_sm.c:811)
>==8329== by 0x5070A0C: mca_bml_base_send_status (bml.h:288)
>==8329== by 0x50708E6: mca_pml_ob1_send_request_start_copy (pml_ob1_sendreq.c:567)
>==8329== by 0x5064C30: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:363)
>==8329== by 0x5064A19: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:429)
>==8329== by 0x5064856: mca_pml_ob1_isend (pml_ob1_isend.c:87)
>==8329== by 0x5142C46: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:51)
>==8329== by 0x514F379: ompi_coll_tuned_barrier_intra_two_procs (coll_tuned_barrier.c:258)
>==8329== by 0x5143252: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:192)
>==8329== by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
>==8329== by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
>==8329== Old state: shared-readonly by threads #1, #7
>==8329== New state: shared-modified by threads #1, #7
>==8329== Reason: this thread, #1, holds no consistent locks
>==8329== Location 0x8882200 has never been protected by any lock
>
>
This seems to be where the "head" index is incremented in
sm_fifo_write(). I believe that function is only ever called via the
macro MCA_BTL_SM_FIFO_WRITE, which requires the writer to hold the
FIFO's head lock, so this would seem to be sufficiently protected. In
1.3.1 and earlier, a lock was required only for multithreaded programs.
Now, the writer *always* has to acquire the lock, since the FIFOs are
shared among senders.
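
To make that locking pattern concrete, here is a minimal sketch in C
with hypothetical names (fifo_t, fifo_write, FIFO_SIZE); it is not the
actual btl_sm source, just an illustration of a writer that always takes
the head lock before advancing the head index:

/* Illustrative only -- hypothetical names, not the Open MPI code. */
#include <pthread.h>
#include <stdbool.h>

#define FIFO_SIZE 128

typedef struct {
    void            *queue[FIFO_SIZE];
    volatile int     head;       /* advanced by senders                 */
    volatile int     tail;       /* advanced by the owning reader       */
    pthread_mutex_t  head_lock;  /* serializes concurrent senders       */
    pthread_mutex_t  tail_lock;  /* used only by a multithreaded reader */
} fifo_t;

/* Writer side, in the spirit of MCA_BTL_SM_FIFO_WRITE: take the head
 * lock, store the item, bump the head index, release the lock.  The
 * head increment is the write that helgrind flags above. */
static bool fifo_write(fifo_t *fifo, void *item)
{
    bool wrote = false;
    pthread_mutex_lock(&fifo->head_lock);
    int next = (fifo->head + 1) % FIFO_SIZE;
    if (next != fifo->tail) {            /* FIFO not full */
        fifo->queue[fifo->head] = item;
        fifo->head = next;
        wrote = true;
    }
    pthread_mutex_unlock(&fifo->head_lock);
    return wrote;
}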

>==1220== Possible data race during write of size 4 at 0x88CEF88
>==1220== at 0x508CD84: sm_fifo_read (btl_sm.h:272)
>==1220== by 0x508C864: mca_btl_sm_component_progress (btl_sm_component.c:391)
>==1220== by 0x41F72DF: opal_progress (opal_progress.c:207)
>==1220== by 0x40BD67D: opal_condition_wait (condition.h:85)
>==1220== by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
>==1220== by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
>==1220== by 0x514F07A: ompi_coll_tuned_barrier_intra_recursivedoubling (coll_tuned_barrier.c:174)
>==1220== by 0x51432A3: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:208)
>==1220== by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
>==1220== by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
>==1220== by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
>==1220== by 0x805EA43: kdgraphMapRbPart2 (kdgraph_map_rb_part.c:331)
>==1220== Old state: shared-readonly by threads #1, #7
>==1220== New state: shared-modified by threads #1, #7
>==1220== Reason: this thread, #1, holds no consistent locks
>==1220== Location 0x88CEF88 has never been protected by any lock
>
>
Here, the FIFO tail index is being incremented in sm_fifo_read(). I
believe this function is only ever called from
mca_btl_sm_component_progress(), which requires the reader to hold the
tail lock before reading the tail when the process is multithreaded. I
believe that requirement suffices, since only the reader/owner of the
FIFO ever reads the tail; the only possible contention is when that
reader/owner is itself multithreaded.
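
Again purely as an illustration (reusing the hypothetical fifo_t from
the sketch above), the reader side described here would look roughly
like this; the tail lock is taken only when the owning process is
itself multithreaded, since no other process ever reads this FIFO's
tail:

/* Illustrative only -- reader side, reusing the hypothetical fifo_t
 * from the sketch above.  Only the FIFO's owning process calls this,
 * so the tail lock matters only when that owner is multithreaded. */
static void *fifo_read(fifo_t *fifo, bool multithreaded)
{
    void *item = NULL;
    if (multithreaded) {
        pthread_mutex_lock(&fifo->tail_lock);
    }
    if (fifo->tail != fifo->head) {      /* FIFO not empty */
        item = fifo->queue[fifo->tail];
        /* The tail increment is the write that helgrind flags above. */
        fifo->tail = (fifo->tail + 1) % FIFO_SIZE;
    }
    if (multithreaded) {
        pthread_mutex_unlock(&fifo->tail_lock);
    }
    return item;
}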

>==1219== Possible data race during write of size 4 at 0x891BC8C
>==1219== at 0x508CD99: sm_fifo_read (btl_sm.h:273)
>==1219== by 0x508C864: mca_btl_sm_component_progress (btl_sm_component.c:391)
>==1219== by 0x41F72DF: opal_progress (opal_progress.c:207)
>==1219== by 0x40BD67D: opal_condition_wait (condition.h:85)
>==1219== by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
>==1219== by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
>==1219== by 0x514F07A: ompi_coll_tuned_barrier_intra_recursivedoubling (coll_tuned_barrier.c:174)
>==1219== by 0x51432A3: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:208)
>==1219== by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
>==1219== by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
>==1219== by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
>==1219== by 0x805EA43: kdgraphMapRbPart2 (kdgraph_map_rb_part.c:331)
>==1219== Old state: shared-readonly by threads #1, #7
>==1219== New state: shared-modified by threads #1, #7
>==1219== Reason: this thread, #1, holds no consistent locks
>==1219== Location 0x891BC8C has never been protected by any lock
>
>
This immediately follows the incrementing of the tail index and is
governed by the same tail lock when the process is multithreaded.

>==1220== Possible data race during write of size 4 at 0x4243A68
>==1220== at 0x41F72A7: opal_progress (opal_progress.c:186)
>==1220== by 0x40BD67D: opal_condition_wait (condition.h:85)
>==1220== by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
>==1220== by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
>==1220== by 0x514F07A: ompi_coll_tuned_barrier_intra_recursivedoubling (coll_tuned_barrier.c:174)
>==1220== by 0x51432A3: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:208)
>==1220== by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
>==1220== by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
>==1220== by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
>==1220== by 0x805EA43: kdgraphMapRbPart2 (kdgraph_map_rb_part.c:331)
>==1220== by 0x805EB86: _SCOTCHkdgraphMapRbPart (kdgraph_map_rb_part.c:421)
>==1220== by 0x8057713: _SCOTCHkdgraphMapSt (kdgraph_map_st.c:182)
>==1220== Old state: shared-readonly by threads #1, #7
>==1220== New state: shared-modified by threads #1, #7
>==1220== Reason: this thread, #1, holds no consistent locks
>==1220== Location 0x4243A68 has never been protected by any lock
>
>
I guess I won't venture any comments on the opal progress engine.

>==8328== Possible data race during write of size 4 at 0x4532318
>==8328== at 0x508A9B8: opal_atomic_lifo_pop (opal_atomic_lifo.h:111)
>==8328== by 0x508A69F: mca_btl_sm_alloc (btl_sm.c:612)
>==8328== by 0x5070571: mca_bml_base_alloc (bml.h:241)
>==8328== by 0x5070778: mca_pml_ob1_send_request_start_copy (pml_ob1_sendreq.c:506)
>==8328== by 0x5064C30: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:363)
>==8328== by 0x5064A19: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:429)
>==8328== by 0x5064856: mca_pml_ob1_isend (pml_ob1_isend.c:87)
>==8328== by 0x5142C46: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:51)
>==8328== by 0x514F379: ompi_coll_tuned_barrier_intra_two_procs (coll_tuned_barrier.c:258)
>==8328== by 0x5143252: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:192)
>==8328== by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
>==8328== by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
>==8328== Old state: shared-readonly by threads #1, #8
>==8328== New state: shared-modified by threads #1, #8
>==8328== Reason: this thread, #1, holds no consistent locks
>==8328== Location 0x4532318 has never been protected by any lock
>
>
Here, opal_atomic_lifo_pop is used to get an item off the sm eager free
list. The opal atomic LIFO operation seems to use atomic memory
operations for thread safety, but I'll let someone else vouch for that code.
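
For what it's worth, a lock-free pop of this sort is typically built on
a compare-and-swap retry loop. The sketch below uses hypothetical names
and GCC's __sync builtin standing in for the opal atomics; it is not
the Open MPI implementation, just an illustration of why helgrind,
which reasons only about locks, cannot recognize such an update as
synchronized even when it is atomic:

/* Illustrative only -- a CAS-based LIFO pop with hypothetical names;
 * GCC's __sync builtin stands in for Open MPI's own opal atomics. */
typedef struct lifo_item {
    struct lifo_item *next;
} lifo_item_t;

typedef struct {
    lifo_item_t * volatile head;
} lifo_t;

static lifo_item_t *lifo_pop(lifo_t *lifo)
{
    lifo_item_t *item;
    do {
        item = lifo->head;
        if (item == NULL) {
            return NULL;             /* list is empty */
        }
        /* Retry if another thread changed head in the meantime. */
    } while (!__sync_bool_compare_and_swap(&lifo->head, item, item->next));
    return item;
}

(The real opal code has more to worry about than this, e.g. the ABA
problem, but the point here is just the reliance on atomics rather than
locks.)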

>==8329== Possible data race during write of size 4 at 0x452F238
>==8329== at 0x5067FD3: recv_req_matched (pml_ob1_recvreq.h:219)
>==8329== by 0x5067D95: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:191)
>==8329== by 0x508C9BB: mca_btl_sm_component_progress (btl_sm_component.c:426)
>==8329== by 0x41F72DF: opal_progress (opal_progress.c:207)
>==8329== by 0x40BD67D: opal_condition_wait (condition.h:85)
>==8329== by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
>==8329== by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
>==8329== by 0x514F379: ompi_coll_tuned_barrier_intra_two_procs (coll_tuned_barrier.c:258)
>==8329== by 0x5143252: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:192)
>==8329== by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
>==8329== by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
>==8329== by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
>==8329== Old state: owned exclusively by thread #7
>==8329== New state: shared-modified by threads #1, #7
>==8329== Reason: this thread, #1, holds no locks at all
>
>
Dunno. Here, the PML is copying source and tag information out of a
match header into a status structure. I would think this code is okay
since the thread presumably owns both the receive request and the match
header. But I'll let someone who knows the PML speak up on this point.

>==8329== Possible data race during write of size 4 at 0x452F2DC
>==8329== at 0x40D5946: ompi_convertor_unpack (convertor.c:280)
>==8329== by 0x5067E78: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:215)
>==8329== by 0x508C9BB: mca_btl_sm_component_progress (btl_sm_component.c:426)
>==8329== by 0x41F72DF: opal_progress (opal_progress.c:207)
>==8329== by 0x40BD67D: opal_condition_wait (condition.h:85)
>==8329== by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
>==8329== by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
>==8329== by 0x514F379: ompi_coll_tuned_barrier_intra_two_procs (coll_tuned_barrier.c:258)
>==8329== by 0x5143252: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:192)
>==8329== by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
>==8329== by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
>==8329== by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
>==8329== Old state: owned exclusively by thread #7
>==8329== New state: shared-modified by threads #1, #7
>==8329== Reason: this thread, #1, holds no locks at all
>
>
It's unpacking message data. I would think this is okay, but someone
who understands the PML should say for sure.

>I guess the following are ok, but I provide them as a
>reference :
>
>==1220== Possible data race during write of size 4 at 0x8968780
>==1220== at 0x508A619: opal_atomic_unlock (atomic_impl.h:367)
>==1220== by 0x508B468: mca_btl_sm_send (btl_sm.c:811)
>==1220== by 0x5070A0C: mca_bml_base_send_status (bml.h:288)
>==1220== by 0x50708E6: mca_pml_ob1_send_request_start_copy (pml_ob1_sendreq.c:567)
>==1220== by 0x5064C30: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:363)
>==1220== by 0x5064A19: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:429)
>==1220== by 0x5064856: mca_pml_ob1_isend (pml_ob1_isend.c:87)
>==1220== by 0x5142C46: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:51)
>==1220== by 0x514F07A: ompi_coll_tuned_barrier_intra_recursivedoubling (coll_tuned_barrier.c:174)
>==1220== by 0x51432A3: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:208)
>==1220== by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
>==1220== by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
>==1220== Old state: shared-modified by threads #1, #7
>==1220== New state: shared-modified by threads #1, #7
>==1220== Reason: this thread, #1, holds no consistent locks
>==1220== Location 0x8968780 has never been protected by any lock
>
>
Unlock during sm FIFO write? Yes, I would think this is okay.

My comments aren't intended to give the code base my unqualified okay.
I'm only saying that I read through these stacks and the sm BTL code
that's called out looks okay to me.

>ompi_info says :
> Package: Open MPI pelegrin_at_brol Distribution
> Open MPI: 1.3.3a1r21206
> Open MPI SVN revision: r21206
> Open MPI release date: Unreleased developer copy
> Open RTE: 1.3.3a1r21206
> Open RTE SVN revision: r21206
> Open RTE release date: Unreleased developer copy
> OPAL: 1.3.3a1r21206
> OPAL SVN revision: r21206
> OPAL release date: Unreleased developer copy
> Ident string: 1.3.3a1r21206
> Prefix: /usr/local
> Configured architecture: i686-pc-linux-gnu
> Configure host: brol
> Configured by: pelegrin
> Configured on: Tue May 12 15:50:08 CEST 2009
> Configure host: brol
> Built by: pelegrin
> Built on: Tue May 12 16:17:34 CEST 2009
> Built host: brol
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: gfortran
> Fortran77 compiler abs: /usr/bin/gfortran
> Fortran90 compiler: gfortran
> Fortran90 compiler abs: /usr/bin/gfortran
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: yes, progress: no)
> Sparse Groups: no
> Internal debug support: yes
> MPI parameter check: always
>Memory profiling support: no
>Memory debugging support: yes
> libltdl support: yes
> Heterogeneous support: no
> mpirun default --prefix: no
> MPI I/O support: yes
> MPI_WTIME support: gettimeofday
>Symbol visibility support: yes
> FT Checkpoint support: no (checkpoint thread: no)
> MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
> MCA memchecker: valgrind (MCA v2.0, API v2.0, Component v1.3.3)
> MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
> MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
> MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
> MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
> MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
> MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
> MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
> MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
> MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
> MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
> MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
> MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
> MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
> MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
> MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
> MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
> MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
> MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
> MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
> MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
> MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
> MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
> MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
> MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
> MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
> MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
> MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
> MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
>
>Thanks in advance for any help / explanation,
>
> f.p.
>