
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] mpool rdma deadlock
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-11-02 21:20:17


Ewww.... yikes. This could definitely be an issue if we weren't
(multi-thread) careful when writing these portions of the code. :-(

On Oct 28, 2009, at 8:18 AM, Christopher Yeoh wrote:

> Hi,
>
> I've been investigating some OpenMPI deadlocks triggered by a test
> suite
> written to test the thread safety of MPI libraries. I'm using
> OpenMPI 1.3.3.
>
> One of the deadlocks I'm seeing looks like this (the sleep for frame
> 1 is something I
> inserted when a deadlock is detected so I can attach a debugger to
> all the nodes).
>
> #0 0xf7d9e680 in nanosleep () from /lib/power6x/libc.so.6
> #1 0xf7d9e408 in sleep () from /lib/power6x/libc.so.6
> #2 0x0ee22ae4 in opal_mutex_lock (m=0x10101d00) at ../../../../opal/
> threads/mutex_unix.h:114
> #3 0x0ee247f8 in mca_mpool_rdma_release_memory (mpool=0x10101a80,
> base=0xf2af0000, size=65536)
> at mpool_rdma_module.c:405
>
> Trying to take mpool->rcache->lock
>
> #4 0x0ff4fcac in mca_mpool_base_mem_cb (base=0xf2af0000,
> size=65536, cbdata=0x0, from_alloc=true)
> at base/mpool_base_mem_cb.c:52
> #5 0x0fccade0 in opal_mem_hooks_release_hook (buf=0xf2af0000,
> length=65536, from_alloc=true)
> at memoryhooks/memory.c:132
> #6 0x0fd176d8 in opal_mem_free_ptmalloc2_munmap (start=0xf2af0000,
> length=65536, from_alloc=1)
> at opal_ptmalloc2_munmap.c:74
> #7 0x0fd18268 in new_heap (size=196608, top_pad=131072) at arena.c:
> 552
> #8 0x0fd1b1cc in sYSMALLOc (nb=152, av=0xefb00010) at malloc.c:2944
> #9 0x0fd1dc2c in opal_memory_ptmalloc2_int_malloc (av=0xefb00010,
> bytes=144) at malloc.c:4319
> #10 0x0fd1bd80 in opal_memory_ptmalloc2_malloc (bytes=144) at
> malloc.c:3432
> #11 0x0fd1a968 in opal_memory_ptmalloc2_malloc_hook (sz=144,
> caller=0xee83678) at hooks.c:667
> #12 0xf7d73a94 in malloc () from /lib/power6x/libc.so.6
> #13 0x0ee83678 in opal_obj_new (cls=0xee956b8) at ../../../../opal/
> class/opal_object.h:473
> #14 0x0ee835bc in opal_obj_new_debug (type=0xee956b8, file=0xee84ef0
> "rcache_vma_tree.c", line=109)
> at ../../../../opal/class/opal_object.h:247
> #15 0x0ee8380c in mca_rcache_vma_new (vma_rcache=0x10101ce8,
> start=3940155392, end=3940220927)
> at rcache_vma_tree.c:109
> #16 0x0ee82f78 in mca_rcache_vma_tree_insert (vma_rcache=0x10101ce8,
> reg=0xefbfdc80, limit=0)
> at rcache_vma_tree.c:403
> #17 0x0ee8205c in mca_rcache_vma_insert (rcache=0x10101ce8,
> reg=0xefbfdc80, limit=0) at rcache_vma.c:94
> #18 0x0ee237e4 in mca_mpool_rdma_register (mpool=0x10101a80,
> addr=0xead90008, size=65536, flags=0,
> reg=0xf1f2e760) at mpool_rdma_module.c:250
>
> already took mpool->rcache->lock a bit earlier (~line 197)
>
> #19 0x0ec2a680 in mca_btl_openib_prepare_dst (btl=0x101061c8,
> endpoint=0x10153178, registration=0x0,
> convertor=0xef909b88, order=255 '\377', reserve=0,
> size=0x1014f70c, flags=6) at btl_openib.c:921
> #20 0x0ed07724 in mca_bml_base_prepare_dst (bml_btl=0x10154378,
> reg=0x0, conv=0xef909b88, order=255 '\377',
> reserve=0, size=0x1014f70c, flags=6, des=0xf1f2e7e8)
> at ../../../../ompi/mca/bml/bml.h:355
> #21 0x0ed0747c in mca_pml_ob1_recv_request_get_frag
> (frag=0x1014f680) at pml_ob1_recvreq.c:359
> #22 0x0ed07e38 in mca_pml_ob1_recv_request_progress_rget
> (recvreq=0xef909b00, btl=0x101061c8,
> segments=0x1023da10, num_segments=1) at pml_ob1_recvreq.c:527
> #23 0x0ed039b8 in mca_pml_ob1_recv_frag_match (btl=0x101061c8,
> hdr=0xf514b230, segments=0x1023da10,
> num_segments=1, type=67) at pml_ob1_recvfrag.c:644
> #24 0x0ed020b4 in mca_pml_ob1_recv_frag_callback_rget
> (btl=0x101061c8, tag=67 'C', des=0x1023d9b0,
> cbdata=0x0) at pml_ob1_recvfrag.c:275
> #25 0x0ec3703c in btl_openib_handle_incoming (openib_btl=0x101061c8,
> ep=0x10153178, frag=0x1023d9b0,
> byte_len=76) at btl_openib_component.c:2616
> #26 0x0ec3a66c in progress_one_device (device=0x10100e60) at
> btl_openib_component.c:3146
> #27 0x0ec3a870 in btl_openib_component_progress () at
> btl_openib_component.c:3186
> #28 0x0fccbf10 in opal_progress () at runtime/opal_progress.c:207
> #29 0x0fe9d15c in opal_condition_wait (c=0xffa98e0, m=0xffa9930)
> at ../opal/threads/condition.h:85
> #30 0x0fe9d7cc in ompi_request_default_wait_all (count=1,
> requests=0xf1f2eb00, statuses=0x0)
> at request/req_wait.c:270
> #31 0x0ea97af8 in ompi_coll_tuned_reduce_generic
> (sendbuf=0xeeaf0008, recvbuf=0xecca0008,
> original_count=1048576, datatype=0x10015f50, op=0x10016360,
> root=0, comm=0x102394d8,
> module=0x10239aa8, tree=0x10239ff0, count_by_segment=16384,
> max_outstanding_reqs=0)
> at coll_tuned_reduce.c:168
> #32 0x0ea98958 in ompi_coll_tuned_reduce_intra_pipeline
> (sendbuf=0xeeaf0008, recvbuf=0xecca0008,
> count=1048576, datatype=0x10015f50, op=0x10016360, root=0,
> comm=0x102394d8, module=0x10239aa8,
> segsize=65536, max_outstanding_reqs=0) at coll_tuned_reduce.c:400
> #33 0x0ea85f2c in ompi_coll_tuned_reduce_intra_dec_fixed
> (sendbuf=0xeeaf0008, recvbuf=0xecca0008,
> count=1048576, datatype=0x10015f50, op=0x10016360, root=0,
> comm=0x102394d8, module=0x10239aa8)
> at coll_tuned_decision_fixed.c:414
> #34 0x0ead4c4c in mca_coll_sync_reduce (sbuf=0xeeaf0008,
> rbuf=0xecca0008, count=1048576,
> dtype=0x10015f50, op=0x10016360, root=0, comm=0x102394d8,
> module=0x102399b0) at coll_sync_reduce.c:43
> #35 0x0fefc7dc in PMPI_Reduce (sendbuf=0xeeaf0008,
> recvbuf=0xecca0008, count=1048576,
> datatype=0x10015f50, op=0x10016360, root=0, comm=0x102394d8) at
> preduce.c:129
> #36 0x10004564 in reduce (thr_num=0x10233418) at mt_coll.c:804
> #37 0xf7e869b4 in start_thread () from /lib/power6x/libpthread.so.0
> #38 0xf7de13a4 in clone () from /lib/power6x/libc.so.6
>
> i.e. a thread is deadlocking itself. However, this problem only
> appears to happen when there are multiple threads running (maybe
> because of some memory pressure).
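>
> (Just to illustrate what the top frames amount to: assuming
> opal_mutex_lock() ends up on an ordinary non-recursive pthread mutex,
> a second lock from the same thread simply never returns. Minimal
> standalone sketch, not OMPI code:)
>
> #include <pthread.h>
>
> static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
>
> int main(void)
> {
>     pthread_mutex_lock(&lock);   /* first acquire, e.g. in register */
>     pthread_mutex_lock(&lock);   /* second acquire from the same
>                                     thread: hangs, like frame #2 */
>     pthread_mutex_unlock(&lock);
>     pthread_mutex_unlock(&lock);
>     return 0;
> }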
>
> From looking at the code, it appears to be unsafe to ever hold the
> mpool->rcache->lock while doing an operation that may allocate
> memory, as malloc may call back into the mpool rdma module and
> attempt to acquire the rcache lock again.
>
> However, the code seems to do that quite a bit (the above backtrace
> is just one example of the deadlocks I have seen).
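>
> To make the pattern concrete, here is a self-contained sketch of the
> cycle in the backtrace (all names invented for illustration, not the
> real OMPI functions): release_hook() stands in for
> mca_mpool_rdma_release_memory(), and hooked_malloc() stands in for
> malloc with the ptmalloc2 memory hooks installed, where allocating a
> new heap can munmap part of the mapping and fire the release hook:
>
> #include <pthread.h>
> #include <stdlib.h>
>
> static pthread_mutex_t rcache_lock = PTHREAD_MUTEX_INITIALIZER;
>
> /* stands in for the munmap hook -> mca_mpool_rdma_release_memory() */
> static void release_hook(void)
> {
>     pthread_mutex_lock(&rcache_lock);   /* frames #2-#3: re-acquire */
>     /* ...evict registrations covering the released range... */
>     pthread_mutex_unlock(&rcache_lock);
> }
>
> /* stands in for malloc with the memory hooks installed */
> static void *hooked_malloc(size_t sz)
> {
>     release_hook();                     /* frames #4-#12: malloc -> hooks */
>     return malloc(sz);
> }
>
> int main(void)
> {
>     pthread_mutex_lock(&rcache_lock);   /* frame #18: register, ~line 197 */
>     void *vma = hooked_malloc(144);     /* frames #13-#15: OBJ_NEW() */
>     free(vma);
>     pthread_mutex_unlock(&rcache_lock);
>     return 0;
> }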
>
> I'm hoping someone else can verify that this is indeed a problem, or
> tell me if I'm just doing something wrong (say, some config option
> I'm missing). It doesn't appear to be that easy to fix (e.g. paths
> that can currently call malloc would need some preallocation, and
> other areas would need quite a bit of rearrangement to be able to
> drop the rcache lock before doing anything that could call malloc).
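>
> For what it's worth, the preallocation I have in mind would look
> roughly like the sketch below (illustrative only, invented names, not
> a patch): do any allocation before taking the rcache lock, so even if
> malloc re-enters the memory hooks the lock is not yet held:
>
> #include <pthread.h>
> #include <stdlib.h>
>
> static pthread_mutex_t rcache_lock = PTHREAD_MUTEX_INITIALIZER;
>
> struct vma_item { void *start; void *end; };
>
> static int vma_insert(struct vma_item **out, void *start, void *end)
> {
>     /* preallocate outside the critical section */
>     struct vma_item *item = malloc(sizeof(*item));
>     if (NULL == item) {
>         return -1;
>     }
>
>     pthread_mutex_lock(&rcache_lock);
>     item->start = start;   /* real code would insert into the vma
>                               tree here */
>     item->end = end;
>     pthread_mutex_unlock(&rcache_lock);
>
>     *out = item;
>     return 0;
> }
>
> int main(void)
> {
>     struct vma_item *item = NULL;
>     return vma_insert(&item, (void *)0x1000, (void *)0x2000);
> }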
>
> Regards,
>
> Chris
> --
> cyeoh_at_[hidden]
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Jeff Squyres
jsquyres_at_[hidden]