Open MPI Development Mailing List Archives

Subject: [OMPI devel] mpool rdma deadlock
From: Christopher Yeoh (cyeoh_at_[hidden])
Date: 2009-10-28 08:18:28


I've been investigating some OpenMPI deadlocks triggered by a test suite
written to test the thread safety of MPI libraries. I'm using OpenMPI 1.3.3.

One of the deadlocks I'm seeing looks like this (the sleep in frame 1 is something I
inserted to run when a deadlock is detected, so that I can attach a debugger to all the nodes).

#0 0xf7d9e680 in nanosleep () from /lib/power6x/
#1 0xf7d9e408 in sleep () from /lib/power6x/
#2 0x0ee22ae4 in opal_mutex_lock (m=0x10101d00) at ../../../../opal/threads/mutex_unix.h:114
#3 0x0ee247f8 in mca_mpool_rdma_release_memory (mpool=0x10101a80, base=0xf2af0000, size=65536)
    at mpool_rdma_module.c:405

Trying to take mpool->rcache->lock

#4 0x0ff4fcac in mca_mpool_base_mem_cb (base=0xf2af0000, size=65536, cbdata=0x0, from_alloc=true)
    at base/mpool_base_mem_cb.c:52
#5 0x0fccade0 in opal_mem_hooks_release_hook (buf=0xf2af0000, length=65536, from_alloc=true)
    at memoryhooks/memory.c:132
#6 0x0fd176d8 in opal_mem_free_ptmalloc2_munmap (start=0xf2af0000, length=65536, from_alloc=1)
    at opal_ptmalloc2_munmap.c:74
#7 0x0fd18268 in new_heap (size=196608, top_pad=131072) at arena.c:552
#8 0x0fd1b1cc in sYSMALLOc (nb=152, av=0xefb00010) at malloc.c:2944
#9 0x0fd1dc2c in opal_memory_ptmalloc2_int_malloc (av=0xefb00010, bytes=144) at malloc.c:4319
#10 0x0fd1bd80 in opal_memory_ptmalloc2_malloc (bytes=144) at malloc.c:3432
#11 0x0fd1a968 in opal_memory_ptmalloc2_malloc_hook (sz=144, caller=0xee83678) at hooks.c:667
#12 0xf7d73a94 in malloc () from /lib/power6x/
#13 0x0ee83678 in opal_obj_new (cls=0xee956b8) at ../../../../opal/class/opal_object.h:473
#14 0x0ee835bc in opal_obj_new_debug (type=0xee956b8, file=0xee84ef0 "rcache_vma_tree.c", line=109)
    at ../../../../opal/class/opal_object.h:247
#15 0x0ee8380c in mca_rcache_vma_new (vma_rcache=0x10101ce8, start=3940155392, end=3940220927)
    at rcache_vma_tree.c:109
#16 0x0ee82f78 in mca_rcache_vma_tree_insert (vma_rcache=0x10101ce8, reg=0xefbfdc80, limit=0)
    at rcache_vma_tree.c:403
#17 0x0ee8205c in mca_rcache_vma_insert (rcache=0x10101ce8, reg=0xefbfdc80, limit=0) at rcache_vma.c:94
#18 0x0ee237e4 in mca_mpool_rdma_register (mpool=0x10101a80, addr=0xead90008, size=65536, flags=0,
    reg=0xf1f2e760) at mpool_rdma_module.c:250

mpool->rcache->lock was already taken a little earlier in this function (~line 197)

#19 0x0ec2a680 in mca_btl_openib_prepare_dst (btl=0x101061c8, endpoint=0x10153178, registration=0x0,
    convertor=0xef909b88, order=255 '\377', reserve=0, size=0x1014f70c, flags=6) at btl_openib.c:921
#20 0x0ed07724 in mca_bml_base_prepare_dst (bml_btl=0x10154378, reg=0x0, conv=0xef909b88, order=255 '\377',
    reserve=0, size=0x1014f70c, flags=6, des=0xf1f2e7e8) at ../../../../ompi/mca/bml/bml.h:355
#21 0x0ed0747c in mca_pml_ob1_recv_request_get_frag (frag=0x1014f680) at pml_ob1_recvreq.c:359
#22 0x0ed07e38 in mca_pml_ob1_recv_request_progress_rget (recvreq=0xef909b00, btl=0x101061c8,
    segments=0x1023da10, num_segments=1) at pml_ob1_recvreq.c:527
#23 0x0ed039b8 in mca_pml_ob1_recv_frag_match (btl=0x101061c8, hdr=0xf514b230, segments=0x1023da10,
    num_segments=1, type=67) at pml_ob1_recvfrag.c:644
#24 0x0ed020b4 in mca_pml_ob1_recv_frag_callback_rget (btl=0x101061c8, tag=67 'C', des=0x1023d9b0,
    cbdata=0x0) at pml_ob1_recvfrag.c:275
#25 0x0ec3703c in btl_openib_handle_incoming (openib_btl=0x101061c8, ep=0x10153178, frag=0x1023d9b0,
    byte_len=76) at btl_openib_component.c:2616
#26 0x0ec3a66c in progress_one_device (device=0x10100e60) at btl_openib_component.c:3146
#27 0x0ec3a870 in btl_openib_component_progress () at btl_openib_component.c:3186
#28 0x0fccbf10 in opal_progress () at runtime/opal_progress.c:207
#29 0x0fe9d15c in opal_condition_wait (c=0xffa98e0, m=0xffa9930) at ../opal/threads/condition.h:85
#30 0x0fe9d7cc in ompi_request_default_wait_all (count=1, requests=0xf1f2eb00, statuses=0x0)
    at request/req_wait.c:270
#31 0x0ea97af8 in ompi_coll_tuned_reduce_generic (sendbuf=0xeeaf0008, recvbuf=0xecca0008,
    original_count=1048576, datatype=0x10015f50, op=0x10016360, root=0, comm=0x102394d8,
    module=0x10239aa8, tree=0x10239ff0, count_by_segment=16384, max_outstanding_reqs=0)
    at coll_tuned_reduce.c:168
#32 0x0ea98958 in ompi_coll_tuned_reduce_intra_pipeline (sendbuf=0xeeaf0008, recvbuf=0xecca0008,
    count=1048576, datatype=0x10015f50, op=0x10016360, root=0, comm=0x102394d8, module=0x10239aa8,
    segsize=65536, max_outstanding_reqs=0) at coll_tuned_reduce.c:400
#33 0x0ea85f2c in ompi_coll_tuned_reduce_intra_dec_fixed (sendbuf=0xeeaf0008, recvbuf=0xecca0008,
    count=1048576, datatype=0x10015f50, op=0x10016360, root=0, comm=0x102394d8, module=0x10239aa8)
    at coll_tuned_decision_fixed.c:414
#34 0x0ead4c4c in mca_coll_sync_reduce (sbuf=0xeeaf0008, rbuf=0xecca0008, count=1048576,
    dtype=0x10015f50, op=0x10016360, root=0, comm=0x102394d8, module=0x102399b0) at coll_sync_reduce.c:43
#35 0x0fefc7dc in PMPI_Reduce (sendbuf=0xeeaf0008, recvbuf=0xecca0008, count=1048576,
    datatype=0x10015f50, op=0x10016360, root=0, comm=0x102394d8) at preduce.c:129
#36 0x10004564 in reduce (thr_num=0x10233418) at mt_coll.c:804
#37 0xf7e869b4 in start_thread () from /lib/power6x/
#38 0xf7de13a4 in clone () from /lib/power6x/

i.e. a thread is deadlocking against itself. However, this problem only appears to happen
when multiple threads are running (possibly because of memory pressure).

From looking at the code, it appears to be unsafe to ever hold mpool->rcache->lock
while doing an operation that may allocate memory, as that may cause malloc to
call back into the mpool rdma module and attempt to acquire the rcache lock again.
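
The re-entrancy described above can be reproduced in isolation. The following is only a minimal sketch, not Open MPI code: rcache_lock, release_memory_cb() and register_region() are hypothetical stand-ins for mpool->rcache->lock, the release callback and the registration path. It assumes an error-checking pthread mutex, so the second lock attempt returns EDEADLK instead of hanging the way a normal mutex would.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_ERRORCHECK under strict -std= modes */
#endif
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for mpool->rcache->lock. */
static pthread_mutex_t rcache_lock;

/* Create the lock as an error-checking mutex so that a relock by the
 * owning thread is reported rather than blocking forever. */
static int init_rcache_lock(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) {
        rc = pthread_mutex_init(&rcache_lock, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical stand-in for the release callback reached from the
 * malloc hooks: it needs the same lock. */
static int release_memory_cb(void)
{
    int rc = pthread_mutex_lock(&rcache_lock);
    if (rc == 0) {
        pthread_mutex_unlock(&rcache_lock);
    }
    return rc;
}

/* Hypothetical stand-in for a registration path that allocates while
 * holding the lock. The direct call models malloc -> munmap -> release hook. */
static int register_region(void)
{
    int rc;
    pthread_mutex_lock(&rcache_lock);
    rc = release_memory_cb();
    pthread_mutex_unlock(&rcache_lock);
    return rc;
}
```

Calling init_rcache_lock() and then register_region() from a single thread returns EDEADLK from the inner lock attempt. With a default (non-error-checking) mutex the same call sequence would typically block forever, which matches frame 2 of the backtrace.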

However the code seems to do that quite a bit (the above backtrace is just one example of
deadlocks I have seen).

I'm hoping someone else can confirm whether this is indeed a problem, or whether I'm just doing
something wrong (say, some config option I'm missing). It doesn't appear to be that easy to fix
(e.g. it would need some preallocation added for paths that can currently call malloc,
and other areas would need quite a bit of rearrangement to be able to drop the rcache lock before
doing something that could call malloc).
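
As a rough illustration of the preallocation idea, the sketch below allocates the node before taking the lock, so that nothing inside the critical section can call malloc. This is only a sketch under assumptions: struct vma_node, cache_lock, insert_region() and count_regions() are hypothetical names, and the real rcache code keeps a tree rather than this simple list.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a VMA cache entry. */
struct vma_node {
    unsigned long start;
    unsigned long end;
    struct vma_node *next;
};

/* Hypothetical stand-in for the rcache lock and the structure it protects. */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static struct vma_node *vma_list = NULL;

/* Insert a region. The allocation happens BEFORE the lock is taken, so a
 * malloc hook that needs cache_lock cannot be entered while it is held. */
static int insert_region(unsigned long start, unsigned long end)
{
    struct vma_node *node = malloc(sizeof *node);
    if (node == NULL) {
        return -1;
    }
    node->start = start;
    node->end = end;

    pthread_mutex_lock(&cache_lock);
    node->next = vma_list;      /* pointer updates only, no allocation */
    vma_list = node;
    pthread_mutex_unlock(&cache_lock);
    return 0;
}

/* Count the entries under the lock (again without allocating). */
static size_t count_regions(void)
{
    size_t n = 0;
    pthread_mutex_lock(&cache_lock);
    for (struct vma_node *p = vma_list; p != NULL; p = p->next) {
        n++;
    }
    pthread_mutex_unlock(&cache_lock);
    return n;
}
```

This only covers paths where the amount of memory needed is known before the lock is taken. Paths that discover under the lock how many nodes they need would require the rearrangement mentioned above.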