Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] mpool rdma deadlock - malloc implementation issue
From: Christopher Yeoh (cyeoh_at_[hidden])
Date: 2009-11-16 23:54:09


Hi,

Just following up on this. I had a closer look at this part of the
backtrace, which is common to most of the deadlocks:

#6 0x0fd176d8 in opal_mem_free_ptmalloc2_munmap
   (start=0xf2af0000, length=65536, from_alloc=1)
    at opal_ptmalloc2_munmap.c:74
#7 0x0fd18268 in new_heap (size=196608, top_pad=131072)
    at arena.c:552
#8 0x0fd1b1cc in sYSMALLOc (nb=152, av=0xefb00010) at malloc.c:2944

I see from previous discussion on the mailing list that the thread
safety of parts of the malloc library has come up before. In this case
a memory allocation for a thread requires the creation of a new arena
(I think the arena code is only enabled for threads, which explains why
this problem doesn't come up otherwise).

In the new_heap (arena.c) call there is:

  /* A memory region aligned to a multiple of HEAP_MAX_SIZE is needed.
     No swap space needs to be reserved for the following large
     mapping (on Linux, this is the case for all non-writable mappings
     anyway). */
  p1 = (char *)MMAP(0, HEAP_MAX_SIZE<<1, PROT_NONE,
                    MAP_PRIVATE|MAP_NORESERVE);
  if(p1 != MAP_FAILED) {
    p2 = (char *)(((unsigned long)p1 + (HEAP_MAX_SIZE-1))
                  & ~(HEAP_MAX_SIZE-1));
    ul = p2 - p1;
    munmap(p1, ul);
    munmap(p2 + HEAP_MAX_SIZE, HEAP_MAX_SIZE - ul);
  } else {

I.e., it allocates an area for the heap which is larger than required,
to ensure appropriate alignment, and then munmaps the parts which
aren't needed.
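
As a self-contained illustration of that over-allocate-and-trim trick,
here is a minimal sketch. It is not the ptmalloc2 code:
DEMO_HEAP_MAX_SIZE and demo_aligned_reserve are made-up names, the size
is arbitrary, and Linux mmap flags are assumed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_ANONYMOUS / MAP_NORESERVE */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Illustrative value only; ptmalloc2 defines its own HEAP_MAX_SIZE.
   Must be a power of two for the mask arithmetic below. */
#define DEMO_HEAP_MAX_SIZE ((size_t)1 << 20)

/* Reserve DEMO_HEAP_MAX_SIZE bytes aligned to DEMO_HEAP_MAX_SIZE by
   mapping twice as much and unmapping the unaligned head and the
   leftover tail. Returns NULL if the initial mapping fails. */
static void *demo_aligned_reserve(void)
{
    char *p1 = mmap(NULL, DEMO_HEAP_MAX_SIZE << 1, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p1 == MAP_FAILED)
        return NULL;

    /* Round p1 up to the next multiple of DEMO_HEAP_MAX_SIZE. */
    char *p2 = (char *)(((uintptr_t)p1 + (DEMO_HEAP_MAX_SIZE - 1))
                        & ~(uintptr_t)(DEMO_HEAP_MAX_SIZE - 1));
    size_t ul = (size_t)(p2 - p1);      /* bytes wasted before p2 */
    assert(ul < DEMO_HEAP_MAX_SIZE);

    if (ul > 0)
        munmap(p1, ul);                 /* trim the unaligned head */
    /* Trim the tail; its length is always non-zero because
       ul < DEMO_HEAP_MAX_SIZE. */
    munmap(p2 + DEMO_HEAP_MAX_SIZE, DEMO_HEAP_MAX_SIZE - ul);
    return p2;
}
```

The address returned by demo_aligned_reserve() has all bits below
DEMO_HEAP_MAX_SIZE clear. Both trimming munmap calls act on pages that
were only just mapped and never handed out to anyone.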

The problem is that munmap is intercepted, and the intercept calls back
into the mpool code, which then deadlocks. Since none of this memory has
been registered, I believe it would be safe in these specific munmap
cases to call the real munmap directly rather than go through the
intercept which causes the deadlock.
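
A minimal sketch of the idea, not the actual patch: every demo_* name
below is hypothetical, demo_rcache_lock merely stands in for
mpool->rcache->lock, and Linux is assumed (the "real" munmap is reached
via syscall()). The hook uses trylock so that the sketch reports the
re-entry instead of hanging.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for syscall() */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical stand-in for mpool->rcache->lock (non-recursive). */
static pthread_mutex_t demo_rcache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Counts how often the hook found the lock already held. */
static int demo_hook_would_deadlock = 0;

/* Hypothetical stand-in for the release hook, which needs the cache
   lock. A plain lock() here would block forever if the calling thread
   already holds the lock; trylock lets the demo record that instead. */
static void demo_release_hook(void *start, size_t length)
{
    (void)start;
    (void)length;
    int rc = pthread_mutex_trylock(&demo_rcache_lock);
    if (rc == EBUSY) {
        demo_hook_would_deadlock++;
        return;
    }
    assert(rc == 0);
    /* ... drop cached registrations overlapping the range ... */
    pthread_mutex_unlock(&demo_rcache_lock);
}

/* Intercepted path: run the hook, then unmap. */
static int demo_munmap_hooked(void *start, size_t length)
{
    demo_release_hook(start, length);
    return (int)syscall(SYS_munmap, start, length);
}

/* Direct path for allocator-internal trimming of never-registered
   memory: skip the hook entirely. */
static int demo_munmap_direct(void *start, size_t length)
{
    return (int)syscall(SYS_munmap, start, length);
}
```

With demo_rcache_lock held by the caller, demo_munmap_hooked() bumps
demo_hook_would_deadlock (the real code would hang at that point),
while demo_munmap_direct() never touches the lock.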

I've done some testing (on trunk) and so far it seems ok - can anyone
see any problems with this change?

Regards,

Chris

On Mon, 2 Nov 2009 21:20:17 -0500
Jeff Squyres <jsquyres_at_[hidden]> wrote:

> Ewww.... yikes. This could definitely be an issue if we weren't
> (multi-thread) careful when writing these portions of the code. :-(
>
>
> On Oct 28, 2009, at 8:18 AM, Christopher Yeoh wrote:
>
> > Hi,
> >
> > I've been investigating some OpenMPI deadlocks triggered by a test
> > suite written to test the thread safety of MPI libraries. I'm using
> > OpenMPI 1.3.3.
> >
> > One of the deadlocks I'm seeing looks like this (the sleep in
> > frame 1 is something I inserted so that, when a deadlock is detected,
> > I can attach a debugger to all the nodes).
> >
> > #0 0xf7d9e680 in nanosleep () from /lib/power6x/libc.so.6
> > #1 0xf7d9e408 in sleep () from /lib/power6x/libc.so.6
> > #2 0x0ee22ae4 in opal_mutex_lock (m=0x10101d00)
> > at ../../../../opal/threads/mutex_unix.h:114
> > #3 0x0ee247f8 in mca_mpool_rdma_release_memory (mpool=0x10101a80,
> > base=0xf2af0000, size=65536)
> > at mpool_rdma_module.c:405
> >
> > Trying to take mpool->rcache->lock
> >
> > #4 0x0ff4fcac in mca_mpool_base_mem_cb (base=0xf2af0000,
> > size=65536, cbdata=0x0, from_alloc=true)
> > at base/mpool_base_mem_cb.c:52
> > #5 0x0fccade0 in opal_mem_hooks_release_hook (buf=0xf2af0000,
> > length=65536, from_alloc=true)
> > at memoryhooks/memory.c:132
> > #6 0x0fd176d8 in opal_mem_free_ptmalloc2_munmap
> > (start=0xf2af0000, length=65536, from_alloc=1)
> > at opal_ptmalloc2_munmap.c:74
> > #7 0x0fd18268 in new_heap (size=196608, top_pad=131072)
> > at arena.c:552
> > #8 0x0fd1b1cc in sYSMALLOc (nb=152, av=0xefb00010) at malloc.c:2944
> > #9 0x0fd1dc2c in opal_memory_ptmalloc2_int_malloc (av=0xefb00010,
> > bytes=144) at malloc.c:4319
> > #10 0x0fd1bd80 in opal_memory_ptmalloc2_malloc (bytes=144) at
> > malloc.c:3432
> > #11 0x0fd1a968 in opal_memory_ptmalloc2_malloc_hook (sz=144,
> > caller=0xee83678) at hooks.c:667
> > #12 0xf7d73a94 in malloc () from /lib/power6x/libc.so.6
> > #13 0x0ee83678 in opal_obj_new (cls=0xee956b8)
> > at ../../../../opal/class/opal_object.h:473
> > #14 0x0ee835bc in opal_obj_new_debug (type=0xee956b8,
> > file=0xee84ef0 "rcache_vma_tree.c", line=109)
> > at ../../../../opal/class/opal_object.h:247
> > #15 0x0ee8380c in mca_rcache_vma_new (vma_rcache=0x10101ce8,
> > start=3940155392, end=3940220927)
> > at rcache_vma_tree.c:109
> > #16 0x0ee82f78 in mca_rcache_vma_tree_insert
> > (vma_rcache=0x10101ce8, reg=0xefbfdc80, limit=0)
> > at rcache_vma_tree.c:403
> > #17 0x0ee8205c in mca_rcache_vma_insert (rcache=0x10101ce8,
> > reg=0xefbfdc80, limit=0) at rcache_vma.c:94
> > #18 0x0ee237e4 in mca_mpool_rdma_register (mpool=0x10101a80,
> > addr=0xead90008, size=65536, flags=0,
> > reg=0xf1f2e760) at mpool_rdma_module.c:250
> >
> > Already took mpool->rcache->lock a bit earlier (~line 197)
> >
> > #19 0x0ec2a680 in mca_btl_openib_prepare_dst (btl=0x101061c8,
> > endpoint=0x10153178, registration=0x0,
> > convertor=0xef909b88, order=255 '\377', reserve=0,
> > size=0x1014f70c, flags=6) at btl_openib.c:921
> > #20 0x0ed07724 in mca_bml_base_prepare_dst (bml_btl=0x10154378,
> > reg=0x0, conv=0xef909b88, order=255 '\377',
> > reserve=0, size=0x1014f70c, flags=6, des=0xf1f2e7e8)
> > at ../../../../ompi/mca/bml/bml.h:355
> > #21 0x0ed0747c in mca_pml_ob1_recv_request_get_frag
> > (frag=0x1014f680) at pml_ob1_recvreq.c:359
> > #22 0x0ed07e38 in mca_pml_ob1_recv_request_progress_rget
> > (recvreq=0xef909b00, btl=0x101061c8,
> > segments=0x1023da10, num_segments=1) at pml_ob1_recvreq.c:527
> > #23 0x0ed039b8 in mca_pml_ob1_recv_frag_match (btl=0x101061c8,
> > hdr=0xf514b230, segments=0x1023da10,
> > num_segments=1, type=67) at pml_ob1_recvfrag.c:644
> > #24 0x0ed020b4 in mca_pml_ob1_recv_frag_callback_rget
> > (btl=0x101061c8, tag=67 'C', des=0x1023d9b0,
> > cbdata=0x0) at pml_ob1_recvfrag.c:275
> > #25 0x0ec3703c in btl_openib_handle_incoming
> > (openib_btl=0x101061c8, ep=0x10153178, frag=0x1023d9b0,
> > byte_len=76) at btl_openib_component.c:2616
> > #26 0x0ec3a66c in progress_one_device (device=0x10100e60) at
> > btl_openib_component.c:3146
> > #27 0x0ec3a870 in btl_openib_component_progress () at
> > btl_openib_component.c:3186
> > #28 0x0fccbf10 in opal_progress () at runtime/opal_progress.c:207
> > #29 0x0fe9d15c in opal_condition_wait (c=0xffa98e0, m=0xffa9930)
> > at ../opal/threads/condition.h:85
> > #30 0x0fe9d7cc in ompi_request_default_wait_all (count=1,
> > requests=0xf1f2eb00, statuses=0x0)
> > at request/req_wait.c:270
> > #31 0x0ea97af8 in ompi_coll_tuned_reduce_generic
> > (sendbuf=0xeeaf0008, recvbuf=0xecca0008,
> > original_count=1048576, datatype=0x10015f50, op=0x10016360,
> > root=0, comm=0x102394d8,
> > module=0x10239aa8, tree=0x10239ff0, count_by_segment=16384,
> > max_outstanding_reqs=0)
> > at coll_tuned_reduce.c:168
> > #32 0x0ea98958 in ompi_coll_tuned_reduce_intra_pipeline
> > (sendbuf=0xeeaf0008, recvbuf=0xecca0008,
> > count=1048576, datatype=0x10015f50, op=0x10016360, root=0,
> > comm=0x102394d8, module=0x10239aa8,
> > segsize=65536, max_outstanding_reqs=0) at
> > coll_tuned_reduce.c:400
> > #33 0x0ea85f2c in ompi_coll_tuned_reduce_intra_dec_fixed (sendbuf=0xeeaf0008,
> > recvbuf=0xecca0008, count=1048576, datatype=0x10015f50,
> > op=0x10016360, root=0, comm=0x102394d8, module=0x10239aa8)
> > at coll_tuned_decision_fixed.c:414
> > #34 0x0ead4c4c in mca_coll_sync_reduce (sbuf=0xeeaf0008,
> > rbuf=0xecca0008, count=1048576,
> > dtype=0x10015f50, op=0x10016360, root=0, comm=0x102394d8,
> > module=0x102399b0) at coll_sync_reduce.c:43
> > #35 0x0fefc7dc in PMPI_Reduce (sendbuf=0xeeaf0008,
> > recvbuf=0xecca0008, count=1048576,
> > datatype=0x10015f50, op=0x10016360, root=0, comm=0x102394d8)
> > at preduce.c:129
> > #36 0x10004564 in reduce (thr_num=0x10233418) at mt_coll.c:804
> > #37 0xf7e869b4 in start_thread () from /lib/power6x/libpthread.so.0
> > #38 0xf7de13a4 in clone () from /lib/power6x/libc.so.6
> >
> > I.e., a thread is deadlocking itself. However, this problem only
> > appears to happen when there are multiple threads running (maybe
> > because of some memory pressure).
> >
> > From looking at the code, it appears to be unsafe to ever hold
> > mpool->rcache->lock when doing an operation that may cause a memory
> > allocation, as that may cause malloc to call back into the mpool rdma
> > module and attempt to acquire the rcache lock again.
> >
> > However, the code seems to do that quite a bit (the above backtrace
> > is just one example of the deadlocks I have seen).
> >
> > I'm hoping someone else can verify that this is indeed a problem,
> > or tell me if I'm just doing something wrong (say, some config option
> > I'm missing). It doesn't appear to be that easy to fix (e.g. it would
> > need some preallocation for paths that could currently call malloc,
> > and other areas would need quite a bit of rearrangement to be able to
> > drop the rcache lock before doing something that could call malloc).
> >
> > Regards,
> >
> > Chris
> > --
> > cyeoh_at_[hidden]
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
>
>

-- 
cyeoh_at_[hidden]