Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] One-sided bugs
From: Jed Brown (jed_at_[hidden])
Date: 2012-12-30 21:14:13


I've resolved the problem in a satisfactory way by circumventing one-sided
entirely. I.e., this issue is finally closed:

https://bitbucket.org/petsc/petsc-dev/issue/9/implement-petscsf-without-one-sided

Users can proceed anyway using the run-time option
-acknowledge_ompi_onesided_bug, which will also be a convenient way to test
an eventual fix (beyond the reduced test cases that have been sitting in
your bug tracker for several years). (This is only relevant with -sf_type
window; the default no longer uses one-sided.)

I would still like to encourage Open MPI to deliver an error message in
this known broken case instead of silently stomping all over the user's
memory.

On Tue, Sep 11, 2012 at 2:23 PM, Jed Brown <jed_at_[hidden]> wrote:

> *Bump*
>
> There doesn't seem to have been any progress on this. Can you at least
> have an error message saying that Open MPI one-sided does not work with
> datatypes instead of silently causing wanton corruption and deadlock?
>
>
> On Thu, Dec 22, 2011 at 4:17 PM, Jed Brown <jed_at_[hidden]> wrote:
>
>> [Forgot the attachment.]
>>
>>
>> On Thu, Dec 22, 2011 at 15:16, Jed Brown <jed_at_[hidden]> wrote:
>>
>>> I wrote a new communication layer that we are evaluating for use in mesh
>>> management and PDE solvers, but it is based on MPI-2 one-sided operations
>>> (and will eventually benefit from some of the MPI-3 one-sided proposals,
>>> especially MPI_Fetch_and_op() and dynamic windows). All the basic
>>> functionality works well with MPICH2, but I have run into some Open MPI
>>> bugs regarding one-sided operations with composite data types. This email
>>> provides a reduced test case for two such bugs. I see that there are also
>>> some existing serious-looking bug reports regarding one-sided operations,
>>> but they are getting pretty old now and haven't seen action in a while.
>>>
>>> https://svn.open-mpi.org/trac/ompi/ticket/2656
>>> https://svn.open-mpi.org/trac/ompi/ticket/1905
>>>
>>> Is there a plan for resolving these in the near future?
>>>
>>> Is anyone using Open MPI for serious work with one-sided operations?
>>>
>>>
>>> Bugs I am reporting:
>>>
>>> *1.* If an MPI_Win is used with an MPI_Datatype, even if the MPI_Win
>>> operation has completed, I get an invalid free when MPI_Type_free() is
>>> called before MPI_Win_free(). Since MPI_Type_free() is only supposed to
>>> mark the datatype for deletion, the implementation should properly manage
>>> reference counting. If you run the attached code with
>>>
>>> $ mpiexec -n 2 ./a.out 1
>>>
>>> (which performs only part of the communication described for the second
>>> bug, below), you can see the invalid free on rank 1, with the stack still
>>> in MPI_Win_fence():
>>>
>>> (gdb) bt
>>> #0 0x00007ffff7288905 in raise () from /lib/libc.so.6
>>> #1 0x00007ffff7289d7b in abort () from /lib/libc.so.6
>>> #2 0x00007ffff72c147e in __libc_message () from /lib/libc.so.6
>>> #3 0x00007ffff72c7396 in malloc_printerr () from /lib/libc.so.6
>>> #4 0x00007ffff72cb26c in free () from /lib/libc.so.6
>>> #5 0x00007ffff7a5aaa8 in ompi_datatype_release_args (pData=0x845010) at
>>> ompi_datatype_args.c:414
>>> #6 0x00007ffff7a5b0ea in __ompi_datatype_release (datatype=0x845010) at
>>> ompi_datatype_create.c:47
>>> #7 0x00007ffff218e772 in opal_obj_run_destructors (object=0x845010) at
>>> ../../../../opal/class/opal_object.h:448
>>> #8 ompi_osc_rdma_replyreq_free (replyreq=0x680a80) at
>>> osc_rdma_replyreq.h:136
>>> #9 ompi_osc_rdma_replyreq_send_cb (btl=0x7ffff3680ce0,
>>> endpoint=<optimized out>, descriptor=0x837b00, status=<optimized out>) at
>>> osc_rdma_data_move.c:691
>>> #10 0x00007ffff347f38f in mca_btl_sm_component_progress () at
>>> btl_sm_component.c:645
>>> #11 0x00007ffff7b1f80a in opal_progress () at runtime/opal_progress.c:207
>>> #12 0x00007ffff21977c5 in opal_condition_wait (m=<optimized out>,
>>> c=0x842ee0) at ../../../../opal/threads/condition.h:99
>>> #13 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at
>>> osc_rdma_sync.c:207
>>> #14 0x00007ffff7a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at
>>> pwin_fence.c:60
>>> #15 0x00000000004010d8 in main (argc=2, argv=0x7fffffffd508) at win.c:60
>>>
>>> Meanwhile, rank 0 has already freed the datatype and is waiting in
>>> MPI_Win_free():
>>> (gdb) bt
>>> #0 0x00007ffff7312107 in sched_yield () from /lib/libc.so.6
>>> #1 0x00007ffff7b1f82b in opal_progress () at runtime/opal_progress.c:220
>>> #2 0x00007ffff7a53fe4 in opal_condition_wait (m=<optimized out>,
>>> c=<optimized out>) at ../opal/threads/condition.h:99
>>> #3 ompi_request_default_wait_all (count=2, requests=0x7fffffffd210,
>>> statuses=0x7fffffffd1e0) at request/req_wait.c:263
>>> #4 0x00007ffff25b8d71 in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0,
>>> scount=0, sdatatype=0x7ffff7dba840, dest=1, stag=-16, recvbuf=<optimized
>>> out>, rcount=0, rdatatype=0x7ffff7dba840, source=1, rtag=-16,
>>> comm=0x8431a0, status=0x0) at coll_tuned_util.c:54
>>> #5 0x00007ffff25c2de2 in ompi_coll_tuned_barrier_intra_two_procs
>>> (comm=<optimized out>, module=<optimized out>) at coll_tuned_barrier.c:256
>>> #6 0x00007ffff25b92ab in ompi_coll_tuned_barrier_intra_dec_fixed
>>> (comm=0x8431a0, module=0x844980) at coll_tuned_decision_fixed.c:190
>>> #7 0x00007ffff2186248 in ompi_osc_rdma_module_free (win=0x842170) at
>>> osc_rdma.c:46
>>> #8 0x00007ffff7a58a44 in ompi_win_free (win=0x842170) at win/win.c:150
>>> #9 0x00007ffff7a8a0dd in PMPI_Win_free (win=0x7fffffffd408) at
>>> pwin_free.c:56
>>> #10 0x0000000000401195 in main (argc=2, argv=0x7fffffffd508) at win.c:69
>>>
>>>
>>> *2.* This appears to be more fundamental and perhaps much harder to
>>> fix. The attached code sets up the following graph
>>>
>>> rank 0:
>>> 0 -> (1,0)
>>> 1 -> nothing
>>> 2 -> (1,1)
>>>
>>> rank 1:
>>> 0 -> (0,0)
>>> 1 -> (0,2)
>>> 2 -> (0,1)
>>>
>>> We pull over this graph using two calls to MPI_Get(), each with
>>> composite data types defining what to pull into the first two slots, and
>>> what to put into the third slot. It is Valgrind-clean with MPICH2, and
>>> produces the following:
>>>
>>> $ mpiexec.hydra -n 2 ./a.out 2
>>> [0] provided [100,101,102] got [200, -2,201]
>>> [1] provided [200,201,202] got [100,102,101]
>>>
>>> With Open MPI, I see
>>>
>>> a.out: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr)
>>> (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct
>>> malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >=
>>> (unsigned long)((((__builtin_offsetof (struct malloc_chunk,
>>> fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) -
>>> 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) ==
>>> 0)' failed.
>>>
>>> on both ranks, with rank 0 at
>>>
>>> (gdb) bt
>>> #0 0x00007ffff7288905 in raise () from /lib/libc.so.6
>>> #1 0x00007ffff7289d7b in abort () from /lib/libc.so.6
>>> #2 0x00007ffff72c675d in __malloc_assert () from /lib/libc.so.6
>>> #3 0x00007ffff72c96d3 in _int_malloc () from /lib/libc.so.6
>>> #4 0x00007ffff72cad5d in malloc () from /lib/libc.so.6
>>> #5 0x00007ffff7b46c46 in opal_free_list_grow (flist=0x7ffff239f150,
>>> num_elements=1) at class/opal_free_list.c:93
>>> #6 0x00007ffff2196152 in ompi_osc_rdma_replyreq_alloc
>>> (replyreq=0x7fffffffd0f8, origin_rank=1, module=0x842d10) at
>>> osc_rdma_replyreq.h:82
>>> #7 ompi_osc_rdma_replyreq_alloc_init (module=0x842d10, origin=1,
>>> origin_request=..., target_displacement=0, target_count=1,
>>> datatype=0x8455b0, replyreq=0x7fffffffd0f8) at osc_rdma_replyreq.c:40
>>> #8 0x00007ffff218c051 in component_fragment_cb (btl=0x7ffff3680ce0,
>>> tag=<optimized out>, descriptor=<optimized out>, cbdata=<optimized out>) at
>>> osc_rdma_component.c:633
>>> #9 0x00007ffff347f25f in mca_btl_sm_component_progress () at
>>> btl_sm_component.c:623
>>> #10 0x00007ffff7b1f80a in opal_progress () at runtime/opal_progress.c:207
>>> #11 0x00007ffff21977c5 in opal_condition_wait (m=<optimized out>,
>>> c=0x842de0) at ../../../../opal/threads/condition.h:99
>>> #12 ompi_osc_rdma_module_fence (assert=0, win=0x842170) at
>>> osc_rdma_sync.c:207
>>> #13 0x00007ffff7a89db5 in PMPI_Win_fence (assert=0, win=0x842170) at
>>> pwin_fence.c:60
>>> #14 0x00000000004010d8 in main (argc=2, argv=0x7fffffffd508) at win.c:60
>>>
>>> and rank 1 at
>>>
>>> (gdb) bt
>>> #0 0x00007ffff7288905 in raise () from /lib/libc.so.6
>>> #1 0x00007ffff7289d7b in abort () from /lib/libc.so.6
>>> #2 0x00007ffff72c675d in __malloc_assert () from /lib/libc.so.6
>>> #3 0x00007ffff72c96d3 in _int_malloc () from /lib/libc.so.6
>>> #4 0x00007ffff72cad5d in malloc () from /lib/libc.so.6
>>> #5 0x00007ffff7a5b3ce in opal_obj_new (cls=0x7ffff7db2060) at
>>> ../../opal/class/opal_object.h:469
>>> #6 opal_obj_new_debug (line=71, file=0x7ffff7b60323
>>> "ompi_datatype_create.c", type=0x7ffff7db2060) at
>>> ../../opal/class/opal_object.h:251
>>> #7 ompi_datatype_create (expectedSize=3) at ompi_datatype_create.c:71
>>> #8 0x00007ffff7a5b7e9 in ompi_datatype_create_indexed_block (count=1,
>>> bLength=1, pDisp=0x7fffee18a834, oldType=0x7ffff7db3640,
>>> newType=0x7fffffffd070) at ompi_datatype_create_indexed.c:124
>>> #9 0x00007ffff7a5a349 in __ompi_datatype_create_from_args (type=9,
>>> d=0x844f40, a=0x7fffee18a828, i=0x7fffee18a82c) at ompi_datatype_args.c:691
>>> #10 __ompi_datatype_create_from_packed_description
>>> (packed_buffer=0x7fffffffd108, remote_processor=0x652b90) at
>>> ompi_datatype_args.c:626
>>> #11 0x00007ffff7a5b045 in ompi_datatype_create_from_packed_description
>>> (packed_buffer=<optimized out>, remote_processor=<optimized out>) at
>>> ompi_datatype_args.c:779
>>> #12 0x00007ffff218bf60 in ompi_osc_base_datatype_create
>>> (payload=0x7fffffffd108, remote_proc=<optimized out>) at
>>> ../../../../ompi/mca/osc/base/osc_base_obj_convert.h:52
>>> #13 component_fragment_cb (btl=0x7ffff3680ce0, tag=<optimized out>,
>>> descriptor=<optimized out>, cbdata=<optimized out>) at
>>> osc_rdma_component.c:624
>>> #14 0x00007ffff347f25f in mca_btl_sm_component_progress () at
>>> btl_sm_component.c:623
>>> #15 0x00007ffff7b1f80a in opal_progress () at runtime/opal_progress.c:207
>>> #16 0x00007ffff21977c5 in opal_condition_wait (m=<optimized out>,
>>> c=0x842ee0) at ../../../../opal/threads/condition.h:99
>>> #17 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at
>>> osc_rdma_sync.c:207
>>> #18 0x00007ffff7a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at
>>> pwin_fence.c:60
>>> #19 0x00000000004010d8 in main (argc=2, argv=0x7fffffffd508) at win.c:60
>>>
>>> This looks like memory corruption, but Open MPI internals are too noisy
>>> under Valgrind for it to be obvious where to look. This is with Open MPI
>>> 1.5.4, but I observed the same thing with trunk. If I run with three
>>> processes, the graph is slightly different and only ranks 1 and 2 fail
>>> (rank 0 hangs).
>>>
>>
>>
>