Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] Barrier/coll_tuned/pml_ob1 segfault for derived data types
From: George Bosilca (bosilca_at_[hidden])
Date: 2012-06-15 16:24:14


On Jun 15, 2012, at 20:59 , Nathan Hjelm wrote:

> Seems like either a bug in the convertor code or in setting up the send request. r26597 ensures correctness in the case where the BTL's sendi does all three of the following: returns an error, changes the convertor, and returns a descriptor.

None of the above. There is a shortcut in the PML that skips creating a convertor when the amount of data is zero. This shortcut saves a few tens of instructions in the critical path.
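To make the interaction concrete, here is a minimal sketch in C of how a zero-byte shortcut at request setup can collide with an unconditional convertor reset, and how a size > 0 guard sidesteps it. Everything named sketch_* is hypothetical and invented for illustration; these are not the real opal_convertor_t or ob1 send request structures, and the real reset is a macro rather than a function. The rewind helper reports an error where the real code would dereference the NULL description and crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a convertor; not the real opal_convertor_t. */
typedef struct {
    const int *desc;      /* datatype description; NULL if never prepared */
    size_t     length;    /* number of entries in desc */
    size_t     position;  /* current pack position in bytes */
    bool       completed; /* true once all data has been packed */
} sketch_convertor_t;

/* Hypothetical stand-in for a send request. */
typedef struct {
    sketch_convertor_t convertor;
    size_t             bytes_packed; /* total payload size of the request */
} sketch_send_request_t;

/* Request setup with the zero-byte shortcut: the convertor is only
 * prepared when there is something to send. */
static void sketch_request_init(sketch_send_request_t *req,
                                const int *desc, size_t desc_len,
                                size_t bytes)
{
    sketch_convertor_t empty = {0};

    assert(req != NULL);
    req->convertor = empty;
    req->bytes_packed = bytes;
    if (bytes > 0) {
        req->convertor.desc = desc;
        req->convertor.length = desc_len;
    }
}

/* Rewind the convertor to position 0. Returns 0 on success and -1 if the
 * convertor was never prepared. In this sketch the NULL check replaces
 * the crash that an unchecked dereference of desc would cause. */
static int sketch_convertor_rewind(sketch_convertor_t *conv)
{
    if (conv->desc == NULL) {
        return -1;
    }
    conv->position = 0;
    conv->completed = false;
    return 0;
}

/* Guarded reset: only touch the convertor when the request carries data,
 * mirroring the "size > 0" protection mentioned in the quoted message. */
static int sketch_request_reset(sketch_send_request_t *req)
{
    if (req->bytes_packed > 0) {
        return sketch_convertor_rewind(&req->convertor);
    }
    return 0; /* zero-byte request: nothing to rewind */
}
```

With a zero-byte request, as in the barrier's sendrecv, sketch_convertor_rewind fails on the unprepared convertor while sketch_request_reset returns success without touching it.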

  george.

>
> Until we can find the root cause, I pushed a change that protects the reset by checking if size > 0.
>
> Let me know if that works for you.
>
> -Nathan
>
> On Fri, Jun 15, 2012 at 12:34:32PM -0400, Eugene Loh wrote:
>> Backing out r26597 solves my particular test cases. I'll back it
>> out of the trunk as well unless someone has objections.
>>
>> I like how you say "same segfault." In certain cases, I just go on
>> to different segfaults. E.g.,
>>
>> [2] btl_openib_handle_incoming(openib_btl, ep, frag, byte_len =
>> 20U), line 3208 in "btl_openib_component.c"
>> [3] handle_wc(device, cq = 0, wc), line 3516 in "btl_openib_component.c"
>> [4] poll_device(device, count = 1), line 3654 in "btl_openib_component.c"
>> [5] progress_one_device(device), line 3762 in "btl_openib_component.c"
>> [6] btl_openib_component_progress(), line 3787 in
>> "btl_openib_component.c"
>> [7] opal_progress(), line 207 in "opal_progress.c"
>> [8] opal_condition_wait(c, m), line 100 in "condition.h"
>> [9] ompi_request_default_wait_all(count = 2U, requests, statuses),
>> line 281 in "req_wait.c"
>> [10] ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0,
>> sdatatype, dest = 0, stag = -16, recvbuf = (nil), rcount = 0,
>> rdatatype, source = 0, rtag = -16, comm, status = (nil)), line 54 in
>> "coll_tuned_util.c"
>> [11] ompi_coll_tuned_barrier_intra_recursivedoubling(comm,
>> module), line 172 in "coll_tuned_barrier.c"
>> [12] ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line
>> 207 in "coll_tuned_decision_fixed.c"
>> [13] PMPI_Barrier(comm = 0x518370), line 62 in "pbarrier.c"
>>
>> The reg->cbfunc is NULL. I'm still considering whether that's an
>> artifact of how I build that particular case, though.
>>
>> On 06/15/12 09:44, George Bosilca wrote:
>>> There should be no datatype attached to the barrier, so it is normal that you get zero values in the convertor.
>>>
>>> Something weird is definitely going on. As there is no data to be sent, the opal_convertor_set_position function is supposed to trigger the special path, mark the convertor as completed, and return successfully. However, this no longer seems to be the case: in your backtrace I see the call to opal_convertor_set_position_nocheck, which only happens if the test described above fails.
>>>
>>> I had some doubts about r26597, but I don't have time to look into it until Monday. Maybe you can remove it and see if you continue to have the same segfault.
>>>
>>> george.
>>>
>>> On Jun 15, 2012, at 01:24 , Eugene Loh wrote:
>>>
>>>> I see a segfault show up in trunk testing starting with r26598 when tests like
>>>>
>>>> ibm collective/struct_gatherv
>>>> intel src/MPI_Type_free_[types|pending_msg]_[f|c]
>>>>
>>>> are run over openib. Here is a typical stack trace:
>>>>
>>>> opal_convertor_create_stack_at_begining(convertor = 0x689730, sizes), line 404 in "opal_convertor.c"
>>>> opal_convertor_set_position_nocheck(convertor = 0x689730, position), line 423 in "opal_convertor.c"
>>>> opal_convertor_set_position(convertor = 0x689730, position = 0x7fffc36e0bf0), line 321 in "opal_convertor.h"
>>>> mca_pml_ob1_send_request_start_copy(sendreq, bml_btl = 0x6a3ea0, size = 0), line 485 in "pml_ob1_sendreq.c"
>>>> mca_pml_ob1_send_request_start_btl(sendreq, bml_btl), line 387 in "pml_ob1_sendreq.h"
>>>> mca_pml_ob1_send_request_start(sendreq = 0x689680), line 458 in "pml_ob1_sendreq.h"
>>>> mca_pml_ob1_isend(buf = (nil), count = 0, datatype, dst = 2, tag = -16, sendmode = MCA_PML_BASE_SEND_STANDARD, comm, request), line 87 in "pml_ob1_isend.c"
>>>> ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, sdatatype, dest = 2, stag = -16, recvbuf = (nil), rcount = 0, rdatatype, source = 2, rtag = -16, comm, status = (nil)), line 51 in "coll_tuned_util.c"
>>>> ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), line 172 in "coll_tuned_barrier.c"
>>>> ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 in "coll_tuned_decision_fixed.c"
>>>> PMPI_Barrier(comm = 0x5195a0), line 62 in "pbarrier.c"
>>>> main(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x403219
>>>>
>>>> The fact that some derived data types were sent before seems to have something to do with it. I see this sort of problem cropping up in Cisco and Oracle testing. Up at the level of pml_ob1_send_request_start_copy, at line 485:
>>>>
>>>> MCA_PML_OB1_SEND_REQUEST_RESET(sendreq);
>>>>
>>>> I see
>>>>
>>>> *sendreq->req_send.req_base.req_convertor.use_desc = {
>>>> length = 0
>>>> used = 0
>>>> desc = (nil)
>>>> }
>>>>
>>>> and I guess that desc=NULL is causing the segfault at opal_convertor.c line 404.
>>>>
>>>> Anyhow, I'm trudging along, but thought I would share at least that much with you helpful folks in case any of this is ringing a bell.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel