Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Barrier/coll_tuned/pml_ob1 segfault for derived data types
From: George Bosilca (bosilca_at_[hidden])
Date: 2012-06-15 09:44:13


There should be no datatype attached to the barrier, so it is normal that you see zero values in the convertor.

Something weird is definitely going on. Since there is no data to send, the opal_convertor_set_position function is supposed to take the special path: mark the convertor as completed and return successfully. That no longer seems to be the case, because your backtrace shows a call to opal_convertor_set_position_nocheck, which only happens if the test described above fails.
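To make the expected control flow concrete, here is a minimal sketch of that special path. It is not the actual Open MPI source: the struct, the `sketch_` names, the flag value and the exact form of the size test are all assumptions made for illustration. The point is only that a zero-length send should be marked completed before the slow path, which needs a valid datatype description, is ever reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real opal_convertor_t. */
#define SKETCH_COMPLETED 0x1u

typedef struct {
    size_t      local_size;      /* total bytes to pack (datatype size * count) */
    size_t      converted;       /* bytes already packed */
    uint32_t    flags;           /* SKETCH_COMPLETED once nothing is left to do */
    const void *desc;            /* datatype description; NULL if none attached */
    int         slow_path_calls; /* instrumentation for this sketch only */
} sketch_convertor_t;

/* Slow path: in real code this would walk the datatype description,
 * so a NULL desc here corresponds to the reported crash. */
static int sketch_set_position_nocheck(sketch_convertor_t *conv, size_t *position)
{
    conv->slow_path_calls++;
    if (conv->desc == NULL) {
        return -1;               /* stand-in for the NULL dereference */
    }
    conv->converted = *position;
    return 0;
}

/* Fast path first: if the requested position is at or past the end of the
 * data (always true when local_size == 0), mark the convertor completed. */
static int sketch_set_position(sketch_convertor_t *conv, size_t *position)
{
    assert(conv != NULL && position != NULL);

    if (conv->local_size <= *position) {
        conv->flags |= SKETCH_COMPLETED;
        conv->converted = conv->local_size;
        *position = conv->converted;
        return 0;
    }
    return sketch_set_position_nocheck(conv, position);
}
```

With `local_size == 0` and `*position == 0`, as in the barrier case, this sketch never enters the slow path, so seeing the `_nocheck` call in the backtrace suggests the size test is not evaluating as expected.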

I had some doubts about r26597, but I won't have time to look into it until Monday. Maybe you can back it out and see if you still get the same segfault.

  george.

On Jun 15, 2012, at 01:24 , Eugene Loh wrote:

> I see a segfault show up in trunk testing starting with r26598 when tests like
>
> ibm collective/struct_gatherv
> intel src/MPI_Type_free_[types|pending_msg]_[f|c]
>
> are run over openib. Here is a typical stack trace:
>
> opal_convertor_create_stack_at_begining(convertor = 0x689730, sizes), line 404 in "opal_convertor.c"
> opal_convertor_set_position_nocheck(convertor = 0x689730, position), line 423 in "opal_convertor.c"
> opal_convertor_set_position(convertor = 0x689730, position = 0x7fffc36e0bf0), line 321 in "opal_convertor.h"
> mca_pml_ob1_send_request_start_copy(sendreq, bml_btl = 0x6a3ea0, size = 0), line 485 in "pml_ob1_sendreq.c"
> mca_pml_ob1_send_request_start_btl(sendreq, bml_btl), line 387 in "pml_ob1_sendreq.h"
> mca_pml_ob1_send_request_start(sendreq = 0x689680), line 458 in "pml_ob1_sendreq.h"
> mca_pml_ob1_isend(buf = (nil), count = 0, datatype, dst = 2, tag = -16, sendmode = MCA_PML_BASE_SEND_STANDARD, comm, request), line 87 in "pml_ob1_isend.c"
> ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, sdatatype, dest = 2, stag = -16, recvbuf = (nil), rcount = 0, rdatatype, source = 2, rtag = -16, comm, status = (nil)), line 51 in "coll_tuned_util.c"
> ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), line 172 in "coll_tuned_barrier.c"
> ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 in "coll_tuned_decision_fixed.c"
> PMPI_Barrier(comm = 0x5195a0), line 62 in "pbarrier.c"
> main(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x403219
>
> The fact that some derived data types were sent before seems to have something to do with it. I see this sort of problem cropping up in Cisco and Oracle testing. Up at the level of pml_ob1_send_request_start_copy, at line 485:
>
> MCA_PML_OB1_SEND_REQUEST_RESET(sendreq);
>
> I see
>
> *sendreq->req_send.req_base.req_convertor.use_desc = {
> length = 0
> used = 0
> desc = (nil)
> }
>
> and I guess that desc=NULL is causing the segfault at opal_convertor.c line 404.
>
> Anyhow, I'm trudging along, but thought I would share at least that much with you helpful folks in case any of this is ringing a bell.