Open MPI Development Mailing List Archives

Subject: [OMPI devel] Barrier/coll_tuned/pml_ob1 segfault for derived data types
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2012-06-14 19:24:08


I see a segfault show up in trunk testing starting with r26598 when
tests like

     ibm collective/struct_gatherv
     intel src/MPI_Type_free_[types|pending_msg]_[f|c]

are run over openib. Here is a typical stack trace:

    opal_convertor_create_stack_at_begining(convertor = 0x689730, sizes), line 404 in "opal_convertor.c"
    opal_convertor_set_position_nocheck(convertor = 0x689730, position), line 423 in "opal_convertor.c"
    opal_convertor_set_position(convertor = 0x689730, position = 0x7fffc36e0bf0), line 321 in "opal_convertor.h"
    mca_pml_ob1_send_request_start_copy(sendreq, bml_btl = 0x6a3ea0, size = 0), line 485 in "pml_ob1_sendreq.c"
    mca_pml_ob1_send_request_start_btl(sendreq, bml_btl), line 387 in "pml_ob1_sendreq.h"
    mca_pml_ob1_send_request_start(sendreq = 0x689680), line 458 in "pml_ob1_sendreq.h"
    mca_pml_ob1_isend(buf = (nil), count = 0, datatype, dst = 2, tag = -16, sendmode = MCA_PML_BASE_SEND_STANDARD, comm, request), line 87 in "pml_ob1_isend.c"
    ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, sdatatype, dest = 2, stag = -16, recvbuf = (nil), rcount = 0, rdatatype, source = 2, rtag = -16, comm, status = (nil)), line 51 in "coll_tuned_util.c"
    ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), line 172 in "coll_tuned_barrier.c"
    ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 in "coll_tuned_decision_fixed.c"
    PMPI_Barrier(comm = 0x5195a0), line 62 in "pbarrier.c"
    main(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x403219

The segfault seems to depend on some derived datatypes having been sent
earlier in the run. I see this sort of problem cropping up in both Cisco
and Oracle testing. Farther up the stack, in
mca_pml_ob1_send_request_start_copy at line 485 of pml_ob1_sendreq.c:

    MCA_PML_OB1_SEND_REQUEST_RESET(sendreq);

I see

     *sendreq->req_send.req_base.req_convertor.use_desc = {
         length = 0
         used = 0
         desc = (nil)
     }

and I guess that desc=NULL is causing the segfault at opal_convertor.c
line 404.
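
To make the guess concrete, here is a minimal sketch in C of the kind of
failure I suspect, assuming the code at line 404 reads the first entry of
desc without checking it (I have not confirmed that against the source).
The toy_* types and toy_first_elem_count() are hypothetical stand-ins;
only the field names length, used and desc come from the dump above, and
none of this is the real Open MPI code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real Open MPI definitions.
 * Only the field names length/used/desc are taken from the dump. */
typedef struct {
    size_t count;            /* illustrative per-element field */
} toy_elem_t;

typedef struct {
    size_t      length;      /* allocated entries in desc */
    size_t      used;        /* valid entries in desc */
    toy_elem_t *desc;        /* NULL in the dump above */
} toy_type_desc_t;

typedef struct {
    toy_type_desc_t *use_desc;
} toy_convertor_t;

/* Stores the first element's count in *count_out and returns 0 when the
 * descriptor array is usable; returns -1 and leaves *count_out untouched
 * otherwise. An unguarded d->desc[0] read would dereference NULL for the
 * state shown in the dump. */
int toy_first_elem_count(const toy_convertor_t *conv, size_t *count_out)
{
    assert(conv != NULL && count_out != NULL);

    const toy_type_desc_t *d = conv->use_desc;
    if (d == NULL || d->desc == NULL || d->used == 0) {
        return -1;
    }
    *count_out = d->desc[0].count;
    return 0;
}
```

With a descriptor in the state shown in the dump (length = 0, used = 0,
desc = NULL), the guarded helper returns -1 instead of faulting. Whether
a guard like this is the right fix, or whether desc should never be NULL
at that point in the first place, is still an open question for me.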

Anyhow, I'm trudging along, but thought I would share at least that much
with you helpful folks in case any of this is ringing a bell.