Subject: [OMPI users] Failure in MPI_Gather
From: Guillaume Sylvand (guillaume.sylvand_at_[hidden])
Date: 2009-07-21 10:12:01


Hi,

I'm having a problem with MPI_Gather in Open MPI 1.3.3. The code that fails here works fine with MPICH 1.2.5, MPICH2 1.1 and HP-MPI 2.2.5 (I'm not blaming anyone, I just want to understand!). My code runs locally on a dual-processor, 32-bit Debian machine with 2 processes, and fails during an MPI_Gather call with the following message:
[sabrina:14631] *** An error occurred in MPI_Gather
[sabrina:14631] *** on communicator MPI COMMUNICATOR 37 SPLIT FROM 5
[sabrina:14631] *** MPI_ERR_TRUNCATE: message truncated
[sabrina:14631] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
When I run it with memchecker, Valgrind produces the following message about an uninitialised value (I know Valgrind is sometimes wrong about this kind of error):
==14634==
==14634== Conditional jump or move depends on uninitialised value(s)
==14634==    at 0x42E3A4C: ompi_convertor_need_buffers (convertor.h:175)
==14634==    by 0x42E3800: mca_pml_ob1_recv_request_ack (pml_ob1_recvreq.c:264)
==14634==    by 0x42E5566: mca_pml_ob1_recv_request_progress_rndv (pml_ob1_recvreq.c:554)
==14634==    by 0x42E1316: mca_pml_ob1_recv_frag_match (pml_ob1_recvfrag.c:641)
==14634==    by 0x42DFFDD: mca_pml_ob1_recv_frag_callback_rndv (pml_ob1_recvfrag.c:259)
==14634==    by 0x42322E7: mca_btl_sm_component_progress (btl_sm_component.c:426)
==14634==    by 0x44E3CF4: opal_progress (opal_progress.c:207)
==14634==    by 0x41A6E66: opal_condition_wait (condition.h:99)
==14634==    by 0x41A73E6: ompi_request_default_wait_all (req_wait.c:262)
==14634==    by 0x424E99A: ompi_coll_tuned_gather_intra_linear_sync (coll_tuned_gather.c:328)
==14634==    by 0x423CB98: ompi_coll_tuned_gather_intra_dec_fixed (coll_tuned_decision_fixed.c:718)
==14634==    by 0x4252B9E: mca_coll_sync_gather (coll_sync_gather.c:46)
==14634==

This is the first error message, apart from those produced during MPI_Init(). If I attach the debugger, I get the following backtrace:
0x042e3a4c in ompi_convertor_need_buffers (pConvertor=0x4a2c000)
    at ../../../../../../ompi/datatype/convertor.h:175
175     ../../../../../../ompi/datatype/convertor.h: No such file or directory.
        in ../../../../../../ompi/datatype/convertor.h
(gdb) where
#0  0x042e3a4c in ompi_convertor_need_buffers (pConvertor=0x4a2c000)
    at ../../../../../../ompi/datatype/convertor.h:175
#1  0x042e3801 in mca_pml_ob1_recv_request_ack (recvreq=0x4a2bf80,
    hdr=0x95b0a90, bytes_received=4032)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvreq.c:264
#2  0x042e5567 in mca_pml_ob1_recv_request_progress_rndv (recvreq=0x4a2bf80,
    btl=0x4375260, segments=0xbecc3490, num_segments=1)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvreq.c:554
#3  0x042e1317 in mca_pml_ob1_recv_frag_match (btl=0x4375260, hdr=0x95b0a90,
    segments=0xbecc3490, num_segments=1, type=66)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:641
#4  0x042dffde in mca_pml_ob1_recv_frag_callback_rndv (btl=0x4375260,
    tag=66 'B', des=0xbecc3438, cbdata=0x0)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:259
#5  0x042322e8 in mca_btl_sm_component_progress ()
    at ../../../../../../ompi/mca/btl/sm/btl_sm_component.c:426
#6  0x044e3cf5 in opal_progress () at ../../../opal/runtime/opal_progress.c:207
#7  0x041a6e67 in opal_condition_wait (c=0x4382700, m=0x4382760)
    at ../../../opal/threads/condition.h:99
#8  0x041a73e7 in ompi_request_default_wait_all (count=2, requests=0x4ef5360,
    statuses=0x0) at ../../../ompi/request/req_wait.c:262
#9  0x0424e99b in ompi_coll_tuned_gather_intra_linear_sync (sbuf=0x4ebd438,
    scount=3016, sdtype=0x4a3aa70, rbuf=0x4ecda00, rcount=1, rdtype=0x4f4b348,
    root=0, comm=0x4d0d8a8, module=0x4d0e220, first_segment_size=1024)
    at ../../../../../../ompi/mca/coll/tuned/coll_tuned_gather.c:328
#10 0x0423cb99 in ompi_coll_tuned_gather_intra_dec_fixed (sbuf=0x4ebd438,
    scount=3016, sdtype=0x4a3aa70, rbuf=0x4ecda00, rcount=1, rdtype=0x4f4b348,
    root=0, comm=0x4d0d8a8, module=0x4d0e220)
    at ../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:718
#11 0x04252b9f in mca_coll_sync_gather (sbuf=0x4ebd438, scount=3016,
    sdtype=0x4a3aa70, rbuf=0x4ecda00, rcount=1, rdtype=0x4f4b348, root=0,
    comm=0x4d0d8a8, module=0x4d0e098)
    at ../../../../../../ompi/mca/coll/sync/coll_sync_gather.c:46
#12 0x041db441 in PMPI_Gather (sendbuf=0x4ebd438, sendcount=3016,
    sendtype=0x4a3aa70, recvbuf=0x4ecda00, recvcount=1, recvtype=0x4f4b348,
    root=0, comm=0x4d0d8a8) at pgather.c:175
#13 0x082a47c9 in MPF_GEMV_SPARSE_INCORE (comm_row=0x4d0ce38,
    comm_col=0x4d0d8a8, transa=84 'T', M=232, N=464, P=232, Q=13,
    ALPHA=0x8d22e88, gBuffer=0x4f4aec0, bufferB=0x4f3f210, bufferC=0x4ebd438)
    at /home/gsylvand/BE_COMMON/MPF/src/MAT/IMPLS/SPARSE/matgemv_sparse.c:160
#14 0x082a592b in MPF_GEMV_SPARSE (TRANSA=0xbecc38f7 "T", ALPHA=0x8d22e88,
    matA=0x4d0b7d0, vecB=0x4cbb7e0, BETA=0x8d22e88, vecC=0x4f3d8f0)
    at /home/gsylvand/BE_COMMON/MPF/src/MAT/IMPLS/SPARSE/matgemv_sparse.c:331
#15 0x08251f2a in MPF_GEMV (transa=0x8c937ec "T", alpha=0x8d22e88,
    matA=0x4d0b7d0, vecB=0x4cbb7e0, beta=0x8d22e88, vecC=0x4f3d8f0)
    at /home/gsylvand/BE_COMMON/MPF/src/MAT/INTERFACE/mat_gemv.c:150
#16 0x080ab641 in main (argc=1, argv=0xbecc3aa4)
    at /home/gsylvand/ACTIPOLE/src/COUCHA/SRC/coucha.c:358
The content of pConvertor is:
(gdb)  p pConvertor[0]
$2 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x43741e0,
    obj_reference_count = 1,
    cls_init_file_name = 0x435687c "../../../../../ompi/mca/pml/base/pml_base_recvreq.c", cls_init_lineno = 42}, remoteArch = 4291428864, flags = 134873088,
  local_size = 0, remote_size = 0, pDesc = 0x0, use_desc = 0x0, count = 0,
  pBaseBuf = 0x0, pStack = 0x4a2c060, stack_size = 5, fAdvance = 0,
  master = 0x485eb60, stack_pos = 0, bConverted = 0, partial_length = 0,
  checksum = 0, csum_ui1 = 0, csum_ui2 = 0, static_stack = {{index = 0,
      type = 0, count = 0, disp = 0}, {index = 0, type = 0, count = 0,
      disp = 0}, {index = 0, type = 0, count = 0, disp = 0}, {index = 0,
      type = 0, count = 0, disp = 0}, {index = 0, type = 0, count = 0,
      disp = 0}}}

The MPI_Gather that fails is a bit complicated, since it uses MPI datatypes created with MPI_Type_vector and MPI_Type_struct. The call is:
/* here we have N=464 P=232 Q=13 */
    bufferC = calloc(P * Q, 2*sizeof(double));
    bufferE = calloc(N * Q, 2*sizeof(double));
....
    ierr = MPI_Gather( bufferC, P*Q, BasicType, bufferE, 1, NStridedType, 0, comm_col );
where BasicType is a double complex created with:
    MPI_Type_contiguous(2, MPI_DOUBLE, &BasicType);
    MPI_Type_commit(&BasicType);
and NStridedType is Q blocks of P complexes taken every N, with its extent forced to P (one such type therefore holds Q*P BasicType elements, matching the P*Q BasicType elements that each rank sends). It is created with:
  MPI_Type_vector(Q, P, N, BasicType, &QPNStridedType);  /* Q blocks of P BasicType every N */
  disp[0] = 0;
  type[0] = QPNStridedType;
  blocklen[0] = 1;
  MPI_Type_extent(BasicType, &(disp[1]));
  disp[1] *= P;
  type[1] = MPI_UB;
  blocklen[1] = 1;
  MPI_Type_struct(2, blocklen, disp, type, &NStridedType);  /* just to set the extent to P */
  MPI_Type_commit(&NStridedType);
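
For reference, here is the whole construction gathered into one self-contained sketch (the sizes are hard-coded to the N=464, P=232, Q=13 values from the failing run, the buffers are just zero-filled, MPI_COMM_WORLD stands in for comm_col, and the receive buffer is sized for the 2-process layout where N = 2*P):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int N = 464, P = 232, Q = 13;   /* values from the failing call */
    MPI_Datatype BasicType, QPNStridedType, NStridedType, type[2];
    MPI_Aint disp[2];
    int blocklen[2], rank;
    double *bufferC, *bufferE = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* BasicType: one double complex stored as two doubles */
    MPI_Type_contiguous(2, MPI_DOUBLE, &BasicType);
    MPI_Type_commit(&BasicType);

    /* Q blocks of P BasicType, taken every N BasicType */
    MPI_Type_vector(Q, P, N, BasicType, &QPNStridedType);

    /* Wrap it in a struct whose only purpose is to shrink the extent to
       P BasicType, so that the root's receive slots interleave by P */
    disp[0] = 0;
    type[0] = QPNStridedType;
    blocklen[0] = 1;
    MPI_Type_extent(BasicType, &disp[1]);
    disp[1] *= P;
    type[1] = MPI_UB;
    blocklen[1] = 1;
    MPI_Type_struct(2, blocklen, disp, type, &NStridedType);
    MPI_Type_commit(&NStridedType);

    /* Each rank sends P*Q complexes; the root receives one NStridedType
       per rank (also P*Q complexes), so the type signatures match.
       bufferE is sized for the 2-process case (N*Q = 2*P*Q complexes). */
    bufferC = calloc((size_t)P * Q, 2 * sizeof(double));
    if (rank == 0)
        bufferE = calloc((size_t)N * Q, 2 * sizeof(double));

    MPI_Gather(bufferC, P * Q, BasicType, bufferE, 1, NStridedType,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("gather done\n");

    MPI_Type_free(&NStridedType);
    MPI_Type_free(&QPNStridedType);
    MPI_Type_free(&BasicType);
    free(bufferC);
    free(bufferE);
    MPI_Finalize();
    return 0;
}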

As mentioned earlier, this works with other MPI implementations, and this kind of mechanism is widely used in this code and usually works fine.
Moreover, if I replace the MPI_Gather with an MPI_Allgather, the problem disappears and it works:
ierr = MPI_Allgather(bufferC, P*Q, BasicType, bufferE, 1, NStridedType, comm_col); CHKERRQ(ierr);
Another strange thing: if I try to write a small test.c with these same calls to reproduce the bug, the bug does not appear and it works. :(
Any suggestions on something to test?
Thanks in advance for your help,
Best regards,

Guillaume
-- 
Guillaume SYLVAND