
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenIB problem: error polling HP CQ...
From: Matt Hughes ([hidden])
Date: 2008-06-06 01:51:53

2008/6/4 Jeff Squyres <jsquyres_at_[hidden]>:
> Would it be possible for you to try a trunk nightly tarball snapshot,
> perchance?

I have attempted to use openmpi-1.3a1r18569. After some pain getting
MPI_Comm_spawn to work (I will write about that in a separate
message), I was able to get my app started. It segfaults in
btl_openib_handle_incoming() by dereferencing a null pointer:

#0 0x0000000000000000 in ?? ()
#1 0x0000002a98059777 in btl_openib_handle_incoming (openib_btl=0xb8b900,
   ep=0xbecb70, frag=0xc8da80, byte_len=24) at btl_openib_component.c:2129
#2 0x0000002a9805b674 in handle_wc (hca=0xb80670, cq=0, wc=0x7fbfffdfd0)
   at btl_openib_component.c:2397
#3 0x0000002a9805bbef in poll_hca (hca=0xb80670, count=1)
   at btl_openib_component.c:2508
#4 0x0000002a9805c1ac in progress_one_hca (hca=0xb80670)
   at btl_openib_component.c:2616
#5 0x0000002a9805c24f in btl_openib_component_progress ()
   at btl_openib_component.c:2641
#6 0x0000002a97f42308 in mca_bml_r2_progress () at bml_r2.c:93
#7 0x0000002a95a44c2c in opal_progress () at runtime/opal_progress.c:187
#8 0x0000002a97d1f10c in opal_condition_wait (c=0x2a958b8b40, m=0x2a958b8bc0)
   at ../../../../opal/threads/condition.h:100
#9 0x0000002a97d1ef88 in ompi_request_wait_completion (req=0xbdfc80)
   at ../../../../ompi/request/request.h:381
#10 0x0000002a97d1ee64 in mca_pml_ob1_recv (addr=0xc52d14, count=1,
   datatype=0x2a958abe60, src=1, tag=-19, comm=0xbe0cf0, status=0x0)
   at pml_ob1_irecv.c:104
#11 0x0000002a98c1b182 in ompi_coll_tuned_gather_intra_basic_linear (
   sbuf=0x7fbfffe984, scount=1, sdtype=0x2a958abe60, rbuf=0xc52d10,
   rcount=1, rdtype=0x2a958abe60, root=0, comm=0xbe0cf0, module=0xda00e0)
   at coll_tuned_gather.c:408
#12 0x0000002a98c07fc1 in ompi_coll_tuned_gather_intra_dec_fixed (
   sbuf=0x7fbfffe984, scount=1, sdtype=0x2a958abe60, rbuf=0xc52d10,
   rcount=1, rdtype=0x2a958abe60, root=0, comm=0xbe0cf0, module=0xda00e0)
   at coll_tuned_decision_fixed.c:723
#13 0x0000002a95715f0f in PMPI_Gather (sendbuf=0x7fbfffe984, sendcount=1,
   sendtype=0x2a958abe60, recvbuf=0xc52d10, recvcount=1,
   recvtype=0x2a958abe60, root=0, comm=0xbe0cf0) at pgather.c:141

This same build works fine with the TCP component and at least doesn't
crash with 1.2.6. The only thing that may be unusual about my build
of Open MPI 1.3 is that it is configured with --without-memory-manager
(the memory manager seems to cause crashes in another library I use).
I tried
rebuilding, omitting --without-memory-manager, but it failed in the
same way.
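For completeness, the build was configured along these lines (the install
prefix is illustrative, and the comment reflects my understanding of what
the option does):

```shell
# Approximate build steps; the prefix path is illustrative.
# --without-memory-manager disables Open MPI's internal memory manager
# (the ptmalloc2 hooks), which seemed to conflict with another library.
./configure --prefix=/opt/openmpi-1.3a1r18569 --without-memory-manager
make all install
```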


> On May 29, 2008, at 3:50 AM, Matt Hughes wrote:
>> I have a program which uses MPI::Comm::Spawn to start processes on
>> compute nodes (c0-0, c0-1, etc). The communication between the
>> compute nodes consists of ISend and IRecv pairs, while communication
>> between the head node and the compute nodes consists of gather and
>> bcast operations.
>> After executing ~80 successful loops (gather/bcast pairs), I get this
>> error message from the head node process during a gather call:
>> [0,1,0][btl_openib_component.c:1332:btl_openib_component_progress]
>> from headnode.local to: c0-0 error polling HP CQ with status WORK
>> REQUEST FLUSHED ERROR status number 5 for wr_id 18504944 opcode 1
>> The relevant environment variables:
>> OMPI_MCA_btl_openib_rd_num=128
>> OMPI_MCA_btl_openib_verbose=1
>> OMPI_MCA_btl_base_verbose=1
>> OMPI_MCA_btl_openib_rd_low=75
>> OMPI_MCA_btl_base_debug=1
>> OMPI_MCA_btl_openib_warn_no_hca_params_found=0
>> OMPI_MCA_btl_openib_warn_default_gid_prefix=0
>> OMPI_MCA_btl=self,openib
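(For reference, those OMPI_MCA_* environment variables are equivalent to
passing the same settings as MCA parameters on the mpirun command line;
the application name here is illustrative:)

```shell
# Same settings expressed as mpirun MCA arguments (app name illustrative):
mpirun --mca btl self,openib \
       --mca btl_openib_rd_num 128 \
       --mca btl_openib_rd_low 75 \
       --mca btl_openib_verbose 1 \
       --mca btl_base_verbose 1 \
       --mca btl_base_debug 1 \
       --mca btl_openib_warn_no_hca_params_found 0 \
       --mca btl_openib_warn_default_gid_prefix 0 \
       ./my_app
```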
>> If rd_low and rd_num are left at their default values, the program
>> simply hangs in the gather call after about 20 iterations (a gather
>> and a bcast).
>> Can anyone shed any light on what this error message means or what
>> might be done about it?
>> Thanks,
>> mch
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
> --
> Jeff Squyres
> Cisco Systems