
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenIB problem: error polling HP CQ...
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-21 09:37:53


Errr... That's not good. :-(

Do you have a small example that you can share that duplicates the
problem?
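
In case it helps narrow it down, even something very simple might trigger it.
Here's a rough sketch of the gather/bcast loop you describe (no Comm_spawn,
and the loop count and payload are just placeholders, not from your program):

```c
/* Hypothetical reproducer sketch: repeated gather/bcast, roughly
 * matching the reported pattern.  Loop count and payload values are
 * placeholders; the original failure appeared after ~20-80 loops. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *rbuf = NULL;
    if (rank == 0)
        rbuf = malloc(size * sizeof(int));  /* receive buffer on root only */

    for (int i = 0; i < 100; ++i) {
        int sval = rank + i;
        /* root gathers one int from every rank */
        MPI_Gather(&sval, 1, MPI_INT, rbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        /* root broadcasts a value back to everyone */
        int bval = (rank == 0) ? rbuf[0] : 0;
        MPI_Bcast(&bval, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        printf("completed all iterations\n");
        free(rbuf);
    }
    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and run across at least two IB nodes with the same MCA
settings you're using (e.g., -mca btl self,openib) to see if it reproduces.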

On Jun 6, 2008, at 1:51 AM, Matt Hughes wrote:

> 2008/6/4 Jeff Squyres <jsquyres_at_[hidden]>:
>> Would it be possible for you to try a trunk nightly tarball snapshot,
>> perchance?
>
> I have attempted to use openmpi-1.3a1r18569. After some pain getting
> MPI_Comm_spawn to work (I will write about that in a separate
> message), I was able to get my app started. It segfaults in
> btl_openib_handle_incoming() by dereferencing a null pointer:
>
> #0 0x0000000000000000 in ?? ()
> #1 0x0000002a98059777 in btl_openib_handle_incoming
> (openib_btl=0xb8b900,
> ep=0xbecb70, frag=0xc8da80, byte_len=24) at btl_openib_component.c:
> 2129
> #2 0x0000002a9805b674 in handle_wc (hca=0xb80670, cq=0,
> wc=0x7fbfffdfd0)
> at btl_openib_component.c:2397
> #3 0x0000002a9805bbef in poll_hca (hca=0xb80670, count=1)
> at btl_openib_component.c:2508
> #4 0x0000002a9805c1ac in progress_one_hca (hca=0xb80670)
> at btl_openib_component.c:2616
> #5 0x0000002a9805c24f in btl_openib_component_progress ()
> at btl_openib_component.c:2641
> #6 0x0000002a97f42308 in mca_bml_r2_progress () at bml_r2.c:93
> #7 0x0000002a95a44c2c in opal_progress () at runtime/
> opal_progress.c:187
> #8 0x0000002a97d1f10c in opal_condition_wait (c=0x2a958b8b40,
> m=0x2a958b8bc0)
> at ../../../../opal/threads/condition.h:100
> #9 0x0000002a97d1ef88 in ompi_request_wait_completion (req=0xbdfc80)
> at ../../../../ompi/request/request.h:381
> #10 0x0000002a97d1ee64 in mca_pml_ob1_recv (addr=0xc52d14, count=1,
> datatype=0x2a958abe60, src=1, tag=-19, comm=0xbe0cf0, status=0x0)
> at pml_ob1_irecv.c:104
> #11 0x0000002a98c1b182 in ompi_coll_tuned_gather_intra_basic_linear (
> sbuf=0x7fbfffe984, scount=1, sdtype=0x2a958abe60, rbuf=0xc52d10,
> rcount=1, rdtype=0x2a958abe60, root=0, comm=0xbe0cf0,
> module=0xda00e0)
> at coll_tuned_gather.c:408
> #12 0x0000002a98c07fc1 in ompi_coll_tuned_gather_intra_dec_fixed (
> sbuf=0x7fbfffe984, scount=1, sdtype=0x2a958abe60, rbuf=0xc52d10,
> rcount=1, rdtype=0x2a958abe60, root=0, comm=0xbe0cf0,
> module=0xda00e0)
> at coll_tuned_decision_fixed.c:723
> #13 0x0000002a95715f0f in PMPI_Gather (sendbuf=0x7fbfffe984,
> sendcount=1,
> sendtype=0x2a958abe60, recvbuf=0xc52d10, recvcount=1,
> recvtype=0x2a958abe60, root=0, comm=0xbe0cf0) at pgather.c:141
>
> This same build works fine with the TCP component and at least doesn't
> crash with 1.2.6. The only thing that may be unusual about my build
> of openmpi 1.3 is that it is configured with --without-memory-manager
> (it seems to cause crashes in another library I use). I tried
> rebuilding, omitting --without-memory-manager, but it failed in the
> same way.
>
> mch
>
>
>
>
>> On May 29, 2008, at 3:50 AM, Matt Hughes wrote:
>>
>>> I have a program which uses MPI::Comm::Spawn to start processes on
>>> compute nodes (c0-0, c0-1, etc). The communication between the
>>> compute nodes consists of ISend and IRecv pairs, while communication
>>> between the head node and the compute nodes consists of gather and
>>> bcast operations.
>>> After executing ~80 successful loops (gather/bcast pairs), I get
>>> this
>>> error message from the head node process during a gather call:
>>>
>>> [0,1,0][btl_openib_component.c:1332:btl_openib_component_progress]
>>> from headnode.local to: c0-0 error polling HP CQ with status WORK
>>> REQUEST FLUSHED ERROR status number 5 for wr_id 18504944 opcode 1
>>>
>>> The relevant environment variables:
>>> OMPI_MCA_btl_openib_rd_num=128
>>> OMPI_MCA_btl_openib_verbose=1
>>> OMPI_MCA_btl_base_verbose=1
>>> OMPI_MCA_btl_openib_rd_low=75
>>> OMPI_MCA_btl_base_debug=1
>>> OMPI_MCA_btl_openib_warn_no_hca_params_found=0
>>> OMPI_MCA_btl_openib_warn_default_gid_prefix=0
>>> OMPI_MCA_btl=self,openib
>>>
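
[For reference, the same MCA settings can equivalently be passed on the
mpirun command line; the process count and application name below are
placeholders:]

```shell
mpirun --mca btl self,openib \
       --mca btl_openib_rd_num 128 \
       --mca btl_openib_rd_low 75 \
       --mca btl_openib_verbose 1 \
       --mca btl_base_verbose 1 \
       -np 4 ./my_app
```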
>>> If rd_low and rd_num are left at their default values, the program
>>> simply hangs in the gather call after about 20 iterations (each
>>> iteration being a gather and a bcast).
>>>
>>> Can anyone shed any light on what this error message means or what
>>> might be done about it?
>>>
>>> Thanks,
>>> mch
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>

-- 
Jeff Squyres
Cisco Systems