Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openib RETRY EXCEEDED ERROR
From: Brett Pemberton (brett_at_[hidden])
Date: 2009-03-01 19:24:48


Matt Hughes wrote:
> 2009/2/26 Brett Pemberton <brett_at_[hidden]>:
>> [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org
>> to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status
>> number 12 for wr_id 38996224 opcode 0 qp_idx 0
>
> What OS are you using?

CentOS 5

> I've seen this error and many other Infiniband
> related errors on RedHat enterprise linux 4 update 4, with ConnectX
> cards and various versions of OFED, up to version 1.3. Depending on
> the MCA parameters, I also see hangs often enough to make native
> Infiniband unusable on this OS.
>

I'd appreciate some advice on whether I'm using OFED correctly.

I'm running OFED 1.4, but only the userland, not the kernel modules.
Is this a bad idea?

Basically, I recompile the OFED src.rpms for:

dapl, libibcm, libibcommon, libibmad, libibumad, libibverbs, libmthca,
librdmacm, libsdp, mstflint

I then install them onto CentOS, upgrading the in-distro versions.
Should I also be compiling ofa_kernel?
Could leaving it out be causing problems?

As explained off-list, I'm running the most recent firmware for my
cards, although the release is quite old:

hca_id: mthca0
         fw_ver: 1.2.0
         node_guid: 0002:c902:0024:3c6c
         sys_image_guid: 0002:c902:0024:3c6f
         vendor_id: 0x02c9
         vendor_part_id: 25204
         hw_ver: 0xA0
         board_id: MT_03B0140001
         phys_port_cnt: 1
                 port: 1
                         state: PORT_ACTIVE (4)
                         max_mtu: 2048 (4)
                         active_mtu: 2048 (4)
                         sm_lid: 1
                         port_lid: 34
                         port_lmc: 0x00
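
For comparing these fields across nodes, here is a minimal sketch that
parses output in the format shown above (it looks like ibv_devinfo
output) into a dict. The function name parse_devinfo and the SAMPLE
string are illustrative assumptions, not part of OFED or Open MPI, and
the parser only handles the simple "key: value" layout seen here.

```python
def parse_devinfo(text):
    """Parse 'key: value' lines into a dict; per-port fields go under 'ports'."""
    info = {"ports": []}
    current = info  # device-level fields until the first 'port:' line
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so GUIDs keep their inner colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "port":
            # Start a new per-port record; later fields belong to it.
            current = {"port": int(value)}
            info["ports"].append(current)
        else:
            current[key] = value
    return info


# Abbreviated copy of the output quoted above (hypothetical test input).
SAMPLE = """\
hca_id: mthca0
        fw_ver: 1.2.0
        node_guid: 0002:c902:0024:3c6c
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 2048 (4)
                        active_mtu: 2048 (4)
"""

info = parse_devinfo(SAMPLE)
print(info["hca_id"], info["fw_ver"], info["ports"][0]["state"])
```

Running this against the captured output from each node would make it
easy to spot a node whose firmware version or port state differs from
the rest.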

cheers,

        / Brett

-- 
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899