Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openib RETRY EXCEEDED ERROR
From: Åke Sandgren (ake.sandgren_at_[hidden])
Date: 2009-02-27 12:09:38


On Fri, 2009-02-27 at 09:54 -0700, Matt Hughes wrote:
> 2009/2/26 Brett Pemberton <brett_at_[hidden]>:
> > [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org
> > to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status
> > number 12 for wr_id 38996224 opcode 0 qp_idx 0
>
> What OS are you using? I've seen this error and many other InfiniBand
> related errors on Red Hat Enterprise Linux 4 update 4, with ConnectX
> cards and various versions of OFED, up to version 1.3. Depending on
> the MCA parameters, I also see hangs often enough to make native
> InfiniBand unusable on this OS.
>
> However, the openib btl works just fine on the same hardware and the
> same OFED/Open MPI stack when used with CentOS 4.6. I suspect there
> may be something about the kernel that is contributing to these
> problems, but I haven't had a chance to test the kernel from 4.6 on
> 4.4.

We see these errors fairly frequently on our CentOS 5.2 system with
Mellanox InfiniHost III cards. The OFED stack is whatever CentOS 5.2
ships. Has anyone tested this with the OFED 1.4 stack?
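
For anyone who wants to compare how often this shows up across systems,
here is a rough sketch (not anything that ships with Open MPI) that pulls
the fields out of a handle_wc message like the one Brett quoted and counts
errors per source/destination pair. The regular expression is only inferred
from that one sample line, and the function names are made up for
illustration, so other Open MPI versions may well need a different pattern.
Status number 12 is IBV_WC_RETRY_EXC_ERR in the libibverbs ibv_wc_status
enum.

```python
import re
from collections import Counter

# Pattern inferred from a single sample message; this is an assumption,
# not a documented Open MPI log format.
_WC_ERROR = re.compile(
    r"\[(?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\]\s+"
    r"from\s+(?P<src>\S+)\s+to:\s+(?P<dst>\S+)\s+"
    r"error polling (?P<cq>\w+) CQ with status (?P<status>.+?)\s+"
    r"status number (?P<status_num>\d+)\s+"
    r"for wr_id (?P<wr_id>\d+)\s+opcode (?P<opcode>\d+)\s+qp_idx (?P<qp_idx>\d+)"
)

_INT_FIELDS = ("line", "status_num", "wr_id", "opcode", "qp_idx")


def parse_wc_error(text):
    """Return a dict of fields from one error message, or None if no match.

    The message may be wrapped over several lines; whitespace is collapsed
    before matching. Mail quote markers ("> ") must already be stripped.
    """
    flat = " ".join(text.split())
    match = _WC_ERROR.search(flat)
    if match is None:
        return None
    fields = match.groupdict()
    for name in _INT_FIELDS:
        fields[name] = int(fields[name])
    return fields


def count_peer_pairs(messages):
    """Count parsed errors per (source host, destination host) pair."""
    pairs = Counter()
    for message in messages:
        fields = parse_wc_error(message)
        if fields is not None:
            pairs[(fields["src"], fields["dst"])] += 1
    return pairs


if __name__ == "__main__":
    sample = (
        "[[1176,1],0][btl_openib_component.c:2905:handle_wc] from "
        "tango092.vpac.org\n"
        "to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR "
        "status\n"
        "number 12 for wr_id 38996224 opcode 0 qp_idx 0"
    )
    print(parse_wc_error(sample))
    print(count_peer_pairs([sample, "unrelated line"]))
```

If one host pair dominates the counts, that points at a cable, port or HCA
rather than at the OS or OFED version.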

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake_at_[hidden]   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se