
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openib RETRY EXCEEDED ERROR
From: Biagio Lucini (B.Lucini_at_[hidden])
Date: 2009-02-27 05:49:50


Bogdan Costescu wrote:
>
> Brett Pemberton <brett_at_[hidden]> wrote:
>
>> [[1176,1],0][btl_openib_component.c:2905:handle_wc] from
>> tango092.vpac.org to: tango090 error polling LP CQ with status RETRY
>> EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
>
> I've seen this error with Mellanox ConnectX cards and OFED 1.2.x with
> all versions of OpenMPI that I have tried (1.2.x and pre-1.3) and some
> MVAPICH versions, from which I have concluded that the problem lies in
> the lower levels (OFED or IB card firmware). Indeed after the
> installation of OFED 1.3.x and a possible firmware update (not sure
> about the firmware as I don't admin that cluster), these errors have
> disappeared.
>

I can confirm this: I had a similar problem over Christmas, for which I
asked for help on this list. In fact the problem was not with OpenMPI
but with the OFED stack: upgrading OFED fixed it. We also upgraded the
firmware, since the OFED drivers were again complaining that it was too
old. Because we did both upgrades at once, as in Bogdan's case I am not
sure which one played the major role.

Biagio

-- 
=========================================================
Dr. Biagio Lucini				
Department of Physics, Swansea University
Singleton Park, SA2 8PP Swansea (UK)
Tel. +44 (0)1792 602284
=========================================================