Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openib RETRY EXCEEDED ERROR
From: Pavel Shamis (Pasha) (pashash_at_[hidden])
Date: 2009-02-27 12:33:26


A "retry exceeded" error usually points to a network problem, such as a
bad cable or a bad connector. You can use the ibdiagnet tool to debug the
network: http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
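
Before running fabric diagnostics, it can help to see which node pairs show up in the errors, since a pair that repeats often hints at one cable or port. The following is a minimal sketch, not part of Open MPI or OFED: the function names are hypothetical, and the pattern is modeled only on the single error line quoted in this thread, so other Open MPI versions may need a different expression. It expects raw job output, without mail quoting markers.

```python
import re
from collections import Counter

# Assumption: the message layout matches the one error line quoted in this
# thread ("from <host> to: <host> error polling LP CQ with status RETRY
# EXCEEDED ERROR"). Other Open MPI versions may word it differently.
RETRY_RE = re.compile(
    r"from\s+(?P<src>\S+)\s+to:\s+(?P<dst>\S+)\s+error polling \S+ CQ "
    r"with status RETRY\s+EXCEEDED ERROR"
)


def short_name(host):
    """Drop the domain so tango092.vpac.org and tango092 count as one node."""
    return host.split(".", 1)[0]


def tally_retry_errors(log_text):
    """Count RETRY EXCEEDED errors per (source, destination) node pair."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    pairs = Counter()
    for match in RETRY_RE.finditer(flat):
        pair = (short_name(match.group("src")), short_name(match.group("dst")))
        pairs[pair] += 1
    return pairs


if __name__ == "__main__":
    # The error line from this thread, wrapped the way a terminal might wrap it.
    sample = (
        "[[1176,1],0][btl_openib_component.c:2905:handle_wc] from\n"
        "tango092.vpac.org to: tango090 error polling LP CQ with status RETRY\n"
        "EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0\n"
    )
    for (src, dst), count in tally_retry_errors(sample).most_common():
        print(f"{src} -> {dst}: {count}")
```

Nodes that appear in many pairs are reasonable first candidates for a cable or connector check; the tally only narrows the search and does not replace a fabric-level diagnostic.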

Pasha

Brett Pemberton wrote:
> Hey,
>
> I've had a couple of errors recently, of the form:
>
> [[1176,1],0][btl_openib_component.c:2905:handle_wc] from
> tango092.vpac.org to: tango090 error polling LP CQ with status RETRY
> EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
> --------------------------------------------------------------------------
>
> The InfiniBand retry count between two MPI processes has been
> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
>
> My first thought was to increase the retry count, but it is already at
> maximum.
>
> I've checked connections between the two nodes, and they seem ok
>
> [root_at_tango090 ~]# ibv_rc_pingpong
> local address: LID 0x005f, QPN 0xe4045d, PSN 0xdd13f0
> remote address: LID 0x005d, QPN 0xfe0425, PSN 0xc43fe2
> 8192000 bytes in 0.07 seconds = 996.93 Mbit/sec
> 1000 iters in 0.07 seconds = 65.74 usec/iter
>
> How can I stop this happening in the future, without increasing the
> retry count?
>
> cheers,
>
> / Brett