Open MPI User's Mailing List Archives

Subject: [OMPI users] openib RETRY EXCEEDED ERROR
From: Brett Pemberton (brett_at_[hidden])
Date: 2009-02-26 21:33:49


Hey,

I've had a couple of errors recently, of the form:

[[1176,1],0][btl_openib_component.c:2905:handle_wc] from
tango092.vpac.org to: tango090 error polling LP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

My first thought was to increase the retry count, but it is already at
maximum.
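
For context, here is a rough sketch of how long the sender waits before this error is raised. It assumes the formula given in the Open MPI help text for the openib BTL (per-attempt ACK timeout = 4.096 microseconds * 2^btl_openib_ib_timeout) and a retry count of 7. The exponent values in the loop are illustrative only, not read from my cluster, and the helper names are made up for this example. Real hardware may wait longer than the nominal figure.

```python
# Rough estimate of the InfiniBand retry window.
# Assumption: per-attempt ACK timeout = 4.096 us * 2**exponent, as described
# in the Open MPI help text for btl_openib_ib_timeout.
# Function names here are hypothetical helpers, not part of any library.

def ack_timeout_seconds(timeout_exponent: int) -> float:
    """Nominal local ACK timeout for one attempt, in seconds."""
    return 4.096e-6 * (2 ** timeout_exponent)


def retry_window_seconds(timeout_exponent: int, retry_count: int = 7) -> float:
    """Approximate total wait across all retries (ignores the initial send)."""
    return ack_timeout_seconds(timeout_exponent) * retry_count


if __name__ == "__main__":
    # Illustrative exponents only; substitute the value reported by ompi_info.
    for exponent in (10, 14, 20):
        per_attempt = ack_timeout_seconds(exponent)
        total = retry_window_seconds(exponent)
        print(f"exponent {exponent}: {per_attempt:.6f} s per attempt, "
              f"about {total:.4f} s over 7 retries")
```

With the retry count fixed at 7, the timeout exponent is the only term in this estimate that can still grow.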

I've checked the connection between the two nodes, and it seems OK:

[root_at_tango090 ~]# ibv_rc_pingpong
   local address: LID 0x005f, QPN 0xe4045d, PSN 0xdd13f0
   remote address: LID 0x005d, QPN 0xfe0425, PSN 0xc43fe2
8192000 bytes in 0.07 seconds = 996.93 Mbit/sec
1000 iters in 0.07 seconds = 65.74 usec/iter

How can I stop this happening in the future, without increasing the
retry count?

cheers,

        / Brett

-- 
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899