This *usually* indicates a physical-layer ("layer 0") problem in your IB
fabric. You should run diagnostics on your HCAs, cables, and switches.
Increasing the timeout value should only be necessary on very large IB
fabrics and/or very congested networks.
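
For a sense of scale: as far as I know, btl_openib_ib_timeout is plugged
into the standard InfiniBand local ACK timeout formula,
4.096 microseconds * 2^value, so each step doubles the wait. Here is a
rough Python sketch of that arithmetic (the helper name ack_timeout_ms is
made up for illustration; it is not part of Open MPI):

```python
# Assumption: the timeout follows the InfiniBand local ACK timeout
# formula, 4.096 microseconds * 2**value.
BASE_US = 4.096


def ack_timeout_ms(value: int) -> float:
    """Return the per-attempt ACK timeout in milliseconds (hypothetical helper)."""
    return BASE_US * (2 ** value) / 1000.0


# Compare the default (10) with the raised value (15) from the report.
for value in (10, 15):
    print(f"btl_openib_ib_timeout={value}: {ack_timeout_ms(value):.2f} ms per attempt")
```

So going from 10 to 15 makes each attempt wait 32 times longer, which can
mask a marginal cable or port rather than fix it. If you do need to raise
it, the value can be set at run time, e.g.
"mpirun --mca btl_openib_ib_timeout 15 ...".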

On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:

> I found several reports on the openmpi users mailing list from users,
> who need to bump up the default value for btl_openib_ib_timeout.
> We also have some applications on our cluster, that have problems,
> unless we set this value from the default 10 to 15:
> [[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175
> error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
> wr_id 250450816 opcode 11048 qp_idx 3
> This is seen with OpenMPI 1.3 and OpenFabrics 1.4.
> Is this normal or is it an indicator of other problems, maybe
> related to
> Are there other parameters that need to be looked at too?
> Thanks for any insight on this!
> Jan Lindheim