Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-04 16:02:06

This *usually* indicates a physical / layer 0 problem in your IB
fabric. You should run diagnostics on your HCAs, cables, and switches.
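
If you have many of these messages in your job output, tallying them per host pair can help narrow down which link to inspect first. The following is only a rough sketch: the function name count_failing_links is made up for illustration, and the regular expression is modeled on the single error line quoted below, so it is an assumption that other Open MPI versions format the message the same way.

```python
import re
from collections import Counter

# Pattern modeled on the one error line quoted in this thread (assumption:
# other Open MPI versions may word or order the message differently).
ERROR_RE = re.compile(
    r"from (?P<src>\S+)\s+to:\s+(?P<dst>\S+)\s+error polling \S+ CQ "
    r"with status RETRY EXCEEDED ERROR"
)


def count_failing_links(log_text):
    """Count RETRY EXCEEDED errors per unordered pair of hosts."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    pairs = Counter()
    for m in ERROR_RE.finditer(flat):
        pairs[tuple(sorted((m["src"], m["dst"])))] += 1
    return pairs


sample = """[btl_openib_component.c:2905:handle_wc] from shc174
to: shc175
error polling LP CQ with status RETRY EXCEEDED ERROR status number
12 for wr_id 250450816 opcode 11048 qp_idx 3"""

for (a, b), n in count_failing_links(sample).most_common():
    print(f"{a} <-> {b}: {n}")
```

A host that shows up in most of the failing pairs is a reasonable first candidate for a cable or HCA check; errors spread evenly across many hosts point more toward a switch or congestion.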

Increasing the timeout value should only be necessary on very large IB
fabrics and/or very congested networks.
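
For a sense of scale: the value is an exponent, not a time. The InfiniBand local ACK timeout is 4.096 microseconds * 2^btl_openib_ib_timeout, so going from 10 to 15 makes each timeout 32 times longer. A small sketch of the arithmetic (the helper name ib_ack_timeout_seconds is just for illustration):

```python
def ib_ack_timeout_seconds(exponent):
    """Local ACK timeout implied by btl_openib_ib_timeout = exponent."""
    # InfiniBand defines the timeout as 4.096 microseconds * 2^exponent.
    return 4.096e-6 * (2 ** exponent)


for value in (10, 15):
    ms = ib_ack_timeout_seconds(value) * 1e3
    print(f"btl_openib_ib_timeout={value}: {ms:.1f} ms")
# btl_openib_ib_timeout=10: 4.2 ms
# btl_openib_ib_timeout=15: 134.2 ms
```

The parameter itself is set per run on the mpirun command line, e.g. "mpirun --mca btl_openib_ib_timeout 15 ..." followed by your usual arguments.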

On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:

> I found several reports on the Open MPI users mailing list from users
> who needed to bump up the default value for btl_openib_ib_timeout.
> We also have some applications on our cluster that have problems
> unless we raise this value from the default of 10 to 15:
> [[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175
> error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
> wr_id 250450816 opcode 11048 qp_idx 3
> This is seen with OpenMPI 1.3 and OpenFabrics 1.4.
> Is this normal, or is it an indicator of other problems, maybe
> related to hardware?
> Are there other parameters that need to be looked at too?
> Thanks for any insight on this!
> Regards,
> Jan Lindheim

Jeff Squyres
Cisco Systems