On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> This *usually* indicates a physical / layer 0 problem in your IB
> fabric. You should do a diagnostic on your HCAs, cables, and switches.
> Increasing the timeout value should only be necessary on very large IB
> fabrics and/or very congested networks.
What is considered a very large IB fabric?
I assume that with just over 180 compute nodes,
our cluster does not fall into this category.
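For anyone searching the archives later, here is a sketch of how we bump the timeout on our cluster (the application name, process count, and install path below are placeholders, not our actual setup):

```shell
# Raise the InfiniBand retry timeout for a single run.
# btl_openib_ib_timeout is an exponent: actual timeout = 4.096 us * 2^value,
# so the default of 10 is ~4.2 ms and 15 is ~134 ms.
mpirun --mca btl_openib_ib_timeout 15 -np 64 ./my_app

# Or set it system-wide in the MCA parameter file shipped with Open MPI
# (adjust the prefix to wherever Open MPI is installed):
echo "btl_openib_ib_timeout = 15" >> /opt/openmpi/etc/openmpi-mca-params.conf
```

The same value can also be exported per-user via the environment, e.g. `export OMPI_MCA_btl_openib_ib_timeout=15`, which avoids touching the system-wide file.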
> On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:
> >I found several reports on the openmpi users mailing list from users
> >who needed to bump up the default value for btl_openib_ib_timeout.
> >We also have some applications on our cluster that have problems
> >unless we raise this value from the default 10 to 15:
> >[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175
> >error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
> >wr_id 250450816 opcode 11048 qp_idx 3
> >This is seen with OpenMPI 1.3 and OpenFabrics 1.4.
> >Is this normal or is it an indicator of other problems, maybe
> >related to
> >Are there other parameters that need to be looked at too?
> >Thanks for any insight on this!
> >Jan Lindheim
> >users mailing list
> Jeff Squyres
> Cisco Systems