On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> This *usually* indicates a physical / layer 0 problem in your IB
> fabric. You should do a diagnostic on your HCAs, cables, and switches.
> Increasing the timeout value should only be necessary on very large IB
> fabrics and/or very congested networks.
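As a starting point for that kind of diagnostic, the standard
OFED/infiniband-diags tools can be run from any node on the fabric.
A sketch only; tool names and options vary between OFED releases:

    # Local HCA state: ports should be Active with the expected width/speed
    ibstat

    # Sweep the fabric for ports whose error counters exceed thresholds;
    # SymbolErrors / LinkDowned / RcvErrors usually point at cables or ports
    ibcheckerrors

    # Full fabric diagnostic: topology, bad links, duplicate GUIDs, etc.
    ibdiagnet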
What is considered a very large IB fabric?
I assume that with just over 180 compute nodes,
our cluster does not fall into this category.
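For reference, the parameter is an exponent, not a time: the IB
transport retry timeout is 4.096 usec * 2^btl_openib_ib_timeout,
so the default of 10 is roughly 4 ms per retry and 15 is roughly
134 ms. Assuming Open MPI 1.3, raising it looks like this (the
application name and process count are placeholders):

    # Per-job: raise the openib BTL timeout from the default 10 to 15
    mpirun --mca btl_openib_ib_timeout 15 -np 180 ./app

    # Or system-wide, in $prefix/etc/openmpi-mca-params.conf:
    # btl_openib_ib_timeout = 15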
> On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:
> >I found several reports on the Open MPI users mailing list from users
> >who need to bump up the default value for btl_openib_ib_timeout.
> >We also have some applications on our cluster that have problems
> >unless we raise this value from the default of 10 to 15:
> >[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174
> >to: shc175
> >error polling LP CQ with status RETRY EXCEEDED ERROR status number
> >12 for
> >wr_id 250450816 opcode 11048 qp_idx 3
> >This is seen with Open MPI 1.3 and OpenFabrics 1.4.
> >Is this normal or is it an indicator of other problems, maybe
> >related to [...]?
> >Are there other parameters that need to be looked at too?
> >Thanks for any insight on this!
> >Jan Lindheim
> Jeff Squyres
> Cisco Systems