On Wed, Mar 04, 2009 at 04:34:49PM -0500, Jeff Squyres wrote:
> On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote:
> >On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> >> This *usually* indicates a physical / layer 0 problem in your IB
> >> fabric. You should do a diagnostic on your HCAs, cables, and
> >> Increasing the timeout value should only be necessary on very large IB
> >> fabrics and/or very congested networks.
> >Thanks Jeff!
> >What is considered to be very large IB fabrics?
> >I assume that with just over 180 compute nodes,
> >our cluster does not fall into this category.
> I was a little misleading in my note -- I should clarify. It's really
> congestion that matters, not the size of the fabric. Congestion is
> potentially more likely to happen in larger fabrics, since packets may
> have to flow through more switches, there's likely more apps running
> on the cluster, etc. But it's all very application/cluster-specific;
> only you can know if your fabric is heavily congested based on what
> you run on it, etc.
> Jeff Squyres
> Cisco Systems
Thanks again Jeff!
Time to dig up diagnostic tools and look at port statistics.
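
For a first look at port statistics, something like the following rough sketch might do. It assumes the nodes run Linux and expose per-port counters under /sys/class/infiniband/<device>/ports/<port>/counters/; the choice of counters to check and the function name read_port_errors are my own, not anything from Open MPI or the IB tools. It only reports counters that are nonzero, since those are the ones that might hint at a bad cable or HCA.

```python
from pathlib import Path

# Counters that often point at physical-layer trouble. This selection is an
# assumption for illustration; adjust it to whatever your HCAs expose.
ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
)


def read_port_errors(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port): {counter: value}} for nonzero error counters."""
    results = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No IB devices visible on this host (or not Linux).
        return results
    for counters_dir in sorted(root.glob("*/ports/*/counters")):
        device = counters_dir.parts[-4]
        port = counters_dir.parts[-2]
        nonzero = {}
        for name in ERROR_COUNTERS:
            try:
                value = int((counters_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this device; skip it.
                continue
            if value:
                nonzero[name] = value
        if nonzero:
            results[(device, port)] = nonzero
    return results


if __name__ == "__main__":
    for (device, port), counters in read_port_errors().items():
        print(device, port, counters)
```

Run on each compute node, this prints one line per port with nonzero error counters. It only sees the local HCA ports, so switch ports would still need the fabric diagnostic tools.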