Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] RETRY EXCEEDED ERROR
From: Jan Lindheim (lindheim_at_[hidden])
Date: 2009-03-04 16:45:56


On Wed, Mar 04, 2009 at 04:34:49PM -0500, Jeff Squyres wrote:
> On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote:
>
> >On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> >> This *usually* indicates a physical / layer 0 problem in your IB
> >> fabric. You should do a diagnostic on your HCAs, cables, and
> >> switches.
> >>
> >> Increasing the timeout value should only be necessary on very
> >> large IB fabrics and/or very congested networks.
> >
> >Thanks Jeff!
> >What is considered a very large IB fabric?
> >I assume that with just over 180 compute nodes,
> >our cluster does not fall into this category.
> >
>
> I was a little misleading in my note -- I should clarify. It's really
> congestion that matters, not the size of the fabric. Congestion is
> potentially more likely to happen in larger fabrics, since packets may
> have to flow through more switches, there are likely more apps running
> on the cluster, etc. But it's all very application/cluster-specific;
> only you can know if your fabric is heavily congested based on what
> you run on it, etc.
>
> --
> Jeff Squyres
> Cisco Systems
>

Thanks again Jeff!
Time to dig up the diagnostic tools and look at port statistics.
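
As a rough starting point for the port statistics, here is a minimal sketch
in Python. It assumes a Linux node that exposes per-port counters under
/sys/class/infiniband/<device>/ports/<port>/counters/. The function name and
the choice of counters are only illustrative, not an official list, and the
script sees only the local HCA's ports, not the switches.

```python
from pathlib import Path

# Counters that often point at a bad cable, port, or HCA.
# This selection is an assumption for illustration, not an exhaustive list.
ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
)


def read_error_counters(root="/sys/class/infiniband"):
    """Return {(device, port): {counter: value}} for nonzero error counters."""
    results = {}
    for counters_dir in sorted(Path(root).glob("*/ports/*/counters")):
        # Path layout: <root>/<device>/ports/<port>/counters
        device = counters_dir.parts[-4]
        port = counters_dir.parts[-2]
        nonzero = {}
        for name in ERROR_COUNTERS:
            try:
                value = int((counters_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this device; skip it.
                continue
            if value:
                nonzero[name] = value
        if nonzero:
            results[(device, port)] = nonzero
    return results


if __name__ == "__main__":
    for (device, port), counters in read_error_counters().items():
        print(f"{device} port {port}: {counters}")
```

Running it on each compute node and comparing the output should show whether
the errors cluster on a few ports, which would fit a cable or HCA problem
better than general congestion.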

Jan