Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] RETRY EXCEEDED ERROR
From: Jan Lindheim (lindheim_at_[hidden])
Date: 2009-03-04 16:16:22


On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> This *usually* indicates a physical / layer 0 problem in your IB
> fabric. You should do a diagnostic on your HCAs, cables, and switches.
>
> Increasing the timeout value should only be necessary on very large IB
> fabrics and/or very congested networks.

Thanks Jeff!
What is considered a very large IB fabric?
I assume that with just over 180 compute nodes,
our cluster does not fall into this category.
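
For reference, here is a rough sketch of what the two values mean in
wall-clock terms. It assumes the usual InfiniBand Local ACK Timeout
encoding, 4.096 microseconds * 2^btl_openib_ib_timeout, and that the HCA
applies the value as given (real hardware may wait somewhat longer). The
function name ack_timeout_ms is only illustrative and is not part of
Open MPI:

```python
# Sketch: convert a btl_openib_ib_timeout exponent into a per-attempt
# ACK timeout, assuming the encoding 4.096 us * 2**exponent.
BASE_US = 4.096  # base interval in microseconds


def ack_timeout_ms(exponent: int) -> float:
    """Return the assumed ACK timeout in milliseconds for one attempt."""
    # The timeout field is 5 bits wide; 0 is excluded here because it
    # does not follow the formula (it means "no timeout").
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be in the range 1..31")
    return BASE_US * (2 ** exponent) / 1000.0


# Compare the default mentioned in this thread (10) with the raised value (15).
for value in (10, 15):
    print(f"btl_openib_ib_timeout={value}: {ack_timeout_ms(value):.3f} ms")
```

Under that assumption, going from 10 to 15 raises the per-attempt timeout
from about 4 ms to about 134 ms. The value can be set at run time with
"mpirun --mca btl_openib_ib_timeout 15 ...".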

Jan

>
>
> On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:
>
> >I found several reports on the Open MPI users mailing list from users
> >who needed to raise btl_openib_ib_timeout above its default value.
> >Some applications on our cluster also fail unless we raise this value
> >from the default of 10 to 15:
> >
> >[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175
> >error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
> >wr_id 250450816 opcode 11048 qp_idx 3
> >
> >This is seen with OpenMPI 1.3 and OpenFabrics 1.4.
> >
> >Is this normal, or is it an indicator of other problems, maybe related
> >to hardware?
> >Are there other parameters that we should look at as well?
> >
> >Thanks for any insight on this!
> >
> >Regards,
> >Jan Lindheim
> >_______________________________________________
> >users mailing list
> >users_at_[hidden]
> >http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>