Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] RETRY EXCEEDED ERROR status number 12
From: Pavel Shamis (Pasha) (pashash_at_[hidden])
Date: 2009-08-21 17:44:54


You can try the ibdiagnet tool:
http://linux.die.net/man/1/ibdiagnet

The tool is part of OFED (http://www.openfabrics.org/)

Pasha.

Prentice Bisbal wrote:
> Several jobs on my cluster just died with the error below.
>
> Are there any IB/Open MPI diagnostics I should use to diagnose, should I
> just reboot the nodes, or should I have the user who submitted these
> jobs just increase the retry count/timeout parameters?
>
>
> [0,1,6][../../../../../ompi/mca/btl/openib/btl_openib_component.c:1375:btl_openib_component_progress]
> from node14.aurora to: node40.aurora error polling HP CQ with status
> RETRY EXCEEDED ERROR status number 12 for wr_id 13606831800 opcode 11119
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
>
> The total number of times that the sender wishes the receiver to
> retry timeout, packet sequence, etc. errors before posting a
> completion error.
>
> This error typically means that there is something awry within the
> InfiniBand fabric itself. You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
>
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
>
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 10). The actual timeout value used is calculated as:
>
> 4.096 microseconds * (2^btl_openib_ib_timeout)
>
> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>
>
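To make the timeout formula in the quoted help text concrete, here is a minimal sketch that evaluates 4.096 microseconds * (2^btl_openib_ib_timeout) for a few settings. The function name and the sample values 14 and 20 are illustrative assumptions; only the formula and the default of 10 come from the error message above.

```python
# Base unit of the local ACK timeout formula from the quoted help text,
# in microseconds.
BASE_US = 4.096


def local_ack_timeout_us(ib_timeout: int) -> float:
    """Return 4.096 us * 2^ib_timeout, in microseconds.

    The function name is hypothetical; ib_timeout stands for the value
    of the btl_openib_ib_timeout MCA parameter.
    """
    if ib_timeout < 0:
        raise ValueError("ib_timeout must be a non-negative integer")
    return BASE_US * (2 ** ib_timeout)


if __name__ == "__main__":
    # 10 is the default named in the error text; 14 and 20 are arbitrary
    # larger values chosen only to show how quickly the timeout grows.
    for value in (10, 14, 20):
        micros = local_ack_timeout_us(value)
        print(f"btl_openib_ib_timeout={value}: "
              f"{micros:.3f} us ({micros / 1000:.3f} ms)")
```

With the default of 10 the timeout comes to about 4.19 ms, and each increment of the parameter doubles it. If you decide to raise the value, MCA parameters can generally be passed on the mpirun command line in the form `--mca btl_openib_ib_timeout <value>`. A longer timeout only papers over the problem if the fabric itself is faulty, so it is worth running the diagnostics first.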