
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] error polling LP CQ with status RETRY EXCEEDED ERROR
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-26 16:33:33


The default retry values are wrong and will be corrected in the next
OMPI release. For now, try running with:

-mca btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20

Should work.
Ralph
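
For reference, a complete invocation with those settings might look something
like the following (the process count, hostfile, and xhpl path here are only
placeholders, and "openib,self,sm" is just one way of keeping the TCP BTL out
of the picture):

  mpirun -np 320 --hostfile myhosts \
      -mca btl openib,self,sm \
      -mca btl_openib_ib_min_rnr_timer 25 \
      -mca btl_openib_ib_timeout 20 \
      ./xhpl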

On Mar 26, 2009, at 2:16 PM, Gary Draving wrote:

> Hi Everyone,
>
> I'm doing some performance testing using HPL with TCP turned off.
> My HPL.dat file is included below.
> It seems to work well for lower Ns values, but as I increase that
> value it inevitably fails with "[[13535,1],169]
> [btl_openib_component.c:2905:handle_wc] from compute-0-0.local to:
> compute-0-8 error polling LP CQ with status RETRY EXCEEDED ERROR
> status number 12 for wr_id 209907960 opcode 0 qp_idx 3"
>
> The machines in this test are all dual-core quads with built-in
> Mellanox cards, for a total of 320 processors.
>
> It seems like once I reach a certain number of "Ns" the InfiniBand
> starts having problems. Increasing "btl_openib_ib_retry_count" and
> "btl_openib_ib_timeout" to the max allowed us to get from 4096 to
> 8192 Ns, but the error came back at around 8192.
>
> If anyone has some ideas on this problem I would be very interested.
> Thanks
>
> (((((((((((((((((( HPL.dat file being used ))))))))))))))))))
>
> HPLinpack benchmark input file
> Innovative Computing Laboratory, University of Tennessee
> HPL.out output file name (if any)
> 6 device out (6=stdout,7=stderr,file)
> 1 # of problems sizes (N)
> 8192 Ns
> 1 # of NBs
> 256 NBs
> 0 PMAP process mapping (0=Row-,1=Column-major)
> 1 # of process grids (P x Q)
> 19 Ps
> 19 Qs
> (defaults for rest)
>
> (((((((((((((((((( Full error printout ))))))))))))))))))
>
> [[13535,1],169][btl_openib_component.c:2905:handle_wc] from
> compute-0-0.local to: compute-0-8 error polling LP CQ with status
> RETRY EXCEEDED ERROR status number 12 for wr_id 209907960 opcode 0
> qp_idx 3
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
>
> The total number of times that the sender wishes the receiver to
> retry timeout, packet sequence, etc. errors before posting a
> completion error.
>
> This error typically means that there is something awry within the
> InfiniBand fabric itself. You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 10). The actual timeout value used is calculated as:
>
> 4.096 microseconds * (2^btl_openib_ib_timeout)
>
> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
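
Plugging the values from this thread into that formula: the default
btl_openib_ib_timeout of 10 works out to 4.096 usec * 2^10, roughly 4.2 ms
per retry attempt, while the value of 20 suggested above works out to
4.096 usec * 2^20, roughly 4.3 seconds.
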
>
> Below is some information about the host that raised the error and the
> peer to which it was connected:
>
> Local host: compute-0-0.local
> Local device: mthca0
> Peer host: compute-0-8
>
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 169 with PID 26725 on
> node compute-0-0 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users