
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] error polling LP CQ with status RETRY EXCEEDED ERROR
From: Gary Draving (gbd2_at_[hidden])
Date: 2009-03-27 11:22:33


Thanks for the advice. We tried "-mca btl_openib_ib_min_rnr_timer 25
-mca btl_openib_ib_timeout 20" but we are still getting errors as we
increase the Ns value in HPL.dat into the thousands. Is it OK to just
add these values to .openmpi/mca-params.conf for the user running the
test, or should we add these settings to each node in
/usr/local/etc/openmpi-mca-params.conf?
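
For reference, either location can work: Open MPI reads MCA parameters from
the mpirun command line, from OMPI_MCA_* environment variables, from the
per-user file $HOME/.openmpi/mca-params.conf, and from the system-wide
openmpi-mca-params.conf in the installation's etc directory. A minimal
sketch of the file contents, assuming the usual one-parameter-per-line
"name = value" format:

   # ~/.openmpi/mca-params.conf (or /usr/local/etc/openmpi-mca-params.conf)
   btl_openib_ib_min_rnr_timer = 25
   btl_openib_ib_timeout = 20

If home directories are shared across the cluster, the per-user file only
needs to be created once; the /usr/local/etc file would have to exist on
every node unless that path is shared as well. The command-line equivalent
would look something like this (the explicit BTL list and the xhpl binary
name are just examples, not taken from this thread):

   mpirun --mca btl openib,sm,self \
          --mca btl_openib_ib_min_rnr_timer 25 \
          --mca btl_openib_ib_timeout 20 ./xhpl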

The OpenFabrics stack has reported a network error event. Open MPI
will try to continue, but your job may end up failing.

  Local host: compute-0-8.local
  MPI process PID: 30544
  Error number: 10 (IBV_EVENT_PORT_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.

Ralph Castain wrote:
> The default retry values are wrong and will be corrected in the next
> OMPI release. For now, try running with:
>
> -mca btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20
>
> Should work.
> Ralph
>
> On Mar 26, 2009, at 2:16 PM, Gary Draving wrote:
>
>> Hi Everyone,
>>
>> I'm doing some performance testing using HPL with TCP turned off; my
>> HPL.dat file is included at the end of this message. It seems to work
>> well for lower Ns values, but as I increase that value it inevitably
>> fails with
>> "[[13535,1],169][btl_openib_component.c:2905:handle_wc] from
>> compute-0-0.local to: compute-0-8 error polling LP CQ with status
>> RETRY EXCEEDED ERROR status number 12 for wr_id 209907960 opcode 0
>> qp_idx 3"
>>
>> The machines in this test are all dual-core quads with built-in
>> Mellanox cards, for a total of 320 processors.
>>
>> It seems like once I reach a certain number of "Ns" the InfiniBand
>> starts having problems. Increasing "btl_openib_ib_retry_count" and
>> "btl_openib_ib_timeout" to the max allowed us to get from 4096 to
>> 8192 Ns, but the error came back at around 8192.
>>
>> If anyone has some ideas on this problem I would be very interested.
>> Thanks
>>
>> (((((((((((((((((( HPL.dat file being used ))))))))))))))))))
>>
>> HPLinpack benchmark input file
>> Innovative Computing Laboratory, University of Tennessee
>> HPL.out output file name (if any)
>> 6 device out (6=stdout,7=stderr,file)
>> 1 # of problems sizes (N)
>> 8192 Ns
>> 1 # of NBs
>> 256 NBs
>> 0 PMAP process mapping (0=Row-,1=Column-major)
>> 1 # of process grids (P x Q)
>> 19 Ps
>> 19 Qs
>> (defaults for rest)
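
For context, going by the standard HPL.dat semantics: Ns is the order of
the dense linear system, so Ns = 8192 means an 8192 x 8192 double-precision
matrix (8 * 8192^2 bytes, roughly 512 MB for the whole problem); NBs = 256
is the blocking factor, giving 8192 / 256 = 32 block rows and columns; and
Ps x Qs = 19 x 19 lays those blocks out over a 361-rank process grid.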
>>
>> (((((((((((((((((( Full error printout ))))))))))))))))))
>>
>> [[13535,1],169][btl_openib_component.c:2905:handle_wc] from
>> compute-0-0.local to: compute-0-8 error polling LP CQ with status
>> RETRY EXCEEDED ERROR status number 12 for wr_id 209907960 opcode 0
>> qp_idx 3
>> --------------------------------------------------------------------------
>>
>> The InfiniBand retry count between two MPI processes has been
>> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
>> (section 12.7.38):
>>
>> The total number of times that the sender wishes the receiver to
>> retry timeout, packet sequence, etc. errors before posting a
>> completion error.
>>
>> This error typically means that there is something awry within the
>> InfiniBand fabric itself. You should note the hosts on which this
>> error has occurred; it has been observed that rebooting or removing a
>> particular host from the job can sometimes resolve this issue.
>> Two MCA parameters can be used to control Open MPI's behavior with
>> respect to the retry count:
>>
>> * btl_openib_ib_retry_count - The number of times the sender will
>> attempt to retry (defaulted to 7, the maximum value).
>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>> to 10). The actual timeout value used is calculated as:
>>
>> 4.096 microseconds * (2^btl_openib_ib_timeout)
>>
>> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
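
Plugging the two values discussed in this thread into that formula gives a
rough feel for the scale:

   4.096 usec * 2^10  =  about 4.2 milliseconds   (default btl_openib_ib_timeout = 10)
   4.096 usec * 2^20  =  about 4.3 seconds        (suggested btl_openib_ib_timeout = 20)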
>>
>> Below is some information about the host that raised the error and the
>> peer to which it was connected:
>>
>> Local host: compute-0-0.local
>> Local device: mthca0
>> Peer host: compute-0-8
>>
>> You may need to consult with your system administrator to get this
>> problem fixed.
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>>
>> mpirun has exited due to process rank 169 with PID 26725 on
>> node compute-0-0 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>>