
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] error polling LP CQ with status RETRY EXCEEDED ERROR
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-30 21:13:34


On Mar 27, 2009, at 11:22 AM, Gary Draving wrote:

> Thanks for the advice, we tried "-mca btl_openib_ib_min_rnr_timer 25
> -mca btl_openib_ib_timeout 20" but we are still getting errors as we
> increase the Ns value in HPL.dat into the thousands. Is it ok to just
> add these values to .openmpi/mca-params.conf for the user running the
> test, or should we add these settings to each node in
> /usr/local/etc/openmpi-mca-params.conf?
>

It would be better to put them in the /usr/local/... file so that all
your users get those values without needing to do anything.
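
For example, a minimal sketch of what that file could contain (assuming
Open MPI is installed under /usr/local, so the file is
/usr/local/etc/openmpi-mca-params.conf; the format is one
"parameter = value" per line, and "#" starts a comment):

  # Work around the too-small default openib retry timers
  btl_openib_ib_min_rnr_timer = 25
  btl_openib_ib_timeout = 20

With btl_openib_ib_timeout = 20, the local ACK timeout works out to
4.096 microseconds * 2^20, or roughly 4.3 seconds (see the formula in
the help text quoted further down in this message).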

> The OpenFabrics stack has reported a network error event. Open MPI
> will try to continue, but your job may end up failing.
>
> Local host: compute-0-8.local
> MPI process PID: 30544
> Error number: 10 (IBV_EVENT_PORT_ERR)
>
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
>

This is different from your prior error -- it may indicate a problem
with your IB fabric itself. As such, I think increasing the timer
values fixed the RETRY EXCEEDED problem, but then this [new] error
showed up.
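
If you want to sanity-check the fabric before involving the admin, a
rough sketch (assuming the standard OFED diagnostic tools are installed
on the nodes; adjust hostnames for your cluster):

  # Check the HCA / port state on the host that reported the event
  ssh compute-0-8 ibstat

  # Sweep the whole fabric for bad or flapping links
  ibdiagnet

A port that bounces between ACTIVE and DOWN during the run would be
consistent with an IBV_EVENT_PORT_ERR like the one above.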

>
>
>
> Ralph Castain wrote:
> > The default retry values are wrong and will be corrected in the next
> > OMPI release. For now, try running with:
> >
> > -mca btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20
> >
> > Should work.
> > Ralph
> >
> > On Mar 26, 2009, at 2:16 PM, Gary Draving wrote:
> >
> >> Hi Everyone,
> >>
> >> I'm doing some performance testing using HPL with TCP turned off.
> >> My HPL.dat file is included at the end of this message. It seems
> >> to work well for lower Ns values, but as I increase that value it
> >> inevitably fails with
> >> "[[13535,1],169][btl_openib_component.c:2905:handle_wc] from
> >> compute-0-0.local to: compute-0-8 error polling LP CQ with status
> >> RETRY EXCEEDED ERROR status number 12 for wr_id 209907960 opcode 0
> >> qp_idx 3"
> >>
> >> The machines in this test are all dual core quads with built-in
> >> Mellanox cards, for a total of 320 processors.
> >>
> >> It seems like once I reach a certain number of "Ns" the InfiniBand
> >> starts having problems. Increasing "btl_openib_ib_retry_count" and
> >> "btl_openib_ib_timeout" to the max allowed us to get from 4096 to
> >> 8192 Ns, but the error came back at around 8192.
> >>
> >> If anyone has some ideas on this problem I would be very interested.
> >> Thanks
> >>
> >> ((((((((((((((((((HPL.dat file being used )))))))))))))))))))
> >>
> >> HPLinpack benchmark input file
> >> Innovative Computing Laboratory, University of Tennessee
> >> HPL.out output file name (if any)
> >> 6 device out (6=stdout,7=stderr,file)
> >> 1 # of problems sizes (N)
> >> 8192 Ns
> >> 1 # of NBs
> >> 256 NBs
> >> 0 PMAP process mapping (0=Row-,1=Column-major)
> >> 1 # of process grids (P x Q)
> >> 19 Ps
> >> 19 Qs
> >> (defaults for rest)
> >>
> >> (((((((((((((((((( Full error printout ))))))))))))))))))
> >>
> >> [[13535,1],169][btl_openib_component.c:2905:handle_wc] from
> >> compute-0-0.local to: compute-0-8 error polling LP CQ with status
> >> RETRY EXCEEDED ERROR status number 12 for wr_id 209907960 opcode 0
> >> qp_idx 3
> >>
> >> --------------------------------------------------------------------------
> >>
> >> The InfiniBand retry count between two MPI processes has been
> >> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> >> (section 12.7.38):
> >>
> >> The total number of times that the sender wishes the receiver to
> >> retry timeout, packet sequence, etc. errors before posting a
> >> completion error.
> >>
> >> This error typically means that there is something awry within the
> >> InfiniBand fabric itself. You should note the hosts on which this
> >> error has occurred; it has been observed that rebooting or removing
> >> a particular host from the job can sometimes resolve this issue.
> >> Two MCA parameters can be used to control Open MPI's behavior with
> >> respect to the retry count:
> >>
> >> * btl_openib_ib_retry_count - The number of times the sender will
> >> attempt to retry (defaulted to 7, the maximum value).
> >> * btl_openib_ib_timeout - The local ACK timeout parameter
> >> (defaulted to 10). The actual timeout value used is calculated as:
> >>
> >> 4.096 microseconds * (2^btl_openib_ib_timeout)
> >>
> >> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> >>
> >> Below is some information about the host that raised the error and
> >> the peer to which it was connected:
> >>
> >> Local host: compute-0-0.local
> >> Local device: mthca0
> >> Peer host: compute-0-8
> >>
> >> You may need to consult with your system administrator to get this
> >> problem fixed.
> >>
> >> --------------------------------------------------------------------------
> >>
> >>
> >> --------------------------------------------------------------------------
> >>
> >> mpirun has exited due to process rank 169 with PID 26725 on
> >> node compute-0-0 exiting without calling "finalize". This may
> >> have caused other processes in the application to be
> >> terminated by signals sent by mpirun (as reported here).
> >>
> >> --------------------------------------------------------------------------
> >>

-- 
Jeff Squyres
Cisco Systems