Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problems with "error polling LP CQ with status RNR"
From: Åke Sandgren (ake.sandgren_at_[hidden])
Date: 2009-05-14 09:37:45


On Thu, 2009-05-14 at 09:24 -0400, Jeff Squyres wrote:
> On May 13, 2009, at 4:55 PM, Åke Sandgren wrote:
>
> > I'm having a problem with getting "error polling LP CQ with status
> > RNR..." on an otherwise completely empty system.
> > There are no errors visible in the error counters in any of the HCAs,
> > switches, or anywhere else.
> >
> > I'm running OMPI 1.3.2 built with PathScale 3.2.
> >
> > If I add -mca btl 'ofud,self,sm', the same code works OK.
> >
>
> Interesting. I have only done very limited testing with ofud; are you
> saying that you get these errors if you "--mca btl openib,sm,self"?

I think I have tested it, but at the moment I'm not sure. I will do more
tests later.
(Busy doing firmware upgrades...)

> > It usually only shows up on runs with nodes=16:ppn=8 or higher, i.e.
> > 8x8 works OK.
> >
> > This might very well be a PathScale problem, since the problem goes
> > away when running with the debug version of OMPI 1.3.2.
> >
> > The complete error is:
> > error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR
> > status number 13 for wr_id 465284992 opcode -1 vendor error 135 qp_idx 0
> >
> > Any ideas as to where in the OMPI code I should start reducing
> > optimization levels to pinpoint this?
> >
>
>
> Do you have a simple reproducer test case, perchance?

Unfortunately, no. I have only seen this reproducibly on large jobs.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake_at_[hidden]   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se