
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Infiniband error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-11-12 10:45:11


It would be best if an IB vendor replies (hint hint!), but it is likely that you have some kind of hardware issue on that node (e.g., a bad or flaky HCA). You should probably run a full set of layer-0 diagnostics on your fabric to make sure it's clean.

I say this because back when Cisco was an IB vendor, when I ran into weird-o problems like this, they were almost always due to hardware issues (e.g., replace the HCA and then all was fine).

Consult your IB vendor's documentation on how to run layer-0 diagnostics.
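
As for what the codes mean: the "status number" in these messages appears to be the work completion status that libibverbs reports (enum ibv_wc_status in <infiniband/verbs.h>), so 9 would be IBV_WC_REM_INV_REQ_ERR and 5 would be IBV_WC_WR_FLUSH_ERR. The "vendor error" value is HCA-specific and needs the vendor's documentation. Below is a minimal Python sketch that decodes log lines of this shape. The message layout is assumed from the two lines quoted below, and the function name decode_wc_errors is hypothetical; it is not part of Open MPI.

```python
import re

# Values of enum ibv_wc_status from libibverbs' <infiniband/verbs.h>
# (first 14 entries only; anything else is reported as "unknown").
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Assumed message layout, taken from the two error lines in this thread.
WC_ERROR = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling \S+ CQ with status "
    r"(?P<text>.+?) status number (?P<status>\d+) for wr_id \d+ "
    r"opcode \d+ vendor error (?P<vendor>\d+)"
)


def decode_wc_errors(log_text):
    """Return one dict per completion error found in log_text (hypothetical helper)."""
    # Drop mail quoting ("> ") and re-join messages wrapped across lines.
    lines = [re.sub(r"^\s*>\s?", "", line) for line in log_text.splitlines()]
    flat = " ".join(" ".join(lines).split())
    events = []
    for m in WC_ERROR.finditer(flat):
        status = int(m.group("status"))
        events.append({
            "src": m.group("src"),
            "dst": m.group("dst"),
            "status": status,
            "status_name": IBV_WC_STATUS.get(status, "unknown"),
            "vendor_error": int(m.group("vendor")),  # meaning is vendor-specific
        })
    return events


SAMPLE = """\
[[20265,1],79][btl_openib_component.c:2951:handle_wc] from f05 to: f24
error polling LP CQ with status INVALID REQUEST ERROR status number 9
for wr_id 309134592 opcode 1 vendor error 138 qp_idx 2
[[20265,1],39][btl_openib_component.c:2951:handle_wc] from f24 to: f05
error polling LP CQ with status WORK REQUEST FLUSHED ERROR status
number 5 for wr_id 313731584 opcode 1 vendor error 249 qp_idx 2
"""

if __name__ == "__main__":
    for event in decode_wc_errors(SAMPLE):
        print(event)
```

A flush error (status 5) usually just means the queue pair had already gone into the error state, so the first non-flush error in a run is generally the one worth chasing.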

On Nov 4, 2010, at 7:39 PM, Ondrej Marsalek wrote:

> Dear all,
>
> I would like to ask for help with understanding an error message I get
> when communication using Open MPI 1.4.1 over Infiniband fails. After
> several hours of operation, communication with one particular node
> (f24) fails with something like:
>
> [[20265,1],79][btl_openib_component.c:2951:handle_wc] from f05 to: f24
> error polling LP CQ with status INVALID REQUEST ERROR status number 9
> for wr_id 309134592 opcode 1 vendor error 138 qp_idx 2
> [[20265,1],39][btl_openib_component.c:2951:handle_wc] from f24 to: f05
> error polling LP CQ with status WORK REQUEST FLUSHED ERROR status
> number 5 for wr_id 313731584 opcode 1 vendor error 249 qp_idx 2
>
> This is reproducible in the sense that it happens repeatedly, but so
> far I was not able to create a test case that would trigger the
> problem. It happens after hours of smooth operation. One of the nodes
> involved is always f24. When I leave it out of the job, I get a stable
> run with no trouble. Is this a hardware error or something else? Is
> there something I can do to try to locate the problem better? Where can I
> find what the error codes mean?
>
> Thanks,
> Ondrej Marsalek

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/