I would like to ask for help understanding an error message I get
when communication over InfiniBand with Open MPI 1.4.1 fails. After
several hours of operation, communication with one particular node
(f24) fails with messages like:
[[20265,1],79][btl_openib_component.c:2951:handle_wc] from f05 to: f24
error polling LP CQ with status INVALID REQUEST ERROR status number 9
for wr_id 309134592 opcode 1 vendor error 138 qp_idx 2
[[20265,1],39][btl_openib_component.c:2951:handle_wc] from f24 to: f05
error polling LP CQ with status WORK REQUEST FLUSHED ERROR status
number 5 for wr_id 313731584 opcode 1 vendor error 249 qp_idx 2
This is reproducible in the sense that it happens repeatedly, but so
far I have not been able to create a test case that triggers the
problem. It happens after hours of smooth operation. One of the nodes
involved is always f24. When I leave it out of the job, I get a stable
run with no trouble. Is this a hardware error or something else? Is
there something I can do to try to locate the problem better? Where
can I find what the error codes mean?
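
In case it helps with narrowing this down, here is a minimal sketch
(Python, not part of Open MPI) of how messages like the ones above could
be parsed and the hosts involved tallied across a long job log. The
names parse_wc_errors and hosts_by_frequency are hypothetical, and the
regular expression assumes exactly the message layout quoted above;
other Open MPI versions may print something different.

```python
import re
from collections import Counter

# Matches the openib BTL completion-error messages quoted above.
# Assumption: the layout is the one printed by Open MPI 1.4.1.
WC_ERROR = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+)\s+"
    r"error polling (?P<cq>\w+) CQ with status (?P<status>.+?)\s+"
    r"status number (?P<num>\d+)\s+for wr_id (?P<wr_id>\d+)\s+"
    r"opcode (?P<opcode>\d+)\s+vendor error (?P<vendor>\d+)\s+"
    r"qp_idx (?P<qp_idx>\d+)"
)

def parse_wc_errors(text):
    """Return one dict of string fields per error message found in text."""
    # Long messages get wrapped in mail and terminals, so collapse
    # all runs of whitespace into single spaces before matching.
    flat = " ".join(text.split())
    return [m.groupdict() for m in WC_ERROR.finditer(flat)]

def hosts_by_frequency(errors):
    """Count how many error messages each host takes part in."""
    counts = Counter()
    for e in errors:
        # A set, so a host is counted once per message even if src == dst.
        counts.update({e["src"], e["dst"]})
    return counts.most_common()

# The two messages from this mail, wrapped as they appear above.
SAMPLE = """
[[20265,1],79][btl_openib_component.c:2951:handle_wc] from f05 to: f24
error polling LP CQ with status INVALID REQUEST ERROR status number 9
for wr_id 309134592 opcode 1 vendor error 138 qp_idx 2
[[20265,1],39][btl_openib_component.c:2951:handle_wc] from f24 to: f05
error polling LP CQ with status WORK REQUEST FLUSHED ERROR status
number 5 for wr_id 313731584 opcode 1 vendor error 249 qp_idx 2
"""

if __name__ == "__main__":
    for host, n in hosts_by_frequency(parse_wc_errors(SAMPLE)):
        print(host, n)
```

Run over the full output of several failed jobs, a tally like this
would show whether f24 really is the only node common to every failure,
or whether a second host or a particular qp_idx also recurs.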