From: Rafael Folco (rfolco_at_[hidden])
Date: 2008-07-31 11:43:54


I need some help, please.

I'm running a set of MTT tests on my cluster and I have issues with one
particular node.

[0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from to: error polling HP CQ with status RETRY EXCEEDED
ERROR status number 12 for wr_id 268870712 opcode 0

I am able to ping from to, and they are visible to
each other on the network, just like the other nodes. I've already
checked the drivers and reinstalled Open MPI, but nothing changes.

# ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=9.95 ms
64 bytes from icmp_seq=2 ttl=64 time=0.076 ms
64 bytes from icmp_seq=3 ttl=64 time=0.114 ms

The cable connections are the same to every node and all tests run fine
without On the other hand, when I add to the
hostlist, I get many failures.

Please, could someone tell me why doesn't like ? Any

I don't see any problems with other combinations of nodes. This is very
weird.

MTT Results Summary
hostname: p6ihopenhpc1-ib0
uname: Linux p6ihopenhpc1-ib0 #1 SMP Tue May 6
12:41:02 UTC 2008 ppc64 ppc64 ppc64 GNU/Linux
who am i: root pts/3 Jul 31 13:31 (elm3b150:S.0)
| Phase | Section | Pass | Fail | Time out | Skip |
| MPI install | openmpi-1.2.5 | 1 | 0 | 0 | 0 |
| Test Build | trivial | 1 | 0 | 0 | 0 |
| Test Build | ibm | 1 | 0 | 0 | 0 |
| Test Build | onesided | 1 | 0 | 0 | 0 |
| Test Build | mpicxx | 1 | 0 | 0 | 0 |
| Test Build | imb | 1 | 0 | 0 | 0 |
| Test Build | netpipe | 1 | 0 | 0 | 0 |
| Test Run | trivial | 4 | 4 | 0 | 0 |
| Test Run | ibm | 59 | 120 | 0 | 3 |
| Test Run | onesided | 95 | 37 | 0 | 0 |
| Test Run | mpicxx | 0 | 1 | 0 | 0 |
| Test Run | imb correctness | 0 | 1 | 0 | 0 |
| Test Run | imb performance | 0 | 12 | 0 | 0 |
| Test Run | netpipe | 1 | 0 | 0 | 0 |

I also attached one of the errors here.

Thanks in advance,


Rafael Folco
OpenHPC / Brazil Test Lead
IBM Linux Technology Center
E-Mail: rfolco_at_[hidden]

| command | mpirun --hostfile /tmp/ompi-core-testers/hosts.list -np 8 --mca btl |
| | openib,self --mca btl_openib_warn_default_gid_prefix 0 --prefix |
| | /usr/lib/mpi/gcc/openmpi collective/gather |
| duration | 0 seconds |
| exit_value | 143 |
| result_message | Failed; exit status: 143 |
| result_stdout | [0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from |
| | to: error polling HP CQ with status RETRY EXCEEDED ERROR |
| | status number 12 for wr_id 268870712 opcode 0 |
| | -------------------------------------------------------------------------- |
| | The InfiniBand retry count between two MPI processes has been |
| | exceeded. "Retry count" is defined in the InfiniBand spec 1.2 |
| | (section 12.7.38): |
| | |
| | The total number of times that the sender wishes the receiver to |
| | retry timeout, packet sequence, etc. errors before posting a |
| | completion error. |
| | |
| | This error typically means that there is something awry within the |
| | InfiniBand fabric itself. You should note the hosts on which this |
| | error has occurred; it has been observed that rebooting or removing a |
| | particular host from the job can sometimes resolve this issue. |
| | |
| | Two MCA parameters can be used to control Open MPI's behavior with |
| | respect to the retry count: |
| | |
| | * btl_openib_ib_retry_count - The number of times the sender will |
| | attempt to retry (defaulted to 7, the maximum value). |
| | |
| | * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted |
| | to 10). The actual timeout value used is calculated as: |
| | |
| | 4.096 microseconds * (2^btl_openib_ib_timeout) |
| | |
| | See the InfiniBand spec 1.2 (section 12.7.34) for more details. |
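
To put the timeout formula from the help text above into concrete numbers, here is a minimal Python sketch. The helper name `ack_timeout_us` is made up for illustration and is not part of Open MPI; it only evaluates the quoted expression 4.096 microseconds * (2^btl_openib_ib_timeout). With the default value of 10, that works out to roughly 4.2 ms per attempt.

```python
def ack_timeout_us(ib_timeout: int) -> float:
    """Evaluate the formula quoted in the Open MPI help text:
    4.096 microseconds * (2 ^ btl_openib_ib_timeout).

    Hypothetical helper for illustration only, not an Open MPI API.
    """
    if ib_timeout < 0:
        raise ValueError("btl_openib_ib_timeout must be non-negative")
    return 4.096 * (2 ** ib_timeout)


# Show how quickly the timeout grows as the parameter is raised.
# 10 is the default mentioned in the help text; the other values are
# arbitrary examples, not recommendations.
for value in (10, 14, 20):
    timeout_ms = ack_timeout_us(value) / 1000.0
    print(f"btl_openib_ib_timeout={value}: {timeout_ms:.3f} ms")
```

Since each increment doubles the timeout, a small change in the parameter has a large effect. Whether raising it helps here depends on what is actually wrong in the fabric, which the output above does not tell us.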