Subject: [MTT users] RETRY EXCEEDED ERROR
From: Rafael Folco (rfolco_at_[hidden])
Date: 2008-07-31 11:43:54


Hi,

I need some help, please.

I'm running a set of MTT tests on my cluster and I'm having issues with one
particular node.

[0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from
10.2.1.90 to: 10.2.1.50 error polling HP CQ with status RETRY EXCEEDED
ERROR status number 12 for wr_id 268870712 opcode 0

I am able to ping from 10.2.1.90 to 10.2.1.50, and they are visible to each
other on the network, just like the other nodes. I've already checked the
drivers and reinstalled Open MPI, but nothing changes...

On 10.2.1.90:
# ping 10.2.1.50
PING 10.2.1.50 (10.2.1.50) 56(84) bytes of data.
64 bytes from 10.2.1.50: icmp_seq=1 ttl=64 time=9.95 ms
64 bytes from 10.2.1.50: icmp_seq=2 ttl=64 time=0.076 ms
64 bytes from 10.2.1.50: icmp_seq=3 ttl=64 time=0.114 ms
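
For reference, ping only exercises the IP path between the two hosts, not the
verbs path that the openib BTL uses. A verbs-level check along these lines
(assuming the libibverbs example programs are installed on both nodes; the
command names come from that package, not from MTT) should show whether basic
RC connectivity works for exactly this pair:

On 10.2.1.50:
# ibv_rc_pingpong

On 10.2.1.90:
# ibv_rc_pingpong 10.2.1.50

If this hangs or errors out only between these two hosts, the problem is below
Open MPI.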

The cable connections are the same for every node, and all tests run fine
without 10.2.1.90. On the other hand, when I add 10.2.1.90 to the host list,
I get many failures.
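
For reference, the help text in the attached output below points at the fabric
itself, so the port state and error counters on 10.2.1.90 (and on 10.2.1.50)
are worth a look. A rough sketch, assuming the infiniband-diags utilities are
installed on the nodes:

# ibstat
# perfquery

ibstat should report the port as Active / LinkUp; non-zero SymbolErrorCounter,
LinkDownedCounter, or PortRcvErrors values from perfquery would point at a bad
cable, connector, or switch port on that link.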

Could someone please tell me why 10.2.1.90 doesn't like 10.2.1.50? Any clue?

I don't see any problems with any other combination of nodes. This is very,
very weird.

MTT Results Summary
hostname: p6ihopenhpc1-ib0
uname: Linux p6ihopenhpc1-ib0 2.6.16.60-0.21-ppc64 #1 SMP Tue May 6
12:41:02 UTC 2008 ppc64 ppc64 ppc64 GNU/Linux
who am i: root pts/3 Jul 31 13:31 (elm3b150:S.0)
+-------------+-----------------+------+------+----------+------+
| Phase       | Section         | Pass | Fail | Time out | Skip |
+-------------+-----------------+------+------+----------+------+
| MPI install | openmpi-1.2.5   | 1    | 0    | 0        | 0    |
| Test Build  | trivial         | 1    | 0    | 0        | 0    |
| Test Build  | ibm             | 1    | 0    | 0        | 0    |
| Test Build  | onesided        | 1    | 0    | 0        | 0    |
| Test Build  | mpicxx          | 1    | 0    | 0        | 0    |
| Test Build  | imb             | 1    | 0    | 0        | 0    |
| Test Build  | netpipe         | 1    | 0    | 0        | 0    |
| Test Run    | trivial         | 4    | 4    | 0        | 0    |
| Test Run    | ibm             | 59   | 120  | 0        | 3    |
| Test Run    | onesided        | 95   | 37   | 0        | 0    |
| Test Run    | mpicxx          | 0    | 1    | 0        | 0    |
| Test Run    | imb correctness | 0    | 1    | 0        | 0    |
| Test Run    | imb performance | 0    | 12   | 0        | 0    |
| Test Run    | netpipe         | 1    | 0    | 0        | 0    |
+-------------+-----------------+------+------+----------+------+

I also attached one of the errors here.

Thanks in advance,

Rafael

-- 
Rafael Folco
OpenHPC / Brazil Test Lead
IBM Linux Technology Center
E-Mail: rfolco_at_[hidden]

| command | mpirun --hostfile /tmp/ompi-core-testers/hosts.list -np 8 --mca btl |
| | openib,self --mca btl_openib_warn_default_gid_prefix 0 --prefix |
| | /usr/lib/mpi/gcc/openmpi collective/gather |
| duration | 0 seconds |
| exit_value | 143 |
| result_message | Failed; exit status: 143 |
| result_stdout | [0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from |
| | 10.2.1.90 to: 10.2.1.50 error polling HP CQ with status RETRY EXCEEDED ERROR |
| | status number 12 for wr_id 268870712 opcode 0 |
| | -------------------------------------------------------------------------- |
| | The InfiniBand retry count between two MPI processes has been |
| | exceeded. "Retry count" is defined in the InfiniBand spec 1.2 |
| | (section 12.7.38): |
| | |
| | The total number of times that the sender wishes the receiver to |
| | retry timeout, packet sequence, etc. errors before posting a |
| | completion error. |
| | |
| | This error typically means that there is something awry within the |
| | InfiniBand fabric itself. You should note the hosts on which this |
| | error has occurred; it has been observed that rebooting or removing a |
| | particular host from the job can sometimes resolve this issue. |
| | |
| | Two MCA parameters can be used to control Open MPI's behavior with |
| | respect to the retry count: |
| | |
| | * btl_openib_ib_retry_count - The number of times the sender will |
| | attempt to retry (defaulted to 7, the maximum value). |
| | |
| | * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted |
| | to 10). The actual timeout value used is calculated as: |
| | |
| | 4.096 microseconds * (2^btl_openib_ib_timeout) |
| | |
| | See the InfiniBand spec 1.2 (section 12.7.34) for more details. |
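
For reference, a possible workaround based on the MCA parameters described
above would be to raise btl_openib_ib_timeout on the mpirun command line (the
retry count already defaults to its maximum of 7); the value 20 here is only
an example:

# mpirun --hostfile /tmp/ompi-core-testers/hosts.list -np 8 \
      --mca btl openib,self --mca btl_openib_ib_timeout 20 \
      --mca btl_openib_warn_default_gid_prefix 0 \
      --prefix /usr/lib/mpi/gcc/openmpi collective/gather

With the formula above, the default of 10 gives 4.096 us * 2^10, roughly
4.2 ms per ACK timeout, while 20 gives 4.096 us * 2^20, roughly 4.3 s. As the
help text says, this only masks a fabric problem rather than fixing it.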