I'm trying to run a 64-way MPI benchmark on my system. I
*always* get the following error, and I'm wondering how to
debug the problem node. I cannot reproduce the problem
with a smaller number of nodes.
snip...
[p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,21] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,24] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,25] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
...
It looks like I have well over 128 lines of similar output. A quick
eyeball of the output seems to indicate that about half of all nodes
are reporting this problem.
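
Rather than eyeballing, the output could be tallied per reporting host and
per peer. The following is only a minimal sketch: it assumes each message
arrives on a single line in the format shown above, and it assumes the
second bracketed triple names the peer that could not be reached. The names
LINE_RE, SAMPLE, and tally_failures are made up for illustration. On Linux,
errno 113 is EHOSTUNREACH ("No route to host").

```python
import errno
import re
from collections import Counter

# Assumed line format (one message per line), e.g.:
# [p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
LINE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[(?P<src>[\d,]+)\]-\[(?P<dst>[\d,]+)\]\s+"
    r"mca_oob_tcp_peer_complete_connect:\s*"
    r"connect\(\) failed with errno=(?P<errno>\d+)"
)

# Hypothetical sample input, copied from the messages quoted above.
SAMPLE = """\
[p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,21] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,24] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,25] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
"""


def tally_failures(text):
    """Count failed connects per reporting host, per peer, and per errno."""
    by_host = Counter()
    by_peer = Counter()
    by_errno = Counter()
    for match in LINE_RE.finditer(text):
        by_host[match.group("host")] += 1
        by_peer[match.group("dst")] += 1  # assumed to be the unreachable peer
        by_errno[int(match.group("errno"))] += 1
    return by_host, by_peer, by_errno


if __name__ == "__main__":
    by_host, by_peer, by_errno = tally_failures(SAMPLE)
    for code, count in by_errno.items():
        # errno names are platform dependent; 113 is EHOSTUNREACH on Linux.
        name = errno.errorcode.get(code, "unknown")
        print(f"errno {code} ({name}): {count} failures")
    print("reporting hosts:", by_host.most_common())
    print("peers:", by_peer.most_common())
```

Feeding the full saved output of the run to tally_failures in place of SAMPLE
would show whether the failures cluster on a few hosts or peers, which may
point at the node (or the TCP interface on it) to look at first.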
I have checked the error counters on my IB switch and I
have 0 new errors during the run.
TIA.
R.