Open MPI User's Mailing List Archives

Subject: [OMPI users] Strange Net problem
From: Gabriele Fatigati (g.fatigati_at_[hidden])
Date: 2009-04-01 05:58:20


Dear Open MPI developers,
I have a strange problem while running my application (2000
processors). I'm using openmpi 1.2.22 over InfiniBand. The following is
my mca-params.conf:

btl = ^tcp
btl_tcp_if_exclude = eth0,ib0,ib1
oob_tcp_include = eth1,lo,eth0
btl_openib_warn_default_gid_prefix = 0
btl_openib_ib_timeout = 20

At a certain point in my run, the application died with this message:

[node265:05593] [0,1,1679]-[0,1,1680] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36645 failed: Software caused connection abort
(103)
[node484:06545] [0,1,1617]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed: Software caused connection abort
(103)
[node295:05394] [0,1,1649]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed: Software caused connection abort
(103)
[node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed: Software caused connection abort
(103)
[node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed, connecting over all interfaces
failed!
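
Every failed connect in the excerpt above targets the same address, 10.161.12.14. The following is a minimal Python sketch of how such records could be tallied by target endpoint and by reporting host. The regular expression is an assumption inferred only from the five records shown here, not from any documented Open MPI log format, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# The log excerpt from this message; mail wrapping splits each record
# across several lines.
LOG = """\
[node265:05593] [0,1,1679]-[0,1,1680] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36645 failed: Software caused connection abort
(103)
[node484:06545] [0,1,1617]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed: Software caused connection abort
(103)
[node295:05394] [0,1,1649]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed: Software caused connection abort
(103)
[node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed: Software caused connection abort
(103)
[node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed, connecting over all interfaces
failed!
"""

# Assumed record shape: [host:pid] [name]-[name] function: connect to ip:port failed
RECORD = re.compile(
    r"\[(?P<host>[^:\]]+):\d+\]\s+\[[\d,]+\]-\[[\d,]+\]\s+"
    r"mca_oob_tcp_peer_try_connect:\s+connect to\s+"
    r"(?P<addr>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)\s+failed"
)


def summarize(log_text):
    """Return (failures per target endpoint, set of reporting hosts)."""
    # Collapse all whitespace so wrapped records become single-line records.
    flat = " ".join(log_text.split())
    targets = Counter()
    sources = set()
    for match in RECORD.finditer(flat):
        targets[(match["addr"], int(match["port"]))] += 1
        sources.add(match["host"])
    return targets, sources


if __name__ == "__main__":
    targets, sources = summarize(LOG)
    for (addr, port), count in sorted(targets.items()):
        print(f"{addr}:{port} failed {count} time(s)")
    print("reported by:", ", ".join(sorted(sources)))
```

Grouping by target rather than by reporting host makes it easier to see whether the failures point at a single node or are spread across the cluster.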

My question is: does this error depend on some timeout? How can I solve it?
Thanks in advance.

-- 
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it                    Tel:   +39 051 6171722
g.fatigati [AT] cineca.it