Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Strange Net problem
From: Gabriele Fatigati (g.fatigati_at_[hidden])
Date: 2009-04-01 09:57:49


Hi Ralph,
unfortunately, I can't upgrade Open MPI on this machine at the moment.
Is there a way to limit or reduce the probability of this error?
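
For a rough sense of the numbers in the reply quoted below, here is a minimal sketch of the socket arithmetic. It assumes that each out-of-band connection costs one file descriptor and that the per-process open-file limit (RLIMIT_NOFILE) is the resource being exhausted; neither is confirmed in this thread. The function names and the reserve of 64 descriptors are hypothetical, chosen only for illustration.

```python
import resource  # Unix-only standard library module


def oob_sockets_per_process(num_procs: int, fully_connected: bool) -> int:
    """Estimate the out-of-band TCP sockets opened by one MPI process.

    fully_connected=True follows the 1.2.x description in the quoted reply:
    one socket to every other process, one to the local orted, one to mpirun.
    fully_connected=False follows the routed description: one socket to the
    local daemon only.
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    if fully_connected:
        return (num_procs - 1) + 1 + 1
    return 1


def fits_in_fd_limit(needed: int, soft_limit=None, reserve: int = 64) -> bool:
    """Check whether `needed` sockets fit under the open-file soft limit.

    `reserve` is a hypothetical allowance for files, pipes and other
    descriptors the application already holds (an assumption, not a
    measured value). If `soft_limit` is None, the current process's
    RLIMIT_NOFILE soft limit is used.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return needed + reserve <= soft_limit


if __name__ == "__main__":
    procs = 2000  # job size from the original report
    for label, full in (("fully connected", True), ("routed", False)):
        needed = oob_sockets_per_process(procs, full)
        verdict = "fits" if fits_in_fd_limit(needed) else "exceeds limit"
        print(f"{label}: {needed} sockets per process ({verdict})")
```

Run on a compute node, this only shows whether the estimated socket count is anywhere near that node's current soft limit. It does not prove that the limit is the cause of the connection failures.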

2009/4/1 Ralph Castain <rhc_at_[hidden]>:
> Hi Gabriele
>
> I don't think this is a timeout issue. OMPI 1.2.x doesn't scale very well to
> that size due to a requirement that the underlying out-of-band system fully
> connect at the TCP level. Thus, every process in your job will be opening
> 2002 sockets (one to every other process, one to the local orted, and one
> back to mpirun). More than likely, you are simply running out of sockets on
> your nodes.
>
> For a job this size, I would recommend upgrading to OMPI 1.3.1. This uses a
> routing scheme for the out-of-band system, so each process only opens 1
> socket to its local daemon. Much more scalable, and I think it would solve
> this problem. It will also start much faster, as a bonus.
>
> HTH
> Ralph
>
>
> On Apr 1, 2009, at 3:58 AM, Gabriele Fatigati wrote:
>
>> Dear Open MPI developers,
>> I have a strange problem when running my application (2000
>> processors). I'm using Open MPI 1.2.22 over InfiniBand. The following is
>> the mca-params.conf:
>>
>>
>> btl = ^tcp
>> btl_tcp_if_exclude = eth0,ib0,ib1
>> oob_tcp_include = eth1,lo,eth0
>> btl_openib_warn_default_gid_prefix = 0
>> btl_openib_ib_timeout   = 20
>>
>> At a certain point in my run, the application died with this message:
>>
>> [node265:05593] [0,1,1679]-[0,1,1680] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36645 failed: Software caused connection abort
>> (103)
>> [node484:06545] [0,1,1617]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed: Software caused connection abort
>> (103)
>> [node295:05394] [0,1,1649]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed: Software caused connection abort
>> (103)
>> [node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed: Software caused connection abort
>> (103)
>> [node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed, connecting over all interfaces
>> failed!
>>
>> My question is: does this error depend on some timeout? How can I solve it?
>> Thanks in advance.
>>
>> --
>> Ing. Gabriele Fatigati
>>
>> Parallel programmer
>>
>> CINECA Systems & Tecnologies Department
>>
>> Supercomputing Group
>>
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>
>> www.cineca.it                    Tel:   +39 051 6171722
>>
>> g.fatigati [AT] cineca.it
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it                    Tel:   +39 051 6171722
g.fatigati [AT] cineca.it