Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Strange Net problem
From: Gabriele Fatigati (g.fatigati_at_[hidden])
Date: 2009-04-01 09:57:49


Hi Ralph,
Unfortunately, I can't upgrade Open MPI on this machine at the moment.
Is there a way to limit or reduce the probability of this error?

2009/4/1 Ralph Castain <rhc_at_[hidden]>:
> Hi Gabriele
>
> I don't think this is a timeout issue. OMPI 1.2.x doesn't scale very well to
> that size due to a requirement that the underlying out-of-band system fully
> connect at the TCP level. Thus, every process in your job will be opening
> 2002 sockets (one to every other process, one to the local orted, and one
> back to mpirun). More than likely, you are simply running out of sockets on
> your nodes.
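
A rough way to test the socket-exhaustion explanation above is to compare the per-process connection count with the open-file limit on a compute node. The sketch below is only an estimate under stated assumptions, not Open MPI code: the function names are hypothetical, and it counts only the connections listed above (every other process, the local orted, and mpirun). Listening sockets and other open descriptors are ignored, so the real figure may be slightly higher.

```python
import resource


def oob_sockets_per_process(nprocs: int) -> int:
    """Estimate out-of-band sockets per process when every process
    connects to every other one (hypothetical helper, not an OMPI API)."""
    peers = nprocs - 1      # one socket to every other process
    local_daemon = 1        # one to the local orted
    launcher = 1            # one back to mpirun
    return peers + local_daemon + launcher


def fits_in_fd_limit(nprocs: int, soft_limit=None) -> bool:
    """Return True if the estimated socket count stays below the soft
    open-file limit. If no limit is passed, query this process's own."""
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return oob_sockets_per_process(nprocs) < soft_limit


if __name__ == "__main__":
    n = 2000
    print("estimated sockets per process:", oob_sockets_per_process(n))
    print("fits in this node's soft limit:", fits_in_fd_limit(n))
```

Running this on a compute node (rather than the login node) shows whether the soft limit there is below the estimate. A common default soft limit of 1024 would not be enough for a 2000-process job under this scheme.
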
>
> For a job this size, I would recommend upgrading to OMPI 1.3.1. This uses a
> routing scheme for the out-of-band system, so each process only opens 1
> socket to its local daemon. Much more scalable, and I think it would solve
> this problem. It will also start much faster, as a bonus.
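
To illustrate why the routed scheme scales better, the following sketch compares job-wide connection counts for the two designs described above. It is a simplified model with a hypothetical function name. In the fully connected case it counts each process pair once, plus one link per process to its orted and one to mpirun. In the routed case it counts only the single socket each process opens to its local daemon, and leaves out daemon-to-daemon links, which are not described here.

```python
def total_oob_connections(nprocs: int, routed: bool) -> int:
    """Rough job-wide count of out-of-band connections.

    routed=False: every process pair is connected, and each process
                  also connects to its local orted and to mpirun.
    routed=True:  each process opens one socket to its local daemon
                  (links between daemons are not modeled).
    """
    if routed:
        return nprocs
    pairwise = nprocs * (nprocs - 1) // 2   # each pair counted once
    to_daemon_and_mpirun = 2 * nprocs
    return pairwise + to_daemon_and_mpirun


if __name__ == "__main__":
    n = 2000
    print("fully connected:", total_oob_connections(n, routed=False))
    print("routed:         ", total_oob_connections(n, routed=True))
```

Under this model the fully connected count grows quadratically with the number of processes, while the routed count grows linearly.
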
>
> HTH
> Ralph
>
>
> On Apr 1, 2009, at 3:58 AM, Gabriele Fatigati wrote:
>
>> Dear OpenMPI developers,
>> I have a strange problem while running my application (2000
>> processors). I'm using openmpi 1.2.22 over InfiniBand. The following is
>> the mca-params.conf:
>>
>>
>> btl = ^tcp
>> btl_tcp_if_exclude = eth0,ib0,ib1
>> oob_tcp_include = eth1,lo,eth0
>> btl_openib_warn_default_gid_prefix = 0
>> btl_openib_ib_timeout   = 20
>>
>> At a certain point in my run, the application died with this message:
>>
>> [node265:05593] [0,1,1679]-[0,1,1680] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36645 failed: Software caused connection abort
>> (103)
>> [node484:06545] [0,1,1617]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed: Software caused connection abort
>> (103)
>> [node295:05394] [0,1,1649]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed: Software caused connection abort
>> (103)
>> [node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed: Software caused connection abort
>> (103)
>> [node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed, connecting over all interfaces
>> failed!
>>
>> My question is: does this error depend on some timeout? How can I solve it?
>> Thanks in advance.
>>
>> --
>> Ing. Gabriele Fatigati
>>
>> Parallel programmer
>>
>> CINECA Systems & Tecnologies Department
>>
>> Supercomputing Group
>>
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>
>> www.cineca.it                    Tel:   +39 051 6171722
>>
>> g.fatigati [AT] cineca.it
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

-- 
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it                    Tel:   +39 051 6171722
g.fatigati [AT] cineca.it