Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Strange Net problem
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-01 09:48:44


Hi Gabriele

I don't think this is a timeout issue. OMPI 1.2.x doesn't scale very
well to that size due to a requirement that the underlying out-of-band
system fully connect at the TCP level. Thus, every process in your job
will be opening 2002 sockets (one to every other process, one to the
local orted, and one back to mpirun). More than likely, you are simply
running out of sockets on your nodes.
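
One rough way to check this is to compare the per-process socket count with the open-file limit on a node. The sketch below assumes the fully connected layout described above (one socket per peer, plus the local orted and mpirun), which gives roughly 2000 sockets per process for this job. The function names are illustrative and not part of Open MPI, and the limit actually being hit may be a different one (for example, ports on the node).

```python
import resource


def full_mesh_sockets(num_procs):
    """Sockets one process opens when the out-of-band layer is fully connected.

    Assumption: one socket to every other process, one to the local
    orted, and one back to mpirun.
    """
    return (num_procs - 1) + 2


def fits_in_fd_limit(num_procs, fd_limit):
    """True if the socket count stays below the given descriptor limit."""
    return full_mesh_sockets(num_procs) < fd_limit


if __name__ == "__main__":
    # Soft limit on open file descriptors for this process (Unix only).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = full_mesh_sockets(2000)
    print(f"sockets needed per process: {needed}, soft fd limit: {soft}")
    if soft == resource.RLIM_INFINITY or fits_in_fd_limit(2000, soft):
        print("fits within the descriptor limit")
    else:
        print("does not fit within the descriptor limit")
```

Run on a compute node, this prints the soft limit the MPI processes would likely inherit. If it is well below the socket count, that would be consistent with the failures appearing only at this job size.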

For a job this size, I would recommend upgrading to OMPI 1.3.1. This
uses a routing scheme for the out-of-band system, so each process only
opens 1 socket to its local daemon. Much more scalable, and I think it
would solve this problem. It will also start much faster, as a bonus.
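
To get a feel for the difference, the two layouts can be compared by counting connections across the whole job. This is a back-of-the-envelope sketch: it counts only the process-level sockets described above, ignores any links between daemons, and the function names are made up for illustration.

```python
def full_mesh_connections(num_procs):
    """Peer-to-peer connections when every process connects to every other."""
    return num_procs * (num_procs - 1) // 2


def routed_connections(num_procs):
    """Connections when each process opens one socket to its local daemon."""
    return num_procs


if __name__ == "__main__":
    n = 2000
    print("fully connected:", full_mesh_connections(n))
    print("routed:", routed_connections(n))
```

The fully connected count grows quadratically with the number of processes, while the routed count grows linearly.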

HTH
Ralph

On Apr 1, 2009, at 3:58 AM, Gabriele Fatigati wrote:

> Dear OpenMPI developers,
> I have a strange problem when running my application (2000
> processors). I'm using openmpi 1.2.22 over Infiniband. The following is
> the mca-params.conf:
>
>
> btl = ^tcp
> btl_tcp_if_exclude = eth0,ib0,ib1
> oob_tcp_include = eth1,lo,eth0
> btl_openib_warn_default_gid_prefix = 0
> btl_openib_ib_timeout = 20
>
> At a certain point in my run, the application died with these messages:
>
> [node265:05593] [0,1,1679]-[0,1,1680] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36645 failed: Software caused connection abort (103)
> [node484:06545] [0,1,1617]-[0,1,1681] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36647 failed: Software caused connection abort (103)
> [node295:05394] [0,1,1649]-[0,1,1681] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36647 failed: Software caused connection abort (103)
> [node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36647 failed: Software caused connection abort (103)
> [node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36647 failed, connecting over all interfaces failed!
>
> My question is: does this error depend on some timeout? How can I solve it?
> Thanks in advance.
>
> --
> Ing. Gabriele Fatigati
>
> Parallel programmer
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.it Tel: +39 051 6171722
>
> g.fatigati [AT] cineca.it