Open MPI User's Mailing List Archives

Subject: [OMPI users] Connection timed out on TCP
From: Vince Grimes (tom.grimes_at_[hidden])
Date: 2014-04-25 15:56:47


There is no firewall on this subnet as it is the internal Ethernet for
the cluster.

However, I double-checked the offending IPs and discovered they are
InfiniBand IPoIB addresses! I'm now trying to exclude the ib0 interface
from the TCP BTL, as described in the FAQ:
https://www.open-mpi.org/faq/?category=tcp#tcp-selection
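
For anyone hitting the same symptom, here is a minimal Python sketch of the check and the resulting mpirun options. The subnets, the eth0 name, the process count, and ./my_program are assumptions made up for illustration (the thread does not give the actual netmasks); only the 10.7.36.247 address, ib0, and "--mca btl self,sm,tcp" come from this thread. btl_tcp_if_exclude is the MCA parameter covered by the FAQ entry above. As far as I understand it, setting that parameter replaces its default value, so the loopback interface has to be listed again explicitly.

```python
import ipaddress


def interface_for(ip, subnets):
    """Return the name of the interface whose subnet contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, cidr in subnets.items():
        if addr in ipaddress.ip_network(cidr, strict=False):
            return name
    return None


def mpirun_args(exclude):
    """Build MCA options that keep the TCP BTL off the given interfaces."""
    # Loopback is always listed first, because an explicit exclude list
    # is assumed to replace the default one rather than extend it.
    names = ["lo"] + [n for n in exclude if n != "lo"]
    return ["--mca", "btl", "self,sm,tcp",
            "--mca", "btl_tcp_if_exclude", ",".join(names)]


# Hypothetical subnets: substitute the values reported on your own nodes.
subnets = {"eth0": "10.1.0.0/16", "ib0": "10.7.0.0/16"}

culprit = interface_for("10.7.36.247", subnets)
print(culprit)
print(" ".join(["mpirun"] + mpirun_args([culprit]) + ["-np", "4", "./my_program"]))
```

The alternative is to whitelist the Ethernet interface with btl_tcp_if_include instead; the FAQ advises against setting both parameters at once.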

T. Vince Grimes, Ph.D.
CCC System Administrator

Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061

(806) 834-0813 (voice); (806) 742-1289 (fax)

On 04/25/2014 11:00 AM, users-request_at_[hidden] wrote:
> Date: Thu, 24 Apr 2014 19:49:26 -0700
> From: Ralph Castain <rhc_at_[hidden]>
> To: Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] Connection timed out on TCP and notify question
> Message-ID: <11462B85-83CA-4B3D-86E5-EDDD9BC872A6_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> Sounds like either a routing problem or a firewall. Are there multiple
> NICs on these nodes? Looking at the quoted NIC in your error message, is
> that the correct subnet we should be using? Have you checked to ensure
> no firewalls exist on that subnet between the nodes?
>
> On Apr 24, 2014, at 8:41 AM, Vince Grimes <tom.grimes_at_[hidden]> wrote:
>> >Dear all:
>> >
>> > In the ongoing investigation into why a particular in-house program is not working in parallel over multiple nodes using Open MPI (run with "--mca btl self,sm,tcp"), I keep running into the following error:
>> >
>> >[compute-6-15.local][[8185,1],0][btl_tcp_endpoint.c:653:mca_btl_tcp_endpoint_complete_connect] connect() to 10.7.36.247 failed: Connection timed out (110)
>> >
>> >I thought at first it was due to running out of file handles (sockets are considered files), but I have amended limits.d to allow 102400 files (up from the default of 1024), which should be more than enough.
>> >
>> > What is going on? Connection attempts to 4 of the 20 nodes gave the error above.
>> >
>> > My second question involves the notify system for btl openib. What does the syslog notifier require in order to work? I want to see if the errors running the same program with openib are due to dropped IB connections.
>> >
>> >--
>> >T. Vince Grimes, Ph.D.
>> >CCC System Administrator
>> >
>> >Texas Tech University
>> >Dept. of Chemistry and Biochemistry (10A)
>> >Box 41061
>> >Lubbock, TX 79409-1061
>> >
>> >(806) 834-0813 (voice); (806) 742-1289 (fax)