Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Connection timed out on TCP
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-04-28 18:07:08


In principle, there's nothing wrong with using ib0 interfaces for TCP MPI communication, but it does raise the question of why you're using TCP when you have InfiniBand available...?
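For reference, the interface restriction discussed in the quoted messages below is controlled by the TCP BTL's btl_tcp_if_exclude MCA parameter (or its counterpart, btl_tcp_if_include; use one or the other, not both). The following is only a minimal Python sketch of assembling such an mpirun command line. The helper name mpirun_command, the program path and the process count are hypothetical and not taken from this thread. As far as I know, setting btl_tcp_if_exclude replaces the default exclusion list, so the loopback interface is listed explicitly here.

```python
import shlex


def mpirun_command(program, nprocs, exclude_ifaces=("lo", "ib0")):
    """Build an mpirun argv that keeps the TCP BTL off the given interfaces.

    The interface names are assumptions; check the real ones on each node
    (for example with `ip addr`) before relying on them.
    """
    return [
        "mpirun",
        "-np", str(nprocs),
        # Same BTL selection as in the original report.
        "--mca", "btl", "self,sm,tcp",
        # Comma-separated list of interfaces the TCP BTL must not use.
        "--mca", "btl_tcp_if_exclude", ",".join(exclude_ifaces),
        program,
    ]


def as_shell(argv):
    """Render the argv as a single shell-safe string for copy and paste."""
    return shlex.join(argv)
```

Calling as_shell(mpirun_command("./a.out", 4)) gives a line that can be pasted into a job script. Passing exclude_ifaces=("lo",) instead leaves ib0 available to TCP again.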

Aside from that, can you send all the info listed here:

   http://www.open-mpi.org/community/help/
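
While gathering that information, a plain TCP connect test from the failing node to the address in the error message can help separate a routing or firewall problem from an Open MPI problem. This is a minimal Python sketch under stated assumptions: the function name probe_tcp is hypothetical, and the port you test must be one where something is known to be listening on the target (for example sshd on port 22, which is an assumption about your nodes). It is not the port Open MPI itself would use.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection and report the outcome.

    Returns (True, None) on success, or (False, name) where name is the
    symbolic errno (e.g. "ECONNREFUSED") or "ETIMEDOUT" when the attempt
    exceeded the timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except socket.timeout:
        # No answer within the timeout: typical of dropped packets or a
        # route that leads nowhere.
        return False, "ETIMEDOUT"
    except OSError as exc:
        # Fall back to the message text if there is no known errno name.
        return False, errno.errorcode.get(exc.errno, str(exc))
```

For example, probe_tcp("10.7.36.247", 22) run on compute-6-15 would show whether that address is reachable at all over TCP. A timeout there points at routing or filtering, while a refusal means the host answered but nothing listens on that port.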

On Apr 28, 2014, at 11:08 AM, Vince Grimes <tom.grimes_at_[hidden]> wrote:

> After barring the ib0 interfaces, I still get "Connection timed out" errors even over the Ethernet interfaces.
>
> At the end of the output I now get the following messages in addition to the one above:
>
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about:
> client handshake fail
> from the file:
> help-mpi-btl-tcp.txt
> But I couldn't find that topic in the file. Sorry!
> --------------------------------------------------------------------------
>
> The Ethernet switches are managed. Is it likely there is something set wrong?
>
> T. Vince Grimes, Ph.D.
> CCC System Administrator
>
> Texas Tech University
> Dept. of Chemistry and Biochemistry (10A)
> Box 41061
> Lubbock, TX 79409-1061
>
> (806) 834-0813 (voice); (806) 742-1289 (fax)
>
> On 04/25/2014 04:22 PM, users-request_at_[hidden] wrote:
>
>> Message: 3
>> Date: Fri, 25 Apr 2014 14:56:47 -0500
>> From: Vince Grimes <tom.grimes_at_[hidden]>
>> To: <users_at_[hidden]>
>> Subject: [OMPI users] Connection timed out on TCP
>> Message-ID: <535ABDFF.1020409_at_[hidden]>
>> Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
>>
>> There is no firewall on this subnet as it is the internal Ethernet for
>> the cluster.
>>
>> However, I double-checked the offending IPs and discovered they are
>> Infiniband IPoIB addresses! I'm now trying to exclude the ib0 interface
>> as in https://www.open-mpi.org/faq/?category=tcp#tcp-selection
>>
>> T. Vince Grimes, Ph.D.
>> CCC System Administrator
>>
>> Texas Tech University
>> Dept. of Chemistry and Biochemistry (10A)
>> Box 41061
>> Lubbock, TX 79409-1061
>>
>> (806) 834-0813 (voice); (806) 742-1289 (fax)
>>
>> On 04/25/2014 11:00 AM, users-request_at_[hidden] wrote:
>>> Date: Thu, 24 Apr 2014 19:49:26 -0700
>>> From: Ralph Castain <rhc_at_[hidden]>
>>> To: Open MPI Users <users_at_[hidden]>
>>> Subject: Re: [OMPI users] Connection timed out on TCP and notify question
>>> Message-ID: <11462B85-83CA-4B3D-86E5-EDDD9BC872A6_at_[hidden]>
>>> Content-Type: text/plain; charset=us-ascii
>>>
>>> Sounds like either a routing problem or a firewall. Are there multiple NICs on these nodes? Looking at the quoted NIC in your error message, is that the correct subnet we should be using? Have you checked to ensure no firewalls exist on that subnet between the nodes?
>>>
>>> On Apr 24, 2014, at 8:41 AM, Vince Grimes <tom.grimes_at_[hidden]> wrote:
>>>>> Dear all:
>>>>>
>>>>> In the ongoing investigation into why a particular in-house program is not working in parallel over multiple nodes using Open MPI, I have been running with "--mca btl self,sm,tcp" and keep hitting the following error:
>>>>>
>>>>> [compute-6-15.local][[8185,1],0][btl_tcp_endpoint.c:653:mca_btl_tcp_endpoint_complete_connect] connect() to 10.7.36.247 failed: Connection timed out (110)
>>>>>
>>>>> I thought at first it was due to running out of file handles (sockets are considered files), but I have amended limits.d to allow 102400 files (up from the default of 1024), which should be more than enough.
>>>>>
>>>>> What is going on? Trying to connect to 4/20 nodes gave the error above.
>>>>>
>>>>> My second question involves the notify system for btl openib. What does the syslog notifier require in order to work? I want to see if the errors running the same program with openib are due to dropped IB connections.
>>>>>
>>>>> --
>>>>> T. Vince Grimes, Ph.D.
>>>>> CCC System Administrator
>>>>>
>>>>> Texas Tech University
>>>>> Dept. of Chemistry and Biochemistry (10A)
>>>>> Box 41061
>>>>> Lubbock, TX 79409-1061
>>>>>
>>>>> (806) 834-0813 (voice); (806) 742-1289 (fax)
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/