Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Eric Thibodeau (kyron_at_[hidden])
Date: 2008-09-10 19:26:49


Prasanna Ranganathan wrote:
> Hi,
>
> I have upgraded my Open MPI to 1.2.6 (we run Gentoo, and emerge showed
> 1.2.6-r1 to be the latest stable version of Open MPI).
>
Prasanna, do a sync; 1.2.7 is now in Portage. Please report back.

Eric
> I do still get the following error message when running my test helloWorld
> program:
>
> [10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
> [10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
> [10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
> [10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
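
For reference, errno 113 on Linux is EHOSTUNREACH, which the C library reports
as "No route to host" (the error named in the subject line). A small standalone
C check, illustrative only and not part of Open MPI, confirms the mapping:

    /* Print the message the C library associates with EHOSTUNREACH (113 on Linux). */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        printf("errno %d: %s\n", EHOSTUNREACH, strerror(EHOSTUNREACH));
        /* Expected output on Linux: "errno 113: No route to host" */
        return 0;
    }
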
>
> Again, this error does not happen on every run of the test program; it
> occurs only intermittently.
>
> How do I take care of this?
>
> Regards,
>
> Prasanna.
>
>
> On 9/9/08 9:00 AM, "users-request_at_[hidden]" <users-request_at_[hidden]>
> wrote:
>
>
>> Message: 1
>> Date: Mon, 8 Sep 2008 16:43:33 -0400
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> Subject: Re: [OMPI users] Need help resolving No route to host error
>> with OpenMPI 1.1.2
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <AF302D68-0D30-469E-AFD3-566FF962814B_at_[hidden]>
>> Content-Type: text/plain; charset=WINDOWS-1252; format=flowed;
>> delsp=yes
>>
>> Are you able to upgrade to Open MPI v1.2.7?
>>
>> There were *many* bug fixes and changes in the 1.2 series compared to
>> the 1.1 series; some of them, in particular, dealt with TCP socket
>> timeouts (which are important when dealing with large numbers of MPI
>> processes).
>>
>>
>>
>> On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:
>>
>>
>>> Hi,
>>>
>>> I am trying to run a test mpiHelloWorld program that simply
>>> initializes the MPI environment, prints the hostname and rank of
>>> each process in the MPI job, and exits.
>>>
>>> I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes
>>> (each node has two dual-core CPUs).
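
For reference, a minimal sketch of such a hello-world program (assuming a
hypothetical source file mpiHelloWorld.c; this is not Prasanna's actual code):

    /* Minimal MPI hello world: print each process's hostname and rank, then exit. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, name_len;
        char hostname[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                      /* initialize the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* rank of this process           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);        /* total number of processes      */
        MPI_Get_processor_name(hostname, &name_len); /* hostname of the node           */

        printf("Hello from %s, rank %d of %d\n", hostname, rank, size);

        MPI_Finalize();
        return 0;
    }

Built with mpicc, this is the kind of program the mpirun command below would launch.
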
>>>
>>> I get the following error messages when I run my program as follows:
>>> mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
>>> .....
>>> .....
>>> .....
>>> [0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>> [0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>> [0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>> [0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>> connect() failed with errno=113
>>> connect() failed with errno=113
>>> connect() failed with errno=113
>>> connect() failed with errno=113
>>> [0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>> connect() failed with errno=113
>>> [0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>> [0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>> connect() failed with errno=113
>>> [0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>> [0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>> [0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>> connect() failed with errno=113
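
The file and line in these messages (btl_tcp_endpoint.c:572,
mca_btl_tcp_endpoint_complete_connect) indicate that the TCP BTL is finishing a
previously started non-blocking connect(). A rough sketch of that general
pattern follows; it is illustrative only, not the actual Open MPI code, and the
helper name check_connect_result is made up:

    /* After a non-blocking connect() and a poll()/select() writable event,
     * the deferred connection result is read back via SO_ERROR. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    static int check_connect_result(int sd)
    {
        int so_error = 0;
        socklen_t len = sizeof(so_error);

        if (getsockopt(sd, SOL_SOCKET, SO_ERROR, &so_error, &len) < 0) {
            return -1;                  /* getsockopt itself failed */
        }
        if (so_error != 0) {
            errno = so_error;           /* deferred error, e.g. EHOSTUNREACH == 113 */
            fprintf(stderr, "connect() failed with errno=%d (%s)\n",
                    so_error, strerror(so_error));
            return -1;
        }
        return 0;                       /* connection established */
    }

Errno 113 at this point usually means the peer could not be reached at all (for
example an ARP or routing failure, or a firewall rejecting the connection).
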
>>> .....
>>> .....
>>>
>>> The main thing is that I get these error messages in roughly 3 or 4
>>> out of 10 attempts, with the rest completing successfully. I have
>>> looked through the FAQ in detail and also checked the TCP BTL
>>> settings, but I am not able to figure it out.
>>>
>>> All the 499 nodes have only eth0 active and I get the error even
>>> when I run the following: mpirun -np 997 -bynode -hostfile nodelist
>>> --mca btl_tcp_if_include eth0 /main/mpiHelloWorld
>>>
>>> I have attached the output of ompi_info --all.
>>>
>>> The following is the output of /sbin/ifconfig on the node where I
>>> start the MPI job (it is one of the 499 nodes):
>>>
>>> eth0      Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
>>>           inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
>>>           TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
>>>           Interrupt:22 Base address:0xc000
>>>
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>           RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:0
>>>           RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)
>>>
>>>
>>> Kindly help.
>>>
>>> Regards,
>>>
>>> Prasanna.
>>>
>>> <ompi_info.rtf>
>>>
>