Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Prasanna Ranganathan (prasanna_at_[hidden])
Date: 2008-09-10 18:08:41


Hi,

I have upgraded my Open MPI to 1.2.6 (we run Gentoo, and emerge showed
1.2.6-r1 to be the latest stable version of Open MPI).

I still get the following error messages when running my test helloWorld
program:

[10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Again, this error does not happen on every run of the test program; it
occurs only intermittently.
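
For reference, errno 113 on Linux is EHOSTUNREACH ("No route to host"), the
error named in the subject line; a throwaway C check (illustrative only, not
part of the original thread) prints the mapping:

    #include <stdio.h>
    #include <string.h>

    /* Print the symbolic meaning of the errno value reported by the TCP BTL.
     * On Linux, errno 113 maps to EHOSTUNREACH ("No route to host"). */
    int main(void)
    {
        printf("errno 113: %s\n", strerror(113));
        return 0;
    }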

How do I take care of this?

Regards,

Prasanna.

On 9/9/08 9:00 AM, "users-request_at_[hidden]" <users-request_at_[hidden]>
wrote:

>
> Message: 1
> Date: Mon, 8 Sep 2008 16:43:33 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] Need help resolving No route to host error
> with OpenMPI 1.1.2
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <AF302D68-0D30-469E-AFD3-566FF962814B_at_[hidden]>
> Content-Type: text/plain; charset=WINDOWS-1252; format=flowed;
> delsp=yes
>
> Are you able to upgrade to Open MPI v1.2.7?
>
> There were *many* bug fixes and changes in the 1.2 series compared to
> the 1.1 series; some of them, in particular, dealt with TCP socket
> timeouts (which are important when dealing with large numbers of MPI
> processes).
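
As a quick sanity check after such an upgrade, the installed version and the
TCP BTL's tunable parameters can be inspected with ompi_info (illustrative
command lines, not part of the original message):

    ompi_info | grep "Open MPI:"     # reports the Open MPI version actually in use
    ompi_info --param btl tcp        # lists the TCP BTL's MCA parameters and defaults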
>
>
>
> On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:
>
>> Hi,
>>
>> I am trying to run a test mpiHelloWorld program that simply
>> initializes the MPI environment on all the nodes, prints the
>> hostname and rank of each node in the MPI process group and exits.
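
A minimal test program of the kind described here looks roughly like the
following; this is an illustrative sketch, not the poster's actual
/main/mpiHelloWorld:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, name_len;
        char hostname[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                      /* set up the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* this process's rank        */
        MPI_Comm_size(MPI_COMM_WORLD, &size);        /* total number of processes  */
        MPI_Get_processor_name(hostname, &name_len); /* host the process runs on   */

        printf("Hello from %s, rank %d of %d\n", hostname, rank, size);

        MPI_Finalize();                              /* shut the environment down  */
        return 0;
    }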
>>
>> I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes
>> (each node has two dual-core CPUs).
>>
>> I get the following error messages when I run my program as follows:
>> mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
>> .....
>> .....
>> .....
>> [0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> .....
>> .....
>>
>> The main thing is that I get these error messages in around 3-4 out of
>> 10 attempts, with the rest all completing successfully. I have looked
>> through the FAQs in detail and also checked the TCP BTL settings, but
>> am not able to figure it out.
>>
>> All the 499 nodes have only eth0 active, and I get the error even
>> when I run the following: mpirun -np 997 -bynode --hostfile nodelist
>> --mca btl_tcp_if_include eth0 /main/mpiHelloWorld
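
For reference, the same interface restriction can also be expressed through
Open MPI's MCA environment variables rather than on the mpirun command line
(an equivalent, illustrative form, not taken from the original message):

    export OMPI_MCA_btl_tcp_if_include=eth0
    mpirun -np 997 -bynode --hostfile nodelist /main/mpiHelloWorld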
>>
>> I have attached the output of ompi_info --all.
>>
>> The following is the output of /sbin/ifconfig on the node where I
>> start the MPI process (it is one of the 499 nodes):
>>
>> eth0  Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
>>       inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
>>       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>       RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
>>       TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
>>       collisions:0 txqueuelen:1000
>>       RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
>>       Interrupt:22 Base address:0xc000
>>
>> lo    Link encap:Local Loopback
>>       inet addr:127.0.0.1  Mask:255.0.0.0
>>       UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>       RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
>>       TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
>>       collisions:0 txqueuelen:0
>>       RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)
>>
>>
>> Kindly help.
>>
>> Regards,
>>
>> Prasanna.
>>
>> <ompi_info.rtf>