Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Paul Kapinos (kapinos_at_[hidden])
Date: 2008-09-09 06:06:29


Hi,

First, consider updating to a newer Open MPI release.

Second, look at the environment on the box from which you start Open MPI
(the one that runs mpirun ...).

Type
ulimit -n
to see how many file descriptors your environment allows (ulimit -a shows
all limits). Note that with older versions of Open MPI (up to and including
1.2.6), every process needs its own file descriptor for each process
started, as far as I remember. Maybe that is your problem? Does your
HelloWorld run OK with around 500 processes?

best regards
PK
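
For reference, a minimal hello-world test program of the kind described in
the quoted message below might look like the following C sketch (an
illustration only, not the poster's actual /main/mpiHelloWorld):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);               /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    MPI_Get_processor_name(name, &len);   /* hostname of this node */

    printf("Hello from %s, rank %d of %d\n", name, rank, size);

    MPI_Finalize();
    return 0;
}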

Prasanna Ranganathan wrote:
> Hi,
>
> I am trying to run a test mpiHelloWorld program that simply initializes
> the MPI environment on all the nodes, prints the hostname and rank of
> each node in the MPI process group and exits.
>
> I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes (the
> nodes have two dual-core CPUs).
>
> I get the following error messages when I run my program as follows:
> mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
> .....
> .....
> .....
> [0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> [0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> [0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> [0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
> connect() failed with errno=113connect() failed with errno=113connect()
> failed with
> errno=113[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>
> connect() failed with
> errno=113[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> [0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
> [0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with
> errno=113[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
> [0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
> connect() failed with errno=113
> .....
> .....
>
> The main thing is that I get these error messages on about 3-4 out of 10
> attempts, with the rest all completing successfully. I have looked through
> the FAQ in detail and also checked the TCP BTL settings but am not able to
> figure it out.
>
> All the 499 nodes have only eth0 active and I get the error even when I
> run the following: mpirun -np 997 -bynode -hostfile nodelist --mca
> btl_tcp_if_include eth0 /main/mpiHelloWorld
>
> I have attached the output of ompi_info --all.
>
> The following is the output of /sbin/ifconfig on the node where I start
> the mpi process (it is one of the 499 nodes)
>
> eth0 Link encap:Ethernet HWaddr 00:03:25:44:8F:D6
> inet addr:10.12.1.11 Bcast:10.12.255.255 Mask:255.255.0.0
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
> TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:580938897359 (554026.5 Mb) TX bytes:689318600552 (657385.4 Mb)
> Interrupt:22 Base address:0xc000
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
> TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:339687635 (323.9 Mb) TX bytes:339687635 (323.9 Mb)
>
>
> Kindly help.
>
> Regards,
>
> Prasanna.