
Subject: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Prasanna Ranganathan (prasanna_at_[hidden])
Date: 2008-09-08 16:36:49


Hi,

I am trying to run a test mpiHelloWorld program that simply initializes the
MPI environment on all the nodes, prints the hostname and rank of each
process in the MPI job, and exits.
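
For reference, the program is essentially the following (a minimal sketch;
the exact output formatting in my actual code may differ):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(hostname, &len);

    /* Print this process's host and rank, then exit */
    printf("Hello from %s, rank %d of %d\n", hostname, rank, size);

    MPI_Finalize();
    return 0;
}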

I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes (each
node has two dual-core CPUs).

I get the following error messages when I run my program as follows: mpirun
-np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.....
.....
.....
[0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
.....
.....
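
(For what it is worth, errno 113 on Linux is EHOSTUNREACH, i.e. "No route
to host", which can be confirmed with a trivial check like the one below;
this is just an illustrative snippet, not part of my test program.)

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* On Linux, errno 113 is EHOSTUNREACH ("No route to host") */
    printf("errno 113 means: %s\n", strerror(113));
    return 0;
}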

The odd thing is that I get these error messages on roughly 3-4 out of
every 10 attempts; the remaining runs complete successfully. I have gone
through the FAQ in detail and also checked the TCP BTL settings, but I am
not able to figure out the cause.

All 499 nodes have only eth0 active, and I get the error even when I run
the following: mpirun -np 997 -bynode -hostfile nodelist --mca
btl_tcp_if_include eth0 /main/mpiHelloWorld

I have attached the output of ompi_info --all.

The following is the output of /sbin/ifconfig on the node from which I
start the MPI job (it is one of the 499 nodes):

eth0 Link encap:Ethernet HWaddr 00:03:25:44:8F:D6
          inet addr:10.12.1.11 Bcast:10.12.255.255 Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
          TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
          Interrupt:22 Base address:0xc000

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:339687635 (323.9 Mb) TX bytes:339687635 (323.9 Mb)
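
In case it helps narrow things down, a raw TCP connect() between two nodes,
outside of MPI, should hit the same errno if routing between them is
actually broken. A minimal sketch follows (the peer address 10.12.1.12 and
port 22 are placeholders; they would need to point at another cluster node
and a service that is actually listening, e.g. sshd):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* Placeholder peer: substitute the IP of another cluster node
       and a port with a listener (e.g. sshd on port 22). */
    const char *peer_ip = "10.12.1.12";
    int peer_port = 22;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(peer_port);
    inet_pton(AF_INET, peer_ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        /* errno=113 (EHOSTUNREACH) here would match the Open MPI failure */
        printf("connect() to %s:%d failed with errno=%d (%s)\n",
               peer_ip, peer_port, errno, strerror(errno));
    } else {
        printf("connect() to %s:%d succeeded\n", peer_ip, peer_port);
    }
    close(fd);
    return 0;
}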

Kindly help.

Regards,

Prasanna.