Hi Claire,

 

The most probable reason for the observed behaviour is that the nodes have additional active network interfaces that cannot be used to pass data between them. Examples of such interfaces are virtual Ethernet devices (e.g. on systems with virtualisation enabled) and tunnels. Open MPI tries to maximise network bandwidth by cycling over the available endpoints on each node, on the presumption that different IP addresses are routed over different physical networks and hence provide more bandwidth. That is why it fails with more than one message: the first message goes to a reachable IP address of the node, while the second one gets directed to an unreachable one.
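To make the failure pattern concrete, here is a toy model of cycling over endpoints. It is only an illustration, not Open MPI code: the function name `pick_endpoint` and both addresses are made up, and it assumes the first address is routable while the second belongs to a virtual interface that other nodes cannot reach.

```sh
# Toy model of the behaviour described above; this is not Open MPI code.
# Both addresses are hypothetical: assume the first one is routable and the
# second one belongs to a virtual interface that other nodes cannot reach.
pick_endpoint() {
    # $1 = zero-based message number, remaining arguments = peer addresses
    msg=$1
    shift
    # Cycle over the addresses in order, wrapping around at the end.
    idx=$(( $msg % $# ))
    shift "$idx"
    echo "$1"
}

for msg in 0 1 2 3; do
    echo "message $msg -> $(pick_endpoint "$msg" 10.0.0.5 192.168.122.1)"
done
```

In this model every second message is directed to the address that cannot be reached, which matches a program that works with one message per worker and fails with two.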

 

The solution is either to tell Open MPI to ignore the offending interfaces or to state explicitly which interfaces are to be used by the TCP BTL and OOB components. This entry in the FAQ gives more details:

 

http://www.open-mpi.org/faq/?category=tcp#tcp-selection
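The second approach, naming the interfaces to use, might look like the sketch below. The interface name `eth0`, the process count and the program name `./hello` are placeholders, not values taken from your setup. `btl_tcp_if_include` and `oob_tcp_if_include` are the include counterparts of the exclude parameters; for a given component, use either an include list or an exclude list, not both.

```sh
# Placeholder values: replace eth0 with the interface that connects the nodes.
IFACE="eth0"
INCLUDE_OPTS="--mca btl_tcp_if_include $IFACE --mca oob_tcp_if_include $IFACE"
# Print the full command for inspection; drop the echo to launch the job.
echo mpirun -np 4 $INCLUDE_OPTS ./hello
```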

 

The following options would probably remedy your problem:

 

--mca btl_tcp_if_exclude 192.168.0.0/16,127.0.0.1/8

--mca oob_tcp_if_exclude 192.168.0.0/16,127.0.0.1/8

 

Note that once an exclude list is given explicitly, it replaces the default one, so the loopback interface has to be listed in it as well.
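Put together on one command line, this might look like the sketch below. The process count and the program name `./hello` are placeholders, and the 192.168.0.0/16 subnet is only a guess based on the address in your error message. The OOB parameter is spelled `oob_tcp_if_exclude` here, following the framework_component_parameter naming of MCA parameters.

```sh
# Placeholder values: adjust the subnet, process count and program name.
EXCLUDE="192.168.0.0/16,127.0.0.1/8"
MCA_OPTS="--mca btl_tcp_if_exclude $EXCLUDE --mca oob_tcp_if_exclude $EXCLUDE"
# Print the full command for inspection; drop the echo to launch the job.
echo mpirun -np 4 $MCA_OPTS ./hello
```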

 

The list of active interfaces can be obtained with the "/sbin/ifconfig" command. Look for interfaces in state "UP".
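If the ifconfig output is awkward to scan, the same information can be read from sysfs on Linux. This is a minimal sketch, not part of Open MPI: the function name `list_up_interfaces` is made up, and it assumes the kernel exposes /sys/class/net/<interface>/flags, where the lowest bit (IFF_UP) marks an interface as up.

```sh
# List interfaces whose IFF_UP flag (bit 0x1) is set, reading Linux sysfs.
# The optional directory argument exists only to make the function testable.
list_up_interfaces() {
    net_dir="${1:-/sys/class/net}"
    for dev in "$net_dir"/*; do
        # Skip entries that are not interfaces (no readable flags file).
        [ -r "$dev/flags" ] || continue
        flags=$(cat "$dev/flags")
        # IFF_UP is the lowest bit of the interface flags word.
        if [ $(( $flags & 1 )) -ne 0 ]; then
            basename "$dev"
        fi
    done
}

list_up_interfaces
```

Run it on each node and compare the results: any interface that is up on a node but not reachable from the others is a candidate for the exclude list.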

 

--

Hristo Iliev, PhD – High Performance Computing Team

RWTH Aachen University, Center for Computing and Communication

Rechen- und Kommunikationszentrum der RWTH Aachen

Seffenter Weg 23, D 52074 Aachen (Germany)

Phone: +49 241 80 24367 – Fax/UMS: +49 241 80 624367

 

 

From: users-bounces@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Claire Williams
Sent: Tuesday, June 18, 2013 7:15 PM
To: users@open-mpi.org
Subject: [OMPI users] Trouble with Sending Multiple messages to the Same Machine

 

Hi guys ☺!

 

I'm working with a simple "Hello, World" MPI program that has one master and is sending one message to each worker, receives a message back from each of the workers, and re-sends a new message. This unfortunately is not working :(. When the master only sends one message to each worker, and then receives it, it is working fine, but there are problems with sending more than one message to each worker. When it happens, it prints the error:

 

[[401,1],0][../../../../../openmpi-1.6.3/ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.X.X failed: No route to host (113)

 

I'm wondering how I can go about fixing this. This program is running across multiple Linux nodes, by the way :). 

 

BTW, I'm a girl.