The most probable reason for the observed behaviour is that the nodes have additional active network interfaces that cannot be used to pass data between them. Examples of such interfaces are virtual Ethernet devices (e.g. on systems with virtualisation enabled) and tunnels. Open MPI tries to maximise the network bandwidth by cycling over the available endpoints on each node, on the presumption that different IP addresses are routed over different physical networks and hence more bandwidth is available. That is why it fails with more than one message: the first message goes to the reachable IP address of the node, while the second one gets directed to an unreachable one.
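To make the failure pattern concrete, here is a toy model in Python. It is only a sketch of the cycling idea described above, not Open MPI's actual scheduling code; the function name and the addresses are made up for illustration.

```python
from itertools import cycle


def simulate_sends(peer_addresses, reachable, n_messages):
    """Toy model: rotate over a peer's advertised addresses, one per message."""
    results = []
    rotation = cycle(peer_addresses)
    for msg in range(n_messages):
        addr = next(rotation)
        # An address outside the reachable set stands in for a failed connect().
        status = "ok" if addr in reachable else "No route to host"
        results.append((msg, addr, status))
    return results


# Hypothetical peer that advertises one routable address and one
# virtual-bridge address that other nodes cannot reach.
peer = ["10.0.0.2", "192.168.122.1"]
outcome = simulate_sends(peer, reachable={"10.0.0.2"}, n_messages=2)
for msg, addr, status in outcome:
    print(f"message {msg} -> {addr}: {status}")
```

With a single message only the first address is ever tried, which matches the observation that one message per worker works fine.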
The solution is to either tell Open MPI to ignore the offending interfaces or to specifically state what interfaces are to be used by the TCP BTL and OOB components. This entry in the FAQ gives more details:
Probably the following options would remedy your problem:
--mca btl_tcp_if_exclude 192.168.0.0/16,127.0.0.1/8
--mca oob_tcp_if_exclude 192.168.0.0/16,127.0.0.1/8
Note that the loopback interface has to be part of the excluded interfaces list if the latter is provided.
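If you are unsure which of your addresses the exclude list above would cover, the following Python sketch checks addresses against the same CIDR list using the standard `ipaddress` module. It only illustrates the matching and is not how Open MPI itself parses the parameter; the helper names and the sample addresses are assumptions.

```python
import ipaddress


def parse_exclude_list(value):
    """Split a comma-separated CIDR list into network objects.

    strict=False accepts entries with host bits set, such as 127.0.0.1/8.
    """
    return [
        ipaddress.ip_network(item.strip(), strict=False)
        for item in value.split(",")
        if item.strip()
    ]


def is_excluded(address, networks):
    """Return True if the address falls inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)


networks = parse_exclude_list("192.168.0.0/16,127.0.0.1/8")
# Hypothetical addresses; substitute the ones reported on your nodes.
for addr in ["127.0.0.1", "192.168.1.20", "10.0.0.2"]:
    print(addr, "excluded" if is_excluded(addr, networks) else "kept")
```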
The list of the active interfaces can be obtained with the "/sbin/ifconfig" command. Look for interfaces in state "UP".
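If ifconfig is not installed, a rough alternative on Linux is to read the interface flags from sysfs. The Python sketch below assumes the usual /sys/class/net/<name>/flags layout and the IFF_UP and IFF_LOOPBACK bit values from <linux/if.h>; the function names are made up for illustration.

```python
import os

IFF_UP = 0x1        # interface is administratively up
IFF_LOOPBACK = 0x8  # interface is a loopback device


def describe_flags(flags_text):
    """Decode the hexadecimal flags string found in sysfs, e.g. '0x1003'."""
    flags = int(flags_text.strip(), 16)
    return {"up": bool(flags & IFF_UP), "loopback": bool(flags & IFF_LOOPBACK)}


def list_interfaces(sysfs_root="/sys/class/net"):
    """Return {interface name: decoded flags} for every readable interface."""
    found = {}
    if not os.path.isdir(sysfs_root):
        return found
    for name in sorted(os.listdir(sysfs_root)):
        try:
            with open(os.path.join(sysfs_root, name, "flags")) as handle:
                found[name] = describe_flags(handle.read())
        except (OSError, ValueError):
            # Skip entries that are not interfaces or cannot be read.
            continue
    return found


for name, info in list_interfaces().items():
    print(name, info)
```

Any interface reported as up that is neither the loopback nor on a network shared by all nodes is a candidate for the exclude list.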
Hristo Iliev, PhD - High Performance Computing Team
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
Phone: +49 241 80 24367 - Fax/UMS: +49 241 80 624367
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Claire Williams
Sent: Tuesday, June 18, 2013 7:15 PM
Subject: [OMPI users] Trouble with Sending Multiple messages to the Same Machine
Hi guys ☺!
I'm working with a simple "Hello, World" MPI program that has one master, which sends one message to each worker, receives a message back from each of the workers, and then sends a new message. This unfortunately is not working :(. When the master sends only one message to each worker and then receives the reply, it works fine, but there are problems with sending more than one message to each worker. When that happens, it prints the error:
[[401,1],0][../../../../../openmpi-1.6.3/ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.X.X failed: No route to host (113)
I'm wondering how I can go about fixing this. This program is running across multiple Linux nodes, by the way :).
BTW, I'm a girl.