The most probable reason for the observed behaviour is that the nodes have additional active network interfaces that cannot be used to pass data between them. Examples of such interfaces are virtual Ethernet devices (e.g. on systems with virtualisation enabled) and tunnels. Open MPI tries to maximise network bandwidth by cycling over the available endpoints on each node, on the presumption that different IP addresses are routed over different physical networks and hence more bandwidth is available. That is why it fails with more than one message: the first message goes to the reachable IP address of the node, while the second one gets directed to an unreachable one.
The solution is to either tell Open MPI to ignore the offending interfaces or to specifically state what interfaces are to be used by the TCP BTL and OOB components. This entry in the FAQ gives more details:
Probably the following options would remedy your problem:
--mca btl_tcp_if_exclude 192.168.0.0/16,127.0.0.1/8
--mca oob_tcp_if_exclude 192.168.0.0/16,127.0.0.1/8
Note that the loopback interface has to be part of the exclusion list whenever such a list is given explicitly, because an explicit list replaces the default one.
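To see which addresses the exclusion list above would cover, here is a small Python sketch. It is not how Open MPI implements the matching; it only handles CIDR entries (not interface names), and the three test addresses are made up for illustration.

```python
import ipaddress

def is_excluded(address, exclude_list):
    """Return True if `address` falls inside any CIDR entry of `exclude_list`."""
    addr = ipaddress.ip_address(address)
    for entry in exclude_list.split(","):
        # strict=False accepts entries with host bits set, such as 127.0.0.1/8
        network = ipaddress.ip_network(entry.strip(), strict=False)
        if addr in network:
            return True
    return False

# The same value as passed to the two --mca options above.
EXCLUDE = "192.168.0.0/16,127.0.0.1/8"

# Hypothetical addresses, chosen only to illustrate the matching.
for ip in ("192.168.122.1", "127.0.0.1", "10.0.0.5"):
    print(ip, "excluded" if is_excluded(ip, EXCLUDE) else "kept")
```

If the address your nodes actually use to reach each other shows up as excluded, adjust the list accordingly.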
The list of the active interfaces can be obtained with the "/sbin/ifconfig" command. Look for interfaces in state "UP".
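If you would rather not scan the output by eye, something along these lines could pick out the interfaces flagged "UP". This is only a sketch: it assumes the newer ifconfig output format (header lines of the form "name: flags=N<...>"), which differs between systems, and the sample text below is invented, not taken from your nodes.

```python
import re

# Matches header lines such as "eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500"
# (assumed format; older ifconfig versions print the flags differently).
HEADER = re.compile(r"^(\S+?):\s+flags=\d+<([^>]*)>")

def interfaces_up(ifconfig_output):
    """Return the names of all interfaces whose flag list contains UP."""
    names = []
    for line in ifconfig_output.splitlines():
        match = HEADER.match(line)
        if match and "UP" in match.group(2).split(","):
            names.append(match.group(1))
    return names

# Invented sample output, for illustration only.
SAMPLE = """\
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.5  netmask 255.255.255.0  broadcast 10.0.0.255
virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
eth1: flags=4098<BROADCAST,MULTICAST>  mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
"""

print(interfaces_up(SAMPLE))
```

On a real node you would feed it the text produced by /sbin/ifconfig, for example captured with subprocess.run.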
Hristo Iliev, PhD – High Performance Computing Team
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
Phone: +49 241 80 24367 – Fax/UMS: +49 241 80 624367