Hello,
I'm trying to run Open MPI (1.4.1) across two clusters. On each cluster,
several interfaces are private.
On cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible
from cluster2:
chicon-3
eth0 inet addr:192.168.160.76 Bcast:192.168.160.255 Mask:255.255.255.0
eth1 inet addr:192.168.159.76 Bcast:192.168.159.255 Mask:255.255.255.0
myri0 inet addr:192.168.162.76 Bcast:192.168.162.255 Mask:255.255.255.0
On cluster2, nodes have 3 interfaces, and only 172.24.0.0/17 (the network of
eth1, 172.24.110.8) is visible from cluster1:
netgdx-8
eth0 inet addr:172.24.190.8 Bcast:172.24.191.255 Mask:255.255.192.0
eth1 inet addr:172.24.110.8 Bcast:172.24.127.255 Mask:255.255.128.0
eth2 inet addr:172.24.240.8 Bcast:172.24.255.255 Mask:255.255.192.0
So I'm using this to declare all the other networks as private:
mpirun -machinefile ~/gridnodes --mca opal_net_private_ipv4 \
    "192.168.162.0/24\;192.168.160.0/24\;172.24.192.0/18\;172.24.128.0/18" \
    ./alltoall
But this doesn't work:
[netgdx-8][[64214,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.160.76 failed: No route to host (113)
[netgdx-8][[64214,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.160.76 failed: No route to host (113)
[netgdx-8][[64214,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.160.76 failed: No route to host (113)
[netgdx-8][[64214,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.160.76 failed: No route to host (113)
[netgdx-8][[64214,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.160.76 failed: No route to host (113)
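To rule out a mistake in the private list itself, here is a small check written as a Python sketch. It is not part of Open MPI and only uses the standard ipaddress module; the helper names are mine. It derives each interface's network from the address and mask shown above and reports whether that address falls inside one of the networks passed to opal_net_private_ipv4.

```python
import ipaddress

# Addresses and netmasks copied from the interface listings above.
INTERFACES = {
    "chicon-3": {
        "eth0": ("192.168.160.76", "255.255.255.0"),
        "eth1": ("192.168.159.76", "255.255.255.0"),
        "myri0": ("192.168.162.76", "255.255.255.0"),
    },
    "netgdx-8": {
        "eth0": ("172.24.190.8", "255.255.192.0"),
        "eth1": ("172.24.110.8", "255.255.128.0"),
        "eth2": ("172.24.240.8", "255.255.192.0"),
    },
}

# Networks given to opal_net_private_ipv4 in the mpirun command above.
DECLARED_PRIVATE = [
    ipaddress.ip_network(n)
    for n in (
        "192.168.162.0/24",
        "192.168.160.0/24",
        "172.24.192.0/18",
        "172.24.128.0/18",
    )
]


def network_of(addr, mask):
    """Return the network that an address/netmask pair belongs to."""
    return ipaddress.ip_interface(f"{addr}/{mask}").network


def is_declared_private(addr):
    """True if addr lies inside one of the declared private networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DECLARED_PRIVATE)


def summarize():
    """List (host, interface, network, declared_private) for every interface."""
    rows = []
    for host, ifaces in INTERFACES.items():
        for name, (addr, mask) in ifaces.items():
            rows.append(
                (host, name, str(network_of(addr, mask)), is_declared_private(addr))
            )
    return rows


if __name__ == "__main__":
    for host, name, net, private in summarize():
        label = "declared private" if private else "not listed"
        print(f"{host:9} {name:6} {net:18} {label}")
```

On both nodes only eth1 comes out as "not listed", which is what I intended, so the list seems to cover the right networks.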
The following patch works for me:
diff -u ompi/mca/btl/tcp/btl_tcp_proc.c.orig ompi/mca/btl/tcp/btl_tcp_proc.c
--- ompi/mca/btl/tcp/btl_tcp_proc.c.orig 2010-03-23 14:01:28.000000000 +0100
+++ ompi/mca/btl/tcp/btl_tcp_proc.c 2010-03-23 14:01:50.000000000 +0100
@@ -496,7 +496,7 @@
local_interfaces[i]->ipv4_netmask)) {
weights[i][j] = CQ_PRIVATE_SAME_NETWORK;
} else {
- weights[i][j] = CQ_PRIVATE_DIFFERENT_NETWORK;
+ weights[i][j] = CQ_NO_CONNECTION;
}
best_addr[i][j] = peer_interfaces[j]->ipv4_endpoint_addr;
}
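To make the effect of this one-line change concrete, here is a simplified Python model of the branch touched by the patch. It is only a sketch of my reading of that branch, not the actual Open MPI code: the CQ_* names come from the diff, but the numeric values, the function names, and the way the same-network test is done are assumptions for illustration.

```python
import ipaddress
from enum import IntEnum


class CQ(IntEnum):
    # Names taken from the diff; the numeric ordering is an assumption.
    NO_CONNECTION = 0
    PRIVATE_DIFFERENT_NETWORK = 1
    PRIVATE_SAME_NETWORK = 2


def same_network(local_addr, local_mask, peer_addr):
    """True if peer_addr lies in the network defined by local_addr/local_mask."""
    net = ipaddress.ip_interface(f"{local_addr}/{local_mask}").network
    return ipaddress.ip_address(peer_addr) in net


def private_weight(local_addr, local_mask, peer_addr, patched):
    """Model of the weight given to a pair of private interfaces.

    patched=False mimics the original branch, patched=True the patched one.
    """
    if same_network(local_addr, local_mask, peer_addr):
        return CQ.PRIVATE_SAME_NETWORK
    if patched:
        return CQ.NO_CONNECTION
    return CQ.PRIVATE_DIFFERENT_NETWORK


if __name__ == "__main__":
    # netgdx-8 eth0 (private) towards chicon-3 eth0 (private), the address
    # that shows up in the connect() errors.
    local = ("172.24.190.8", "255.255.192.0")
    peer = "192.168.160.76"
    for patched in (False, True):
        weight = private_weight(local[0], local[1], peer, patched)
        print(f"patched={patched}: {weight.name}")
```

In this model the unpatched branch still gives a non-zero weight to two private interfaces on different networks, so such a pair remains a candidate for a connection; with the patch the pair is excluded.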
Why does Open MPI try to connect interfaces on different private networks,
given that "public" networks exist? Is it a bug, or am I missing something?
--
Nicolas NICLAUSSE Service DREAM
INRIA Sophia Antipolis http://www-sop.inria.fr/
2004 route des lucioles - BP 93 Tel: (33/0) 4 92 38 76 93
06902 SOPHIA-ANTIPOLIS cedex (France) Fax: (33/0) 4 92 38 76 02