Nicolas Niclausse wrote:
> Fernando Lemos wrote on 23/03/2010 16:28:
>>> I'm trying to run openmpi (1.4.1) on two clusters; on each cluster, several
>>> interfaces are private;
>>> on cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible
>>> from cluster2.
>>> eth0 inet addr:192.168.160.76 Bcast:192.168.160.255 Mask:255.255.255.0
>>> eth1 inet addr:192.168.159.76 Bcast:192.168.159.255 Mask:255.255.255.0
>>> myri0 inet addr:192.168.162.76 Bcast:192.168.162.255 Mask:255.255.255.0
>>> on cluster2, nodes have 3 interfaces, and only 172.24.110.0/17 is visible
>>> from cluster1
>>> eth0 inet addr:172.24.190.8 Bcast:172.24.191.255 Mask:255.255.192.0
>>> eth1 inet addr:172.24.110.8 Bcast:172.24.127.255 Mask:255.255.128.0
>>> eth2 inet addr:172.24.240.8 Bcast:172.24.255.255 Mask:255.255.192.0
>>> So I'm using this to declare all the other networks as private:
>>> mpirun -machinefile ~/gridnodes --mca opal_net_private_ipv4
>>> but this doesn't work:
>> Have you tried -mca btl_tcp_if_include/exclude?
> I can't do that because the "public" interface is not always eth1 as in
> this example (I have several other clusters with different network
> configurations in my setup).
>>> Why does Open MPI try to connect different private networks, given that
>>> "public" networks exist? Is it a bug, or am I missing something?
>> From what I've seen, I believe OpenMPI tries to find the fastest route
>> to the nodes. In some cases it's trivial to sort that out, in other
>> cases you might need to give it some hints.
> Yes, so I thought that "opal_net_private_ipv4" was the right thing for me,
> but it doesn't work without the patch.
It seems to me that you are hitting a part of the code that treats at
least one of the interfaces as private. When it compares a public
interface with a private one, it assigns a weighting of
CQ_PRIVATE_DIFFERENT_NETWORK. I am not sure why, but that is the
weighting it gives. You can take a look at this FAQ entry,
http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3 which has
links to the paper that explains how all this logic works.
What you are doing seems to make sense. You are trying to declare
which networks are private so that the two remaining networks end up
being treated as public and therefore get the highest weight for a
connection.
I realize this does not help much, but maybe the paper will help out.
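In case it is useful while reading the paper, here is a small sketch that prints, for each interface listed above, the network it belongs to and whether the address falls in a private range. It uses only Python's standard ipaddress module and is not Open MPI code; the helper name describe and the INTERFACES table are mine, with the addresses and masks copied from the ifconfig output in this thread.

```python
import ipaddress

# Interface addresses and netmasks copied from the ifconfig output quoted above.
INTERFACES = {
    "cluster1": {
        "eth0": ("192.168.160.76", "255.255.255.0"),
        "eth1": ("192.168.159.76", "255.255.255.0"),
        "myri0": ("192.168.162.76", "255.255.255.0"),
    },
    "cluster2": {
        "eth0": ("172.24.190.8", "255.255.192.0"),
        "eth1": ("172.24.110.8", "255.255.128.0"),
        "eth2": ("172.24.240.8", "255.255.192.0"),
    },
}


def describe(addr, mask):
    """Return (network in CIDR form, whether Python's ipaddress module
    considers the address private, which includes the RFC 1918 ranges)."""
    iface = ipaddress.ip_interface(f"{addr}/{mask}")
    return str(iface.network), iface.ip.is_private


if __name__ == "__main__":
    for cluster, ifaces in INTERFACES.items():
        for name, (addr, mask) in ifaces.items():
            network, private = describe(addr, mask)
            print(f"{cluster} {name:5} {addr:15} {network:18} private={private}")
```

With these inputs every interface is reported as private, including the two that are mutually visible (192.168.159.0/24 on cluster1, and 172.24.0.0/17 on cluster2, which is the network that 172.24.110.8 with mask 255.255.128.0 belongs to). That may be part of why the selection logic never sees a public network here, though I have not checked this against the Open MPI source.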