Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] multiple GigE interfaces...
From: Adrian Knoth (adi_at_[hidden])
Date: 2008-06-23 08:15:53


On Wed, Jun 18, 2008 at 05:13:28PM -0700, Muhammad Atif wrote:

> Hi again... I was on a break from Xensocket stuff.... This time some
> general questions...

Hi.

> question. What if I have multiple Ethernet cards (say 5) on two of my
> quad core machines. The IP addresses (and the subnets of course) are
> Machine A Machine B
> eth0 is y.y.1.a y.y.1.z
> eth1 is y.y.4.b y.y.4.y
> eth2 is y.y.4.c ...
> eth3 is y.y.4.d ...
>
> ...

This sounds pretty weird. And I guess your netmasks don't let you tell
the NICs apart, do they?

> from the FAQ's/Some emails in user lists it is clear that if I want
> to run a job on multiple ethernets, I can use --mca btl_tcp_if_include
> eth0,eth1. This

You can, but you don't have to. If you don't specify anything, OMPI
will pick usable interfaces by itself.

> will run the job on two of the subnets utilizing both the Ethernet
> cards. Is it doing some sort of load balancing? or some round robin
> mechanism? What part of code is responsible for this work?

As far as I know, it's handled by OB1 (PML), which does striping across
several BTL instances.

In other words, as long as both segments are equally fast, the load
balancing should do fine. If they differ in performance, OB1 doesn't
find an optimal solution. If you're hitting this case, ask htor; he has
an auto-tuning replacement, but that's not going to be part of OMPI.
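
To illustrate what striping across several links means, here is a
minimal Python sketch. It is not OB1 code; the function name stripe()
and the weights are made up for this example. It only shows how a
message of a given size could be split over N links in proportion to
per-link weights, so that equal weights give equal shares.

```python
def stripe(message_len, weights):
    """Split message_len bytes across links in proportion to weights.

    Illustration only: a hypothetical helper, not how OB1 schedules
    fragments internally.
    """
    total = sum(weights)
    # Integer share per link, rounded down.
    shares = [message_len * w // total for w in weights]
    # Hand out the few bytes lost to rounding, one per link.
    remainder = message_len - sum(shares)
    for i in range(remainder):
        shares[i % len(shares)] += 1
    return shares


if __name__ == "__main__":
    print(stripe(1000, [1, 1]))   # two equally fast links
    print(stripe(1000, [1, 3]))   # second link assumed three times faster
```

With equal weights every link carries the same amount, which is the
case where the load balancing "should do fine".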

> eth1,eth2,eth3,eth4. Notice that all of these ethNs are on same subnet.
> Even in the FAQ's (which mostly answers our lame questions) its not
> entirely clear how communication will be done. Each process will have
> tcp_num_btls equal to interfaces, but then what? Is it some sort of
> load balancing or similar stuff which is not clear in tcpdump?

I suspect you could end up with communication stalls, the typical hang
situation. One problem that might occur: the TCP component looks for
remote addresses on the "same" network. With several NICs in one subnet,
it might be unable to tell whether a remote IP is reachable over the
same physical network or whether it picks the wrong link. In that case,
you won't gain anything.

Another problem: at least the Linux kernel (without tweaking) decides
which interface and address to use for outgoing communication. If you
have multiple subnets, then the kernel would go for the closest match
between local and remote addresses, but in your case, it might be some
kind of lottery.
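
A small Python sketch of why the same-subnet setup is ambiguous. The
addresses and the helper matching_interfaces() are hypothetical (your
mail only gives y.y.N.x placeholders), and this is neither the TCP
component's nor the kernel's actual selection logic. It just lists which
local interfaces share a subnet with a given remote address.

```python
import ipaddress


def matching_interfaces(local_ifaces, remote_ip):
    """Return names of local interfaces whose subnet contains remote_ip.

    local_ifaces maps an interface name to "address/prefix".
    """
    remote = ipaddress.ip_address(remote_ip)
    return [
        name
        for name, cidr in local_ifaces.items()
        if remote in ipaddress.ip_interface(cidr).network
    ]


# Hypothetical layout: one NIC per subnet.
separate = {"eth0": "10.0.1.1/24", "eth1": "10.0.4.1/24"}

# Hypothetical layout resembling the one described: three NICs share a subnet.
shared = {
    "eth0": "10.0.1.1/24",
    "eth1": "10.0.4.1/24",
    "eth2": "10.0.4.2/24",
    "eth3": "10.0.4.3/24",
}

if __name__ == "__main__":
    print(matching_interfaces(separate, "10.0.4.9"))
    print(matching_interfaces(shared, "10.0.4.9"))
```

In the first layout exactly one interface matches the remote address. In
the second, three candidates match equally well, so nothing in the
addressing alone says which link should carry the traffic.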

> related question is what if I want to run 8 process job (on 2x4
> cluster) and want to pin a process to an network interface. OpenMPI to
> my understanding does not give any control of allocating IP to a
> process (like MPICH)

You could just say btl_tcp_if_include=ethX, thus selecting the network
interface you want. Obviously, this requires separate networks.
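
As a rough sketch, this is how such a command line could be assembled
in Python, using the --mca btl_tcp_if_include option quoted earlier in
this thread. The helper build_mpirun_cmd(), the interface name, and the
program name ./my_app are placeholders for illustration. Note that the
setting applies to the whole job started by that mpirun, not to a
single process.

```python
import shlex


def build_mpirun_cmd(interfaces, nprocs, program):
    """Build an mpirun argument list restricting the TCP BTL to interfaces.

    Hypothetical helper; it only assembles the arguments and runs nothing.
    """
    return [
        "mpirun",
        "--mca", "btl_tcp_if_include", ",".join(interfaces),
        "-np", str(nprocs),
        program,
    ]


if __name__ == "__main__":
    cmd = build_mpirun_cmd(["eth1"], 8, "./my_app")
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
```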

> or is there some magical --mca thingie. I think only way to go is
> adding routing tables... am i thinking in right direction? If yes, then
> the performance of my boxes decrease when i trying to force the routing

Routing should be fast, since it's done at kernel level. I cannot speak
for Xen-based virtual interfaces.

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
private: http://adi.thur.de