Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bug in oob_tcp_[in|ex]clude?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-12-17 17:19:50


On Dec 17, 2007, at 8:35 AM, Marco Sbrighi wrote:

> I'm using Open MPI 1.2.2 over OFED 1.2 on a 256-node, dual-Opteron,
> dual-core Linux cluster, with an InfiniBand 4x interconnect, of course.
>
> Each cluster node is equipped with 4 (or more) Ethernet interfaces,
> namely 2 gigabit ones plus 2 IPoIB. The two gigabit interfaces are named
> eth0 and eth1, while the two IPoIB interfaces are named ib0 and ib1.
>
> eth0 is a management network with poor performance, and furthermore
> we don't want to use the ib* interfaces to carry MPI's traffic (neither
> OOB nor TCP), so we would like eth1 to be used for Open MPI OOB and TCP.
>
> In order to drive the OOB over only eth1, I've tried various combinations
> of oob_tcp_[ex|in]clude MCA statements, starting from the obvious:
>
> oob_tcp_exclude = lo,eth0,ib0,ib1
>
> then trying the other obvious one:
>
> oob_tcp_include = eth1

This one statement (_include) should be sufficient.

Presumably these statements are in a config file that Open MPI actually
reads, such as $HOME/.openmpi/mca-params.conf?
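
One quick way to double-check what is in that file is sketched below. The
helper name show_oob_tcp_settings is hypothetical (it is not part of Open
MPI), and the sketch assumes the file uses plain "name = value" lines like
the ones quoted above. It only shows what the file contains, not whether
Open MPI actually reads it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print the oob_tcp_* assignments
# found in an MCA parameter file. Defaults to the per-user file named above.
show_oob_tcp_settings() {
    conf="${1:-$HOME/.openmpi/mca-params.conf}"
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    # Comment lines do not match; only "oob_tcp_<something> =" lines are printed.
    grep -E '^[[:space:]]*oob_tcp_[a-z_]*[[:space:]]*=' "$conf"
}
```

Calling show_oob_tcp_settings with no argument inspects the per-user file.
Running "ompi_info --param oob tcp" on a compute node should additionally
report the values that Open MPI itself ends up with.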

> and both at the same time.
>
> Next I've tried the following:
>
> oob_tcp_exclude = eth0
>
> but after the job starts, I still have a lot of TCP connections
> established using eth0 or ib0 or ib1.
> Furthermore, the following error occurs:
>
> [node191:03976] [0,1,14]-[0,1,12] mca_oob_tcp_peer_complete_connect:
> connection failed: Connection timed out (110) - retrying

This is quite odd. :-(

> I've found only one way to have TCP connections bound only to
> the eth1 interface: using both of the following MCA directives on the
> command line:
>
> mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_include
> lo,eth0,ib0,ib1 .....
>
> This sounds like a bug to me.

Yes, it does. Specifying the same MCA param twice on the command line
results in undefined behavior -- it will only take one of them, and I
assume it'll take the first (but I'd have to check the code to be sure).
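
Since a repeated parameter is easy to miss on a long command line, here is a
minimal sketch of a check for it. The function dup_mca_params is a
hypothetical helper, not an Open MPI tool; it assumes parameters are passed
as "--mca name value" (or "-mca name value") and simply reports names that
occur more than once.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print every MCA parameter name
# that follows "--mca" or "-mca" more than once in the given argument list.
dup_mca_params() {
    prev=""
    for arg in "$@"; do
        case "$prev" in
            --mca|-mca) printf '%s\n' "$arg" ;;  # arg is a parameter name
        esac
        prev="$arg"
    done | sort | uniq -d                        # keep only repeated names
}

# The command line quoted above repeats the same parameter:
dup_mca_params --mca oob_tcp_include eth1 --mca oob_tcp_include lo,eth0,ib0,ib1
# prints: oob_tcp_include
```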

> Is there someone able to reproduce this behaviour?
> If this is a bug, are there fixes?

I'm unfortunately unable to reproduce this behavior. I have a test
cluster with 2 IP interfaces: ib0, eth0. I have tried several
combinations of MCA params with 1.2.2:

    --mca oob_tcp_include ib0
    --mca oob_tcp_include ib0,bogus
    --mca oob_tcp_include eth0
    --mca oob_tcp_include eth0,bogus
    --mca oob_tcp_exclude ib0
    --mca oob_tcp_exclude ib0,bogus
    --mca oob_tcp_exclude eth0
    --mca oob_tcp_exclude eth0,bogus

All do as they are supposed to -- including or excluding ib0 or eth0.

I do note, however, that the handling of these parameters changed in
1.2.3 -- as well as their names. The names changed to
"oob_tcp_if_include" and "oob_tcp_if_exclude" to match the MCA
parameter naming conventions of other components.
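
For scripts that have to work on both sides of the rename, something like the
following sketch could pick the right name. oob_tcp_param is a hypothetical
helper, and it assumes a plain numeric "major.minor.release" version string
and that everything before 1.2.3 uses the old names; I have only described
the 1.2 series here, so treat other versions as unverified.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print the OOB TCP include or
# exclude parameter name for a given Open MPI version.
# Assumption: versions before 1.2.3 use oob_tcp_<kind>; 1.2.3 and later use
# oob_tcp_if_<kind>. Only the 1.2 series is covered by the text above.
oob_tcp_param() {
    kind="$1"       # "include" or "exclude"
    version="$2"    # e.g. "1.2.4" (plain numbers only)
    old_ifs="$IFS"; IFS=.
    set -- $version # split the version on dots into $1 $2 $3
    IFS="$old_ifs"
    major="${1:-0}"; minor="${2:-0}"; release="${3:-0}"
    if [ "$major" -gt 1 ] ||
       { [ "$major" -eq 1 ] && [ "$minor" -gt 2 ]; } ||
       { [ "$major" -eq 1 ] && [ "$minor" -eq 2 ] && [ "$release" -ge 3 ]; }; then
        echo "oob_tcp_if_$kind"
    else
        echo "oob_tcp_$kind"
    fi
}
```

The result can then be passed as the name after --mca, followed by the
interface list (eth1 in your case).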

Could you try with 1.2.3 or 1.2.4 (1.2.4 is the most recent; 1.2.5 is
due out "soon" -- it *may* get out before the holiday break, but no
promises...)?

If you can't upgrade, let me know and I can provide a debugging patch
that will give us a little more insight into what is happening on your
machines. Thanks.

-- 
Jeff Squyres
Cisco Systems