Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bug in oob_tcp_[in|ex]clude?
From: Marco Sbrighi (m.sbrighi_at_[hidden])
Date: 2007-12-18 11:28:50


On Mon, 2007-12-17 at 20:58 -0500, Brian Dobbins wrote:
> Hi Marco and Jeff,
>
> My own knowledge of OpenMPI's internals is limited, but I thought
> I'd add my less-than-two-cents...
>
> > I've found only a way in order to have tcp connections binded only to
> > the eth1 interface, using both the following MCA directives in the
> > command line:
> >
> > mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_include
> > lo,eth0,ib0,ib1 .....
> >
> > This sounds me as bug.
>
>
> Yes, it does. Specifying the same MCA param twice on the command line
> results in undefined behavior -- it will only take one of them, and I
> assume it'll take the first (but I'd have to check the code to be sure).
>
> I think that Marco intended to write:
> mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_exclude lo,eth0,ib0,ib1 ...

No, I intended to write exactly what I wrote. With the double statement,
--mca mpi_show_mca_params reports the parameter exactly as if I had
written a single statement, as follows:

--mca oob_tcp_include eth1,lo,eth0,ib0,ib1

>
> Is this correct? So you're not specifying include twice, you're
> specifying include and exclude, so each interface is explicitly stated
> in one list or the other. I remember encountering this behaviour as
> well, in a slightly different format, but I can't seem to reproduce it
> now either.

Note that the two lists never intersect.
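
For anyone who wants to check that property before launching, here is a
minimal sketch. It assumes plain comma-separated interface names with no
spaces or wildcard characters; lists_disjoint is a hypothetical helper,
not something Open MPI provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): returns 0 when two
# comma-separated interface lists have no entry in common, 1 otherwise.
lists_disjoint() {
    include=$1
    exclude=$2
    old_ifs=$IFS
    IFS=,                      # split the unquoted lists on commas
    for inc in $include; do
        for exc in $exclude; do
            if [ "$inc" = "$exc" ]; then
                IFS=$old_ifs   # restore splitting before returning
                return 1
            fi
        done
    done
    IFS=$old_ifs
    return 0
}

# The two lists used in this thread.
if lists_disjoint "eth1" "lo,eth0,ib0,ib1"; then
    echo "disjoint"
else
    echo "overlap"
fi
```

This only validates the lists themselves; it says nothing about how
Open MPI interprets them.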

> That said, with these options, won't the MPI traffic (as opposed to
> the OOB traffic) still use the eth1,ib0 and ib1 interfaces? You'd
> need to add '-mca btl_tcp_include eth1' in order to say it should only
> go over that NIC, I think.

Yes, I know. In fact, -mca btl_tcp_[if]_exclude lo,eth0,ib0,ib1 seems to
work fine. I have been using this MCA parameter since Open MPI 1.2.1, so
the trouble with oob_tcp_[if]_[in|ex]clude seemed quite strange to me;
after all, the code used for the parser should be more or less the
same .....

>
> As for the 'connection errors', two bizarre things to check are,
> first, that all of your nodes using eth1 actually have
> correct /etc/hosts mappings to the other nodes. One system I ran on
> had this problem when some nodes had an IP address for node002 as one
> thing, and another node had node002's IP address as something
> different. This should be easy enough by trying to run on one node
> first, then two nodes that you're sure have the correct addresses.

Yes, I've already verified that.
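
For anyone who wants to repeat that kind of check, here is a minimal
sketch. It assumes you have already gathered a copy of /etc/hosts from
each node into local files; lookup_ip and same_mapping are hypothetical
helper names, and only the first matching entry in each file is
considered.

```shell
#!/bin/sh
# Hypothetical helper: print the first IP address that a hosts-format
# file ($1) maps to the host name or alias ($2).
lookup_ip() {
    awk -v name="$2" '
        $1 ~ /^#/ { next }     # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i == name) { print $1; exit }
            }
        }
    ' "$1"
}

# Hypothetical helper: return 0 when every listed hosts file maps the
# name ($1) to the same address, 1 as soon as one file disagrees.
same_mapping() {
    name=$1
    shift
    reference=""
    seen=0
    for file in "$@"; do
        ip=$(lookup_ip "$file" "$name")
        if [ "$seen" -eq 0 ]; then
            reference=$ip
            seen=1
        elif [ "$ip" != "$reference" ]; then
            return 1
        fi
    done
    return 0
}
```

With the copies saved under made-up names such as hosts.node001 and
hosts.node002, "same_mapping node002 hosts.node001 hosts.node002"
succeeds only when both files agree on node002's address.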

>
> .. The second situation is if you're launching an MPMD program.
> Here, you need to use '-gmca <whatever>' instead of '-mca <whatever>'.
>

No, currently I'm using only SPMD ones, and I hope to use them for the
rest of the century :-)

> Hope some of that is at least a tad useful. :)
>

Thank you very much, Brian,

Marco

> Cheers,
> - Brian
>

-- 
-----------------------------------------------------------------
 Marco Sbrighi  m.sbrighi_at_[hidden]
 HPC Group
 CINECA Interuniversity Computing Centre
 via Magnanelli, 6/3
 40033 Casalecchio di Reno (Bo) ITALY
 tel. 051 6171516