On Mon, 2007-12-17 at 20:58 -0500, Brian Dobbins wrote:
> Hi Marco and Jeff,
> My own knowledge of OpenMPI's internals is limited, but I thought
> I'd add my less-than-two-cents...
> > I've found only a way in order to have tcp connections binded only to
> > the eth1 interface, using both the following MCA directives in the
> > command line:
> > mpirun .... --mca oob_tcp_include eth1 --mca
> > lo,eth0,ib0,ib1 .....
> > This sounds me as bug.
> Yes, it does. Specifying the same MCA param twice on the command line
> results in undefined behavior -- it will only take one of them, and I
> assume it'll take the first (but I'd have to check the code to be sure).
> I think that Marco intended to write:
> mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_exclude
> lo,eth0,ib0,ib1 ...
No, I intended to write exactly what I wrote. The double statement is
reported by --mca mpi_show_mca_params exactly as if I had written a
single statement, as follows:
--mca oob_tcp_include eth1,lo,eth0,ib0,ib1
> Is this correct? So you're not specifying include twice, you're
> specifying include and exclude, so each interface is explicitly stated
> in one list or the other. I remember encountering this behaviour as
> well, in a slightly different format, but I can't seem to reproduce it
> now either.
Notice that the two lists never intersect.
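
For anyone who wants to check that property before launching, here is a
minimal sketch in Python. It only splits two comma-separated interface
lists and reports any name that appears in both. The helper names
(parse_if_list, overlap) are made up for illustration and are not part
of Open MPI; the script says nothing about how Open MPI itself parses
these values.

```python
def parse_if_list(value):
    """Split a comma-separated interface list such as 'lo,eth0,ib0'."""
    return [name.strip() for name in value.split(",") if name.strip()]


def overlap(include, exclude):
    """Return the interface names present in both lists, sorted."""
    return sorted(set(parse_if_list(include)) & set(parse_if_list(exclude)))


if __name__ == "__main__":
    # Values taken from the command line discussed in this thread.
    common = overlap("eth1", "lo,eth0,ib0,ib1")
    if common:
        print("interfaces in both lists:", ", ".join(common))
    else:
        print("the two lists do not intersect")
```

An empty result means every interface is named in at most one of the two
lists, which is the situation described above.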
> That said, with these options, won't the MPI traffic (as opposed to
> the OOB traffic) still use the eth1,ib0 and ib1 interfaces? You'd
> need to add '-mca btl_tcp_include eth1' in order to say it should only
> go over that NIC, I think.
Yes, I know. In fact, -mca btl_tcp_[if]_exclude lo,eth0,ib0,ib1
seems to work fine. I've been using this MCA parameter since Open MPI
1.2.1, so the trouble with oob_tcp_[if]_[in|ex]clude sounded quite
strange to me; after all, the code used for the parser should be more
or less the same.
> As for the 'connection errors', two bizarre things to check are,
> first, that all of your nodes using eth1 actually have
> correct /etc/hosts mappings to the other nodes. One system I ran on
> had this problem when some nodes had an IP address for node002 as one
> thing, and another node had node002's IP address as something
> different. This should be easy enough by trying to run on one node
> first, then two nodes that you're sure have the correct addresses.
Yes, I've already verified that.
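
In case it helps others hitting the same symptom, the check Brian
describes can be sketched roughly as follows. This is only an
illustration under assumptions: it expects you to have already collected
the text of /etc/hosts from each node by whatever means you prefer, and
the function names (parse_hosts, find_conflicts) as well as the sample
addresses are hypothetical.

```python
from collections import defaultdict


def parse_hosts(text):
    """Map each hostname in an /etc/hosts-style text to its first IP."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name, ip)  # first entry wins
    return mapping


def find_conflicts(hosts_by_node):
    """Return {hostname: {ip: [nodes]}} for names that resolve differently."""
    seen = defaultdict(dict)
    for node, text in hosts_by_node.items():
        for name, ip in parse_hosts(text).items():
            seen[name].setdefault(ip, []).append(node)
    return {name: ips for name, ips in seen.items() if len(ips) > 1}


if __name__ == "__main__":
    # Made-up sample data: node003 has a stale entry for node002.
    sample = {
        "node001": "10.0.1.1 node001\n10.0.1.2 node002\n",
        "node003": "10.0.1.1 node001\n10.0.9.2 node002  # stale\n",
    }
    for name, ips in find_conflicts(sample).items():
        print(name, "resolves differently:", ips)
```

Any hostname that shows up in the result is mapped to different
addresses on different nodes, which is the mismatch described above.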
> .. The second situation is if you're launching an MPMD program.
> Here, you need to use '-gmca <whatever>' instead of '-mca <whatever>'.
No, currently I'm using only SPMD ones, and I hope to use them for the
rest of the century :-)
> Hope some of that is at least a tad useful. :)
Thank you very much, Brian.
> - Brian
Marco Sbrighi m.sbrighi_at_[hidden]
CINECA Interuniversity Computing Centre
via Magnanelli, 6/3
40033 Casalecchio di Reno (Bo) ITALY
tel. 051 6171516