Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bug in oob_tcp_[in|ex]clude?
From: Brian Dobbins (bdobbins_at_[hidden])
Date: 2007-12-17 20:58:53


Hi Marco and Jeff,

  My own knowledge of OpenMPI's internals is limited, but I thought I'd add
my less-than-two-cents...

> > I've found only a way in order to have tcp connections binded only to
> > the eth1 interface, using both the following MCA directives in the
> > command line:
> >
> > mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_include
> > lo,eth0,ib0,ib1 .....
> >
> > This sounds me as bug.
>
> Yes, it does. Specifying the MCA same param twice on the command line
> results in undefined behavior -- it will only take one of them, and I
> assume it'll take the first (but I'd have to check the code to be sure).

  I *think* that Marco intended to write:
  mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_exclude lo,eth0,ib0,ib1 ...

  Is this correct? So you're not specifying include twice; you're
specifying include *and* exclude, so that each interface is explicitly listed
in one list or the other. I remember encountering this behaviour as well, in a
slightly different form, but I can't seem to reproduce it now either.
That said, with these options, won't the MPI traffic (as opposed to the OOB
traffic) still use the eth1, ib0 and ib1 interfaces? To restrict it to that
NIC you'd also need to add '-mca btl_tcp_if_include eth1', I think.
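
  To make the OOB/MPI distinction concrete, here is a rough dry-run sketch
that only prints a command restricting both kinds of traffic to eth1. It
assumes the TCP BTL parameter is spelled btl_tcp_if_include and that your
release still accepts oob_tcp_include (check 'ompi_info --param oob tcp' and
'ompi_info --param btl tcp'). The process count and ./my_mpi_app are
placeholders, not from this thread.

```shell
#!/bin/sh
IFACE=eth1            # interface named in the thread
APP=./my_mpi_app      # hypothetical placeholder program

# Print (do not run) an mpirun line that pins both the out-of-band
# channel and the MPI TCP traffic to the same interface.
restricted_mpirun_cmd() {
    echo mpirun -np 2 \
        --mca oob_tcp_include "$IFACE" \
        --mca btl_tcp_if_include "$IFACE" \
        "$APP"
}

restricted_mpirun_cmd
```

  Copy the printed line (or drop the 'echo') once the parameter names are
confirmed for your installation.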

  As for the 'connection errors', there are two odd things to check. The first
is that all of your nodes using eth1 actually have correct /etc/hosts mappings
to the other nodes. One system I ran on had this problem when some nodes
had one IP address for node002 and another node had a different one. This
should be easy enough to test by running on one node first, then on two nodes
that you're sure have the correct addresses.
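
  One possible way to spot such a mismatch without launching a job is to
gather each node's /etc/hosts into local files and compare what they say about
one name. The sketch below assumes you have already copied the files somewhere
(for example with 'ssh <node> cat /etc/hosts > hosts.<node>'). The function
and file names are made up for illustration.

```shell
#!/bin/sh
# Print the first IP address that a hosts-format file maps to a name.
# Usage: lookup_ip NAME FILE
lookup_ip() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                       # strip comments
        { for (i = 2; i <= NF; i++)              # field 1 is the address
              if ($i == name) { print $1; exit } }
    ' "$2"
}

# Return 0 if every listed hosts file gives NAME the same address
# (a missing entry counts as a mismatch), otherwise print the
# differences and return 1.
# Usage: check_consistent NAME FILE...
check_consistent() {
    name=$1; shift
    count=0
    status=0
    for f in "$@"; do
        ip=$(lookup_ip "$name" "$f")
        if [ "$count" -eq 0 ]; then
            first=$ip                            # reference value
        elif [ "$ip" != "$first" ]; then
            echo "mismatch for $name: $f has '$ip', first file has '$first'"
            status=1
        fi
        count=$((count + 1))
    done
    return $status
}
```

  For instance, 'check_consistent node002 hosts.node001 hosts.node002' would
print a line for each file that disagrees with the first one.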

  The second thing to check is whether you're launching an MPMD program. If
so, you need to use '-gmca <whatever>' instead of '-mca <whatever>'.
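
  As a hedged illustration of the MPMD case, the dry-run below prints a
colon-separated launch in which -gmca sets one parameter for every
application context. ./master, ./worker and the process counts are
hypothetical, and btl_tcp_if_include is only an example parameter.

```shell
#!/bin/sh
# Print (do not run) an MPMD launch; -gmca applies the parameter to
# all contexts, whereas -mca would apply only to the one it precedes.
mpmd_cmd() {
    echo mpirun -gmca btl_tcp_if_include eth1 \
        -np 1 ./master : -np 2 ./worker
}

mpmd_cmd
```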

  Hope some of that is at least a tad useful. :)

  Cheers,
  - Brian