Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] New address selection for btl-tcp (was Re: [OMPI svn] svn:open-mpi r17307)
From: Adrian Knoth (adi_at_[hidden])
Date: 2008-02-22 10:04:08

On Fri, Feb 15, 2008 at 09:02:10AM -0500, Tim Prins wrote:

> >> 3. If the exclude list does not contain 'lo', or the include list
> >> contains 'lo', the job hangs when using multiple nodes:
> > That's weird. Loopback interfaces should automatically be excluded right
> > from the beginning. See opal/util/if.c.
> I took a quick glance at this file, and I'd be lying if I said I
> understood what was going on in it. One thing I did notice is that the
> parameter btl_tcp_if_exclude defaults to 'lo', but the user can of
> course overwrite it.

I was wrong. To be more precise, there are conflicting comments in if.c:

#if 0
    if ((ifr->ifr_flags & IFF_LOOPBACK) != 0)


            /* skip interface if it is a loopback device (IFF_LOOPBACK set) */
            /* or if it is a point-to-point interface */
            /* TODO: do we really skip p2p? */
            if(0 != (cur_ifaddrs->ifa_flags & IFF_LOOPBACK)
                    || 0!= (cur_ifaddrs->ifa_flags & IFF_POINTOPOINT)) {


                if ( (! IN6_IS_ADDR_LOOPBACK (&my_addr->sin6_addr)) &&
                     (! IN6_IS_ADDR_LINKLOCAL (&my_addr->sin6_addr))) {
                    /* create interface for newly found address */


                /* generate the interface name on your own ....
                   loopback: lo
                   Rest: eth0, eth1, ..... */

                if (if_list[i].iiFlags & IFF_LOOPBACK) {
                    sprintf (intf.if_name, "lo");
                } else {
                    sprintf (intf.if_name, "eth%u", interface_counter++);

To be honest: When porting to IPv6, I've excluded lo, because I see no
use in using it.

That is what the code reflects: the IPv4 loopback address is included
(IPv4-lo), but ::1 is excluded (IPv6-lo).
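
To make the asymmetry concrete, here is a minimal sketch of a check that treats both address families the same way. It is not code from if.c: the function names is_loopback_sockaddr and is_loopback_text are hypothetical, and the sketch assumes a POSIX system providing inet_pton and IN6_IS_ADDR_LOOPBACK.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of Open MPI): return 1 if sa holds a
 * loopback address, 0 otherwise. IPv4 and IPv6 are treated alike. */
int is_loopback_sockaddr(const struct sockaddr *sa)
{
    if (sa == NULL) {
        return 0;
    }
    if (sa->sa_family == AF_INET) {
        const struct sockaddr_in *in4 = (const struct sockaddr_in *) sa;
        /* The whole 127.0.0.0/8 block is loopback. */
        return (ntohl(in4->sin_addr.s_addr) >> 24) == 127 ? 1 : 0;
    }
    if (sa->sa_family == AF_INET6) {
        const struct sockaddr_in6 *in6 = (const struct sockaddr_in6 *) sa;
        /* ::1 is the only IPv6 loopback address. */
        return IN6_IS_ADDR_LOOPBACK(&in6->sin6_addr) ? 1 : 0;
    }
    return 0;
}

/* Hypothetical convenience wrapper: parse a textual address and classify
 * it. Returns 1 (loopback), 0 (not loopback) or -1 (not parseable). */
int is_loopback_text(const char *text)
{
    struct sockaddr_in in4;
    struct sockaddr_in6 in6;

    memset(&in4, 0, sizeof(in4));
    memset(&in6, 0, sizeof(in6));

    if (inet_pton(AF_INET, text, &in4.sin_addr) == 1) {
        in4.sin_family = AF_INET;
        return is_loopback_sockaddr((const struct sockaddr *) &in4);
    }
    if (inet_pton(AF_INET6, text, &in6.sin6_addr) == 1) {
        in6.sin6_family = AF_INET6;
        return is_loopback_sockaddr((const struct sockaddr *) &in6);
    }
    return -1;
}
```

With a helper like this, the choice to keep or drop loopback could be made in one place for both families, instead of the flag test being disabled for IPv4 while the address test stays active for IPv6.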

> It might be worth looking into this further. If the user got an error or
> the job aborted if they did something wrong with 'lo' I would not worry
> about it at all. But the fact that it causes a hang is worrisome to me.

It could be treated as the user's fault.

I see three approaches:

   a) remove lo globally (in if.c). I expect objections. ;)

   b) print a warning from BTL/TCP if the interfaces in use contain lo.
      Like "Warning: You've included the loopback for communication.
            This may cause hanging processes due to unreachable peers."

   c) Throw away the loopback address on the remote side. But when
      doing so, what's the use for including it at all?
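
Approach b) could look roughly like the sketch below. This is only an illustration: if_list_contains and warn_if_loopback_included are hypothetical names, the include list is assumed to be a plain comma-separated string such as the value of btl_tcp_if_include, and the real BTL/TCP code may parse its parameters differently. Matching on the literal name "lo" is also an assumption that only holds where the loopback interface has that name (BSD systems call it lo0).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if name appears as a complete entry in
 * the comma-separated list, 0 otherwise. "lo" does not match "lo0". */
int if_list_contains(const char *list, const char *name)
{
    size_t name_len;
    const char *p;

    if (list == NULL || name == NULL) {
        return 0;
    }
    name_len = strlen(name);
    p = list;
    while (*p != '\0') {
        /* Length of the current entry, up to the next comma or the end. */
        size_t entry_len = strcspn(p, ",");
        if (entry_len == name_len && strncmp(p, name, name_len) == 0) {
            return 1;
        }
        p += entry_len;
        if (*p == ',') {
            p++;    /* skip the separator */
        }
    }
    return 0;
}

/* Hypothetical helper: print the warning from approach b) to out if the
 * include list names the loopback interface. Returns 1 if it warned. */
int warn_if_loopback_included(const char *if_include, FILE *out)
{
    if (!if_list_contains(if_include, "lo")) {
        return 0;
    }
    fprintf(out,
            "Warning: You've included the loopback for communication.\n"
            "This may cause hanging processes due to unreachable peers.\n");
    return 1;
}
```

A call such as warn_if_loopback_included("eth0,lo", stderr) would print the warning once during setup. It does not prevent the hang, but it would give the user a hint about the likely cause.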

So as mentioned earlier: It could be the user's fault. ;) If he includes
lo, this means he wants to announce the loopback address to remote
peers. And this sounds useless (unless you want local communication
without SM).

Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany