Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] New address selection for btl-tcp (was Re: [OMPI svn] svn:open-mpi r17307)
From: Tim Prins (tprins_at_[hidden])
Date: 2008-02-15 09:02:10


Adrian Knoth wrote:
> On Fri, Feb 01, 2008 at 11:40:20AM -0500, Tim Prins wrote:
>
>> Adrian,
>
> Hi!
>
> Sorry for the late reply and thanks for your testing.
>
>> 1. There are some warnings when compiling:
>
> I've fixed these issues.
Thanks.
>
>> 2. If I exclude all my tcp interfaces, the connection fails properly,
>> but I do get a malloc request for 0 bytes:
>> [tprins_at_odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_exclude eth0,ib0,lo -np 2 ./ring_c
>> malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)
>> malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)
>> <snip>
>
> Not my fault, but I guess we could fix it anyway. Should we?
It probably should be fixed. But I've noticed that other BTLs (such as
MX) do not properly handle the case where there are no available
interfaces either...
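
As a rough illustration of what such a fix could look like, here is a minimal sketch in C of guarding an allocation whose element count may be zero. The helper name alloc_addr_array and its return codes are hypothetical; this is not the code at btl_tcp_component.c line 844, only one way a caller could learn that there is nothing to allocate instead of passing 0 to malloc().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not Open MPI code: allocate room for `count`
 * elements of `elem_size` bytes each.
 * Returns 0 on success, 1 if there is nothing to allocate (for example,
 * every interface was excluded), and -1 on overflow or allocation failure. */
static int alloc_addr_array(size_t count, size_t elem_size, void **out)
{
    *out = NULL;
    if (count == 0 || elem_size == 0) {
        return 1;                       /* never call malloc(0) */
    }
    if (count > SIZE_MAX / elem_size) {
        return -1;                      /* count * elem_size would overflow */
    }
    *out = malloc(count * elem_size);
    return (*out != NULL) ? 0 : -1;
}
```

A caller that sees the return value 1 could disable the component cleanly rather than continue with an empty address list.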

>
>> 3. If the exclude list does not contain 'lo', or the include list
>> contains 'lo', the job hangs when using multiple nodes:
>
> That's weird. Loopback interfaces should automatically be excluded right
> from the beginning. See opal/util/if.c.
>
> I don't know where things go wrong and haven't checked. Do you want to
> investigate? As already mentioned, this should not happen.
I took a quick glance at this file, and I'd be lying if I said I
understood what was going on in it. One thing I did notice is that the
parameter btl_tcp_if_exclude defaults to 'lo', but the user can of
course override it.

It might be worth looking into this further. If the user got an error, or
the job aborted, when they did something wrong with 'lo', I would not
worry about it at all. But the fact that it causes a hang worries me.
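
One way to turn the hang into an early error would be to check the user-supplied list before any connections are attempted. The following is a minimal sketch, assuming the parameter value is a plain comma-separated string of interface names. if_list_contains is a hypothetical helper, not an existing Open MPI function.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative helper only: return 1 if `name` appears as a complete
 * entry in the comma-separated `list` (e.g. "eth0,ib0,lo"), else 0. */
static int if_list_contains(const char *list, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        size_t entry_len = strcspn(p, ",");   /* length up to next comma or end */
        if (entry_len == name_len && strncmp(p, name, name_len) == 0) {
            return 1;
        }
        p += entry_len;
        if (*p == ',') {
            p++;                              /* skip the separator */
        }
    }
    return 0;
}
```

A caller could, for example, print a warning or abort when the exclude list no longer contains "lo", or when the include list does.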

>
> Can you post the output of "ip a s" or "ifconfig -a"?
It is at the end of the email.

>
>> However, the great news about this patch is that it appears to fix
>> https://svn.open-mpi.org/trac/ompi/ticket/1027 for me.
>
> It also fixes my #1206. I'd like to merge tmp-public/btl-tcp into the
> trunk, especially before the 1.3 code freeze. Any objections?
Not from me, especially now that it is already in the trunk :).

Tim

--
ifconfig -a:
eth0      Link encap:Ethernet  HWaddr 00:E0:81:2D:0B:08
           inet addr:129.79.240.101  Bcast:129.79.240.255  Mask:255.255.255.0
           inet6 addr: 2001:18e8:2:240:2e0:81ff:fe2d:b08/64 Scope:Global
           inet6 addr: fe80::2e0:81ff:fe2d:b08/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
           RX packets:555918407 errors:0 dropped:2122 overruns:0 frame:0
           TX packets:569928551 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:448936694980 (418.1 GiB)  TX bytes:486030858441 (452.6 GiB)
           Interrupt:193
eth1      Link encap:Ethernet  HWaddr 00:E0:81:2D:0B:09
           BROADCAST MULTICAST  MTU:1500  Metric:1
           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
           Interrupt:201
ib0       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
           inet addr:192.168.0.101  Bcast:192.168.0.255  Mask:255.255.255.0
           inet6 addr: fe80::202:c902:0:5d71/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
           RX packets:6304819 errors:0 dropped:0 overruns:0 frame:0
           TX packets:6355094 errors:0 dropped:2 overruns:0 carrier:0
           collisions:0 txqueuelen:128
           RX bytes:26794850321 (24.9 GiB)  TX bytes:35448899645 (33.0 GiB)
ib1       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
           BROADCAST MULTICAST  MTU:2044  Metric:1
           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:128
           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
lo        Link encap:Local Loopback
           inet addr:127.0.0.1  Mask:255.0.0.0
           inet6 addr: ::1/128 Scope:Host
           UP LOOPBACK RUNNING  MTU:16436  Metric:1
           RX packets:182055033 errors:0 dropped:0 overruns:0 frame:0
           TX packets:182055033 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:997605665018 (929.0 GiB)  TX bytes:997605665018 (929.0 GiB)
sit0      Link encap:IPv6-in-IPv4
           NOARP  MTU:1480  Metric:1
           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
ip a s:
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
     link/ether 00:e0:81:2d:0b:08 brd ff:ff:ff:ff:ff:ff
     inet 129.79.240.101/24 brd 129.79.240.255 scope global eth0
     inet6 2001:18e8:2:240:2e0:81ff:fe2d:b08/64 scope global dynamic
        valid_lft 2591721sec preferred_lft 604521sec
     inet6 fe80::2e0:81ff:fe2d:b08/64 scope link
        valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
     link/ether 00:e0:81:2d:0b:09 brd ff:ff:ff:ff:ff:ff
4: sit0: <NOARP> mtu 1480 qdisc noop
     link/sit 0.0.0.0 brd 0.0.0.0
5: ib0: <BROADCAST,MULTICAST,UP> mtu 65520 qdisc pfifo_fast qlen 128
     link/[32] 80:00:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:00:5d:71 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
     inet 192.168.0.101/24 brd 192.168.0.255 scope global ib0
     inet6 fe80::202:c902:0:5d71/64 scope link
        valid_lft forever preferred_lft forever
6: ib1: <BROADCAST,MULTICAST> mtu 2044 qdisc noop qlen 128
     link/[32] 80:00:04:05:fe:80:00:00:00:00:00:00:00:02:c9:02:00:00:5d:72 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
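
For reference, the UP and LOOPBACK flags in the listings above correspond to IFF_UP and IFF_LOOPBACK in <net/if.h>. The sketch below, which assumes a Linux or BSD system that provides getifaddrs(), shows one way to pick out entries that are up and not loopback. The function names are illustrative, and this is not the logic in opal/util/if.c.

```c
#define _DEFAULT_SOURCE          /* expose IFF_* flags under strict -std modes (glibc) */
#include <sys/types.h>
#include <sys/socket.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <stddef.h>

/* An entry is usable if the interface is up and is not a loopback device. */
static int flags_usable(unsigned int flags)
{
    return (flags & IFF_UP) != 0 && (flags & IFF_LOOPBACK) == 0;
}

/* Count getifaddrs() entries with a non-NULL address whose flags are usable.
 * getifaddrs() returns one entry per address (including link-layer entries
 * on Linux), so a single interface can be counted more than once.
 * Returns -1 if getifaddrs() fails. */
static int count_usable_entries(void)
{
    struct ifaddrs *head = NULL;
    int count = 0;

    if (getifaddrs(&head) != 0) {
        return -1;
    }
    for (struct ifaddrs *ifa = head; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr != NULL && flags_usable(ifa->ifa_flags)) {
            count++;
        }
    }
    freeifaddrs(head);
    return count;
}
```

With the flags shown above, eth0 and ib0 would pass this check, while lo (LOOPBACK) and eth1, ib1, and sit0 (not UP) would not.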