Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI over tcp
From: TERRY DONTJE (terry.dontje_at_[hidden])
Date: 2012-05-04 15:12:31


On 5/4/2012 1:17 PM, Don Armstrong wrote:
> On Fri, 04 May 2012, Rolf vandeVaart wrote:
>> On Behalf Of Don Armstrong
>>> On Thu, 03 May 2012, Rolf vandeVaart wrote:
>>>> 2. If that works, then you can also run with a debug switch to
>>>> see what connections are being made by MPI.
>>> You can see the connections being made in the attached log:
>>>
>>> [archimedes:29820] btl: tcp: attempting to connect() to [[60576,1],2] address
>>> 138.23.141.162 on port 2001
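
(For reference: connect() traces like the one above can usually be
produced by raising the BTL verbosity at run time. A minimal sketch,
where ./mpi_test is a placeholder for the actual application:

    mpirun -np 2 --mca btl_base_verbose 30 ./mpi_test

btl_base_verbose is a standard Open MPI MCA parameter; 30 is a
commonly used level that logs TCP connection attempts.)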
>> Yes, I missed that. So, can we simplify the problem. Can you run
>> with np=2 and one process on each node?
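
(A minimal sketch of that test, assuming the two nodes are archimedes
and a second host called node2 here as a placeholder, and forcing the
tcp and self transports so shared memory is out of the picture:

    mpirun -np 2 --host archimedes,node2 --mca btl tcp,self ./mpi_test

With one process per host, all traffic between the two ranks has to
go over TCP, which isolates the failing path.)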
> It hangs in exactly the same spot without completing the initial
> sm-based message. [Specifically, the GUID sending and acking appears
> to complete on the TCP connection, but the actual traffic is never
> sent, and the
> ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);
> never clears.]
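
(One way to confirm where a hung rank is blocked, assuming gdb is
installed on the node and mpi_test is the placeholder binary name, is
to attach to the process and dump all thread backtraces:

    gdb -p $(pgrep -n mpi_test)
    (gdb) thread apply all bt

A stack ending in ompi_request_wait_completion, as described above,
points at a send or receive that never finishes.)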
>
>> Also, maybe you can send the ifconfig output from each node. We
>> sometimes see this type of hanging when a node has two different
>> interfaces on the same subnet.
> 1: lo:<LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
> inet6 ::1/128 scope host
> valid_lft forever preferred_lft forever
> 2: eth0:<BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
> link/ether 00:30:48:7d:82:54 brd ff:ff:ff:ff:ff:ff
> inet 138.23.140.43/23 brd 138.23.141.255 scope global eth0
> inet 172.16.30.79/24 brd 172.16.30.255 scope global eth0:1
> inet6 fe80::230:48ff:fe7d:8254/64 scope link
> valid_lft forever preferred_lft forever
> 3: eth1:<NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN qlen 1000
> link/ether 00:30:48:7d:82:55 brd ff:ff:ff:ff:ff:ff
> inet6 fd74:56b0:69d6::2101/118 scope global
> valid_lft forever preferred_lft forever
> inet6 fe80::230:48ff:fe7d:8255/64 scope link
> valid_lft forever preferred_lft forever
> 16: tun0:<POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 100
> link/none
> inet 10.134.0.6/24 brd 10.134.0.255 scope global tun0
> 17: tun1:<POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 100
> link/none
> inet 10.137.0.201/24 brd 10.137.0.255 scope global tun1
>
> 1: lo:<LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
> inet6 ::1/128 scope host
> valid_lft forever preferred_lft forever
> 2: eth0:<BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
> link/ether 00:17:a4:4b:7c:ea brd ff:ff:ff:ff:ff:ff
> inet 172.16.30.110/24 brd 172.16.30.255 scope global eth0:1
> inet 138.23.141.162/23 brd 138.23.141.255 scope global eth0
> inet6 fe80::217:a4ff:fe4b:7cea/64 scope link
> valid_lft forever preferred_lft forever
> 3: eth1:<BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
> link/ether 00:17:a4:4b:7c:ec brd ff:ff:ff:ff:ff:ff
> 7: tun0:<POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 100
> link/none
> inet 10.134.0.26/24 brd 10.134.0.255 scope global tun0
>
>> Assuming there are multiple interfaces, can you experiment with the runtime flags outlined here?
>> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
> It's already running with btl_tcp_if_include=eth0 and
> oob_tcp_if_include=eth0; the connections are happening only on eth0,
> which has the 138.23.140.0/23 addresses.
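
(For completeness, those parameters are normally passed on the
command line like this; host and binary names are placeholders:

    mpirun -np 2 --host archimedes,node2 \
        --mca btl_tcp_if_include eth0 \
        --mca oob_tcp_if_include eth0 ./mpi_test

Some Open MPI versions also accept a CIDR subnet instead of an
interface name, e.g. --mca btl_tcp_if_include 138.23.140.0/23, which
sidesteps any ambiguity introduced by the eth0:1 alias.)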
Sorry if this is a stupid question, but what is eth0:1 (it's listed
under eth0)? Can the 172.16.30.x addresses ping each other?
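
(A quick check, using the alias addresses from the ip output above;
ping -I binds the source address on Linux:

    # from archimedes (eth0:1 = 172.16.30.79)
    ping -c 3 -I 172.16.30.79 172.16.30.110

If this fails while the 138.23.140.0/23 addresses work, the eth0:1
alias is a likely culprit for the hang.)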

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]