Open MPI User's Mailing List Archives

From: Jonathan Underwood (jonathan.underwood_at_[hidden])
Date: 2007-06-11 17:55:17


Hi,

I am seeing problems on a small Linux cluster when running Open MPI
jobs. The error message I get is:

[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=110

Following the FAQ, I looked to see what this error code corresponds to:

$ perl -e 'die$!=110'
Connection timed out at -e line 1.
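
(For reference, errno 110 is ETIMEDOUT; it can also be cross-checked
against the Linux error headers, though the exact header path varies
by distribution:)

$ grep -w ETIMEDOUT /usr/include/asm-generic/errno.h
#define ETIMEDOUT       110     /* Connection timed out */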

This error occurs the first time one of the compute nodes, which are
on a private network, attempts to send data to the frontend (where
the job was started with mpirun). In fact, it appears the error
occurs the first time a process on the frontend tries to send data to
another process on the frontend.
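
To see which addresses the TCP BTL is actually trying to use, I can
rerun with the BTL verbosity turned up, something like this (the
hostfile and executable names below are just placeholders for my
actual ones):

$ mpirun --mca btl tcp,self --mca btl_base_verbose 30 \
      -np 4 --hostfile hosts ./my_program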

I tried playing around with options such as --mca btl_tcp_if_exclude
lo,eth0, but that didn't help. Nothing in the FAQ section on TCP and
routing seemed to make a difference either.
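
For concreteness, the invocations were along these lines (the
btl_tcp_if_include form is the FAQ's other suggestion; hostfile and
executable names are placeholders):

$ mpirun --mca btl_tcp_if_exclude lo,eth0 -np 4 --hostfile hosts ./my_program
$ mpirun --mca btl_tcp_if_include eth1 -np 4 --hostfile hosts ./my_program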

Any advice would be very welcome.

The network configurations are:

a) frontend (2 network adapters, eth1 private for the cluster):

$ /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr 00:E0:81:30:A1:CE
          inet addr:128.40.5.39 Bcast:128.40.5.255 Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a1ce/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:3496038 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2833685 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:500939570 (477.7 MiB) TX bytes:671589665 (640.4 MiB)
          Interrupt:193

eth1 Link encap:Ethernet HWaddr 00:E0:81:30:A1:CF
          inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a1cf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:2201778 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2046572 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:188615778 (179.8 MiB) TX bytes:247305804 (235.8 MiB)
          Interrupt:201

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:1528 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1528 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:363101 (354.5 KiB) TX bytes:363101 (354.5 KiB)

$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth1
128.40.5.0      *               255.255.255.0   U     0      0        0 eth0
default         128.40.5.245    0.0.0.0         UG    0      0        0 eth0

b) Compute nodes:

$ /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr 00:E0:81:30:A0:72
          inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a072/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:189207 errors:0 dropped:0 overruns:0 frame:0
          TX packets:203507 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:23075241 (22.0 MiB) TX bytes:17693363 (16.8 MiB)
          Interrupt:193

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:185 errors:0 dropped:0 overruns:0 frame:0
          TX packets:185 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12644 (12.3 KiB) TX bytes:12644 (12.3 KiB)

$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
default         frontend.cluste 0.0.0.0         UG    0      0        0 eth0
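
(In case it is useful: basic reachability between the private
addresses can be sanity-checked with something like the following.)

$ ping -c 3 192.168.1.1    # from a compute node to the frontend's private interface
$ ping -c 3 192.168.1.2    # from the frontend to the first compute node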

TIA
Jonathan