
Open MPI User's Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2007-06-12 00:08:48


Jonathan,

It will be difficult to make this work in your configuration. The problem
is that on the head node the network interface that has to be used is
eth1, while on the compute nodes it is eth0. Therefore, btl_tcp_if_include
will not help ...
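(As a sketch of what a workable selection syntax might look like, suppose
btl_tcp_if_include could take an address/netmask instead of an interface
name; that syntax is an assumption here, not something the current stable
releases support. You would then name the private subnet, which is the
same on every node regardless of which interface carries it:

   $ mpirun --mca btl_tcp_if_include 192.168.1.0/24 -np 4 ./my_mpi_app

my_mpi_app and the process count are placeholders.)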

Now, if you only start processes on the compute nodes, you will not face
this problem. Right now, I think this is the safest approach.
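A minimal sketch of that approach, assuming your compute nodes are named
node1 and node2 (the hostnames and slot counts are placeholders for your
cluster):

   $ cat hostfile
   node1 slots=2
   node2 slots=2
   $ mpirun --hostfile hostfile -np 4 ./my_mpi_app

Since no rank runs on the head node, all MPI traffic stays on the
192.168.1.0/24 network over eth0, and the eth0/eth1 mismatch never comes
into play.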

We have a patch for this kind of problem, but it's not yet in the trunk.
I will let you know as soon as we commit it; until the patch makes its way
into a stable release, you will have to use the unstable version.

   Thanks,
     george.

On Mon, 11 Jun 2007, Jonathan Underwood wrote:

> Hi,
>
> I am seeing problems with a small linux cluster when running OpenMPI
> jobs. The error message I get is:
>
> [frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=110
>
> Following the FAQ, I looked to see what this error code corresponds to:
>
> $ perl -e 'die$!=110'
> Connection timed out at -e line 1.
>
> This error message occurs the first time one of the compute nodes,
> which are on a private network, attempts to send data to the frontend
> (from where the job was started with mpirun).
> In actual fact, it seems that the error occurs the first time a
> process on the frontend tries to send data to another process on the
> frontend.
>
> I tried to play about with things like --mca btl_tcp_if_exclude
> lo,eth0, but that didn't help matters. Nothing in the FAQ section on
> TCP and routing actually seemed to help.
>
>
> Any advice would be very welcome
>
>
> The network configurations are:
>
> a) frontend (2 network adapters, eth1 private for the cluster):
>
> $ /sbin/ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CE
>           inet addr:128.40.5.39  Bcast:128.40.5.255  Mask:255.255.255.0
>           inet6 addr: fe80::2e0:81ff:fe30:a1ce/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:3496038 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:2833685 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:500939570 (477.7 MiB)  TX bytes:671589665 (640.4 MiB)
>           Interrupt:193
>
> eth1      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CF
>           inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>           inet6 addr: fe80::2e0:81ff:fe30:a1cf/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:2201778 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:2046572 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:188615778 (179.8 MiB)  TX bytes:247305804 (235.8 MiB)
>           Interrupt:201
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:1528 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1528 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:363101 (354.5 KiB)  TX bytes:363101 (354.5 KiB)
>
>
>
> $ /sbin/route
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 192.168.1.0     *               255.255.255.0   U     0      0        0 eth1
> 128.40.5.0      *               255.255.255.0   U     0      0        0 eth0
> default         128.40.5.245    0.0.0.0         UG    0      0        0 eth0
>
>
>
> b) Compute nodes:
>
> $ /sbin/ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A0:72
>           inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
>           inet6 addr: fe80::2e0:81ff:fe30:a072/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:189207 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:203507 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:23075241 (22.0 MiB)  TX bytes:17693363 (16.8 MiB)
>           Interrupt:193
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:185 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:185 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:12644 (12.3 KiB)  TX bytes:12644 (12.3 KiB)
>
>
> $ /sbin/route
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
> default         frontend.cluste 0.0.0.0         UG    0      0        0 eth0
>
> TIS
> Jonathan
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

"We must accept finite disappointment, but we must never lose infinite
hope."
                                   Martin Luther King