
Subject: [OMPI users] Connection timed out with multiple nodes
From: Doug Roberts (roberpj_at_[hidden])
Date: 2014-01-17 19:59:36


1) When Open MPI programs run across multiple nodes they hang
almost immediately, as shown in the mpi_test example below (the
test program is a simple two-process send/receive ping-pong; a
reconstructed sketch follows the log). Note that I am assuming the
initial hwloc topology error message is a separate issue, since
single-node Open MPI jobs run just fine.

[roberpj_at_bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel/bin/mpirun \
    -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 \
    --mca btl_base_verbose 30 --debug-daemons --host bro127,bro128 ./a.out
Daemon was launched on bro128 - beginning to initialize
****************************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* object intersection without inclusion!
* Error occurred in topology.c line 594
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************
Daemon [[9945,0],1] checking in as pid 20978 on host bro128
[bro127:19340] [[9945,0],0] orted_cmd: received add_local_procs
[bro128:20978] [[9945,0],1] orted: up and running - waiting for commands!
[bro128:20978] [[9945,0],1] node[0].name bro127 daemon 0
[bro128:20978] [[9945,0],1] node[1].name bro128 daemon 1
[bro128:20978] [[9945,0],1] orted_cmd: received add_local_procs
   MPIR_being_debugged = 0
   MPIR_debug_state = 1
   MPIR_partial_attach_ok = 1
   MPIR_i_am_starter = 0
   MPIR_forward_output = 0
   MPIR_proctable_size = 2
   MPIR_proctable:
     (i, host, exe, pid) = (0, bro127, /home/roberpj/samples/mpi_test/./a.out, 19348)
     (i, host, exe, pid) = (1, bro128, /home/roberpj/samples/mpi_test/./a.out, 20979)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[bro128:20978] [[9945,0],1] orted_recv: received sync+nidmap from local proc [[9945,1],1]
[bro127:19340] [[9945,0],0] orted_recv: received sync+nidmap from local proc [[9945,1],0]
[bro128:20979] mca: base: components_open: Looking for btl components
[bro127:19348] mca: base: components_open: Looking for btl components
[bro128:20979] mca: base: components_open: opening btl components
[bro128:20979] mca: base: components_open: found loaded component self
[bro128:20979] mca: base: components_open: component self has no register function
[bro128:20979] mca: base: components_open: component self open function successful
[bro128:20979] mca: base: components_open: found loaded component sm
[bro128:20979] mca: base: components_open: component sm has no register function
[bro128:20979] mca: base: components_open: component sm open function successful
[bro128:20979] mca: base: components_open: found loaded component tcp
[bro128:20979] mca: base: components_open: component tcp register function successful
[bro128:20979] mca: base: components_open: component tcp open function successful
[bro127:19348] mca: base: components_open: opening btl components
[bro127:19348] mca: base: components_open: found loaded component self
[bro127:19348] mca: base: components_open: component self has no register function
[bro127:19348] mca: base: components_open: component self open function successful
[bro127:19348] mca: base: components_open: found loaded component sm
[bro127:19348] mca: base: components_open: component sm has no register function
[bro127:19348] mca: base: components_open: component sm open function successful
[bro127:19348] mca: base: components_open: found loaded component tcp
[bro127:19348] mca: base: components_open: component tcp register function successful
[bro127:19348] mca: base: components_open: component tcp open function successful
[bro128:20979] select: initializing btl component self
[bro128:20979] select: init of component self returned success
[bro128:20979] select: initializing btl component sm
[bro128:20979] select: init of component sm returned success
[bro128:20979] select: initializing btl component tcp
[bro128:20979] select: init of component tcp returned success
[bro127:19348] select: initializing btl component self
[bro127:19348] select: init of component self returned success
[bro127:19348] select: initializing btl component sm
[bro127:19348] select: init of component sm returned success
[bro127:19348] select: initializing btl component tcp
[bro127:19348] select: init of component tcp returned success
[bro127:19340] [[9945,0],0] orted_cmd: received message_local_procs
[bro128:20978] [[9945,0],1] orted_cmd: received message_local_procs
[bro127:19340] [[9945,0],0] orted_cmd: received message_local_procs
[bro128:20978] [[9945,0],1] orted_cmd: received message_local_procs
[bro127:19348] btl: tcp: attempting to connect() to address 10.27.2.128 on port 4
Number of processes = 2
Test repeated 3 times for reliability
[bro128:20979] btl: tcp: attempting to connect() to address 10.27.2.127 on port 4
[bro127:19348] btl: tcp: attempting to connect() to address 10.29.4.128 on port 4
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
P0: Waiting to receive from P1
I am process 1 on node bro128
P1: Waiting to receive from to P0
[bro127][[9945,1],0][../../../../../../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
^C
mpirun: killing job...
Killed by signal 2.
[bro127:19340] [[9945,0],0] orted_cmd: received exit cmd
[bro127:19340] [[9945,0],0] orted_cmd: received iof_complete cmd
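
For completeness, a minimal sketch of the test program (my
reconstruction from the output above, not the exact mpi_test source)
that is enough to reproduce the hang, built with mpicc:

/* Reconstructed sketch of the mpi_test ping-pong (illustrative). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len, run;
    char name[MPI_MAX_PROCESSOR_NAME];
    double buf = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    if (rank == 0) {
        printf("Number of processes = %d\n", size);
        printf("Test repeated 3 times for reliability\n");
    }
    printf("I am process %d on node %s\n", rank, name);

    for (run = 1; run <= 3; run++) {
        if (rank == 0) {
            /* Rank 0 sends first, then waits for the echo. */
            printf("Run %d of 3\n", run);
            printf("P0: Sending to P1\n");
            MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P1\n");
            MPI_Recv(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* Rank 1 receives, then echoes back. */
            printf("P1: Waiting to receive from P0\n");
            MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}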

2) The interfaces on the bro127 and bro128 compute nodes include a
1GbE network on eth0 and a high-speed 10GbE network on eth2, as follows:

[roberpj_at_bro127:~] ifconfig
eth0 Link encap:Ethernet HWaddr 00:E0:81:C7:A8:E3
           inet addr:10.27.2.127 Bcast:10.27.2.255 Mask:255.255.254.0

eth2 Link encap:Ethernet HWaddr 90:E2:BA:2D:83:F0
           inet addr:10.29.4.127 Bcast:10.29.63.255 Mask:255.255.192.0

lo Link encap:Local Loopback
           inet addr:127.0.0.1 Mask:255.0.0.0
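
To isolate which network is involved, the same job can be pinned to a
single interface at a time, along these lines (illustrative commands
using the same flags as above):

/opt/sharcnet/openmpi/1.6.5/intel/bin/mpirun -np 2 --mca btl tcp,sm,self \
    --mca btl_tcp_if_include eth0 --host bro127,bro128 ./a.out   # 1GbE only
/opt/sharcnet/openmpi/1.6.5/intel/bin/mpirun -np 2 --mca btl tcp,sm,self \
    --mca btl_tcp_if_include eth2 --host bro127,bro128 ./a.out   # 10GbE only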

3) Hostnames resolve, and I can connect between the 10.x addresses
using passwordless ssh on the internal networks ...

[roberpj_at_bro127:~] host bro127
bro127.brown.sharcnet has address 10.27.2.127
[roberpj_at_bro127:~] host bro128
bro128.brown.sharcnet has address 10.27.2.128
[roberpj_at_bro127:~] host ic-bro127
ic-bro127.brown.sharcnet has address 10.29.4.127
[roberpj_at_bro127:~] host ic-bro128
ic-bro128.brown.sharcnet has address 10.29.4.128

[roberpj_at_bro127:~] ssh bro128
[roberpj_at_bro128:~]
[roberpj_at_bro127:~] ssh ic-bro128
[roberpj_at_bro128:~]
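
Since the readv failure above is a timeout rather than a refused
connection, one MPI-independent check is whether an arbitrary high TCP
port is reachable between the nodes, e.g. with netcat (illustrative;
port 5000 chosen arbitrarily, and some netcat variants want
"nc -l -p 5000"):

[roberpj_at_bro128:~] nc -l 5000            # listen on bro128
[roberpj_at_bro127:~] nc bro128 5000        # connect over the 1GbE network
[roberpj_at_bro127:~] nc ic-bro128 5000     # connect over the 10GbE network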

4) I'm attaching the output file "ompi_info--all_bro127.out.bz2", created
by running the command "ompi_info --all >& ompi_info--all_bro127.out", in
case that helps. If anything else is needed, please let me know. Thank
you. -Doug