Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-11-29 09:40:11


Hi Bob

I'm afraid the person most familiar with the oob subsystem recently left the
project, so we are somewhat hampered at the moment. I don't recognize the
"Software caused connection abort" error message - it doesn't appear to be
one of ours (at least, I couldn't find it anywhere in our code base, though
I can't swear it isn't there in some dark corner), and I don't find it in my
own sys/errno.h file.
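
One way to check where such a message comes from is to ask the C library what a given errno number means on the machine that printed it. The sketch below is only an illustration (the helper name describe_errno is made up here); it uses Python's standard errno and os modules. On Linux, 103 is ECONNABORTED, whose strerror text is "Software caused connection abort", which would make this a system error string rather than an Open MPI one. The numbering differs on other platforms, so run it on the node that logged the error.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value on the current platform."""
    # errno.errorcode maps numeric codes to symbolic names for this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the C library's message text for the code.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 103 is the number printed in the log; its meaning is platform-specific.
    print(describe_errno(103))
```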

With those caveats, all I can say is that something appears to be blocking
the connection from your remote node back to the head node. Are you sure
both nodes are available on IPv4 (since you disabled IPv6)? Can you try
ssh'ing to the remote node and doing a ping to the head node using the IPv4
interface?

Do you have another method you could use to check and see if max14 will
accept connections from max15? If I interpret the error message correctly,
it looks like something in the connect handshake is being aborted. We try a
couple of times, but then give up and try other interfaces - since no other
interface is available, you get that other error message and we abort.
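
A quick connectivity check independent of Open MPI can be sketched as follows. This is an assumption-laden illustration, not anything from the Open MPI code base: the function name can_connect is hypothetical, and the port must be one where something is actually listening on the head node (the OOB port in the log is chosen at run time, so a fixed service port such as sshd's is easier to test). Note that in the log the local daemon reached the head node at 192.168.2.14, while max15 was trying 192.168.1.14, so it may be worth testing both addresses from max15.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0):
    """Attempt an IPv4 TCP connection.

    Returns (True, None) on success or (False, error text) on failure.
    """
    # AF_INET forces IPv4, matching a build configured with --disable-ipv6.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return True, None
    except OSError as exc:
        # Covers refused, aborted, unreachable, timed-out, and lookup errors.
        return False, f"{type(exc).__name__}: {exc}"
    finally:
        sock.close()
```

Run on max15, a call such as can_connect("192.168.1.14", 22) followed by can_connect("192.168.2.14", 22) would show whether either head-node address accepts TCP connections at all; port 22 assumes sshd is listening there.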

Sorry I can't be of more help - like I said, this is now a weak spot in our
coverage that needs to be rebuilt.

Ralph
 

On 11/28/07 2:41 PM, "Bob Soliday" <soliday_at_[hidden]> wrote:

> I am new to openmpi and have a problem that I cannot seem to solve.
> I am trying to run the hello_c example and I can't get it to work.
> I compiled openmpi with:
>
> ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 --with-openib
>
> The hostname file contains the local host and one other node. When I
> run it I get:
>
>
> [soliday_at_max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c
> [max14:31465] [0,0,0] accepting connections via event library
> [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
> [max14:31466] [0,0,1] accepting connections via event library
> [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0
> [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
> [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
> [max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
> Daemon [0,0,1] checking in as pid 31466 on host max14
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed!
> [max15:28222] OOB: Connection to HNP lost
> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> [max14:31465] ERROR: A daemon on node max15 failed to start as expected.
> [max14:31465] ERROR: There may be more information available from
> [max14:31465] ERROR: the remote shell (see above).
> [max14:31465] ERROR: The daemon exited unexpectedly with status 1.
> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [max14:31466] [0,0,1] orted_recv_pls: received exit
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: peer closed connection
> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_peer_close(0x523100) sd 6 state 4
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
> --------------------------------------------------------------------------
>
>
>
> I can see that the orted daemon program is starting on both computers,
> but it looks to me like they can't talk to each other.
>
> Here is the output from ifconfig on one of the nodes, the other node
> is similar.
>
> [root_at_max14 ~]# /sbin/ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A1
>           inet addr:192.168.2.14  Bcast:192.168.2.255  Mask:255.255.255.0
>           inet6 addr: fe80::217:31ff:fe9c:93a1/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1353 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:9572 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:188125 (183.7 KiB)  TX bytes:1500567 (1.4 MiB)
>           Interrupt:17
>
> eth1      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A2
>           inet addr:192.168.1.14  Bcast:192.168.1.255  Mask:255.255.255.0
>           inet6 addr: fe80::217:31ff:fe9c:93a2/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:49652796 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:49368158 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:21844618928 (20.3 GiB)  TX bytes:16122676331 (15.0 GiB)
>           Interrupt:19
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:82191 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:82191 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:7383491 (7.0 MiB)  TX bytes:7383491 (7.0 MiB)
>
>
> These machines routinely run mpich2 and mvapich2 programs so I don't
> suspect any problems with the gigabit or infiniband connections.
>
> Thanks,
> --Bob Soliday
>