Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
From: Bob Soliday (soliday_at_[hidden])
Date: 2007-11-29 11:34:48


I solved the problem by making a change to orte/mca/oob/tcp/oob_tcp_peer.c.

I have read that on Linux 2.6, after a failed connect system call, the
next call to connect can immediately return ECONNABORTED without actually
attempting to connect; the call after that then succeeds. So I changed
mca_oob_tcp_peer_try_connect to test for ECONNABORTED and call connect
again. The hello_c example is now working.
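
In case it helps anyone else, here is a minimal sketch of the idea. This
is not the actual oob_tcp_peer.c code (which also has to deal with
non-blocking sockets); the function name and error handling here are
made up:

    /* Sketch only: retry a connect() that fails with ECONNABORTED.
     * On Linux 2.6 the first connect() after a failed attempt can
     * abort spuriously; the immediately following attempt succeeds. */
    #include <errno.h>
    #include <sys/socket.h>

    static int connect_with_retry(int sd, const struct sockaddr *addr,
                                  socklen_t addrlen)
    {
        if (connect(sd, addr, addrlen) == 0) {
            return 0;                          /* connected */
        }
        if (errno == ECONNABORTED) {
            return connect(sd, addr, addrlen); /* one retry */
        }
        return -1;                             /* genuine failure */
    }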

I don't think this has solved the underlying cause of why connect is
failing in the first place, but at least now I can move on to the next
step. My best guess at the moment is that it is using eth0 initially when
I want it to use eth1. That attempt fails, and when it moves on to eth1
it runs into the "can't call connect right after it just failed" bug.

--Bob

Ralph H Castain wrote:
> Hi Bob
>
> I'm afraid the person most familiar with the oob subsystem recently left the
> project, so we are somewhat hampered at the moment. I don't recognize the
> "Software caused connection abort" error message - it doesn't appear to be
> one of ours (at least, I couldn't find it anywhere in our code base, though
> I can't swear it isn't there in some dark corner), and I don't find it in my
> own sys/errno.h file.
>
> With those caveats, all I can say is that something appears to be blocking
> the connection from your remote node back to the head node. Are you sure
> both nodes are available on IPv4 (since you disabled IPv6)? Can you try
> ssh'ing to the remote node and doing a ping to the head node using the IPv4
> interface?
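>
> For example, something along these lines, with the addresses taken from
> your ifconfig output:
>
>   ssh max15
>   ping -c 3 192.168.2.14    # head node via eth0
>   ping -c 3 192.168.1.14    # head node via eth1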
>
> Do you have another method you could use to check and see if max14 will
> accept connections from max15? If I interpret the error message correctly,
> it looks like something in the connect handshake is being aborted. We try a
> couple of times, but then give up and try other interfaces - since no other
> interface is available, you get that other error message and we abort.
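>
> One crude check would be to connect to the port mpirun is listening on
> while the job is still up (the port changes from run to run, so take it
> from the current debug output), e.g. from max15:
>
>   telnet 192.168.1.14 38852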
>
> Sorry I can't be more help - like I said, this is now a weak spot in our
> coverage that needs to be rebuilt.
>
> Ralph
>
>
>
> On 11/28/07 2:41 PM, "Bob Soliday" <soliday_at_[hidden]> wrote:
>
>> I am new to openmpi and have a problem that I cannot seem to solve.
>> I am trying to run the hello_c example and I can't get it to work.
>> I compiled openmpi with:
>>
>> ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 --with-openib
>>
>> The machine file (hostfile) contains the local host and one other
>> node. When I run it I get:
>>
>>
>> [soliday_at_max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c
>> [max14:31465] [0,0,0] accepting connections via event library
>> [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
>> [max14:31466] [0,0,1] accepting connections via event library
>> [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0
>> [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
>> [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
>> [max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
>> Daemon [0,0,1] checking in as pid 31466 on host max14
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed!
>> [max15:28222] OOB: Connection to HNP lost
>> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
>> [max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>> [max14:31465] ERROR: A daemon on node max15 failed to start as expected.
>> [max14:31465] ERROR: There may be more information available from
>> [max14:31465] ERROR: the remote shell (see above).
>> [max14:31465] ERROR: The daemon exited unexpectedly with status 1.
>> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
>> [max14:31466] [0,0,1] orted_recv_pls: received exit
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
>> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: peer closed connection
>> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_peer_close(0x523100) sd 6 state 4
>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>> --------------------------------------------------------------------------
>>
>>
>>
>> I can see that the orted daemon program is starting on both computers,
>> but it looks to me like they can't talk to each other.
>>
>> Here is the output from ifconfig on one of the nodes; the other node
>> is similar.
>>
>> [root_at_max14 ~]# /sbin/ifconfig
>> eth0      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A1
>>           inet addr:192.168.2.14  Bcast:192.168.2.255  Mask:255.255.255.0
>>           inet6 addr: fe80::217:31ff:fe9c:93a1/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:1353 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:9572 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:188125 (183.7 KiB)  TX bytes:1500567 (1.4 MiB)
>>           Interrupt:17
>>
>> eth1      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A2
>>           inet addr:192.168.1.14  Bcast:192.168.1.255  Mask:255.255.255.0
>>           inet6 addr: fe80::217:31ff:fe9c:93a2/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:49652796 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:49368158 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:21844618928 (20.3 GiB)  TX bytes:16122676331 (15.0 GiB)
>>           Interrupt:19
>>
>> lo        Link encap:Local Loopback
>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>           inet6 addr: ::1/128 Scope:Host
>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>           RX packets:82191 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:82191 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:7383491 (7.0 MiB)  TX bytes:7383491 (7.0 MiB)
>>
>>
>> These machines routinely run mpich2 and mvapich2 programs, so I don't
>> suspect any problems with the gigabit or InfiniBand connections.
>>
>> Thanks,
>> --Bob Soliday
>>