
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Connection timed out with multiple nodes
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-23 23:16:27


It's the failure on readv that's the source of the trouble. What happens if you if_include only eth2? Does it work then?
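
For example, adapting the command from your log (same build path and hosts), something like:

  /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth2 --host bro127,bro128 ./a.out

If that also hangs with the readv timeout, that would point at the eth2/10g path itself rather than at mixing the two interfaces.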

On Jan 23, 2014, at 5:38 PM, Doug Roberts <roberpj_at_[hidden]> wrote:

>
>> Date: Fri, 17 Jan 2014 19:24:50 -0800
>> From: Ralph Castain <rhc_at_[hidden]>
>>
>> The most common cause of this problem is a firewall between the
>> nodes - you can ssh across, but not communicate. Have you checked
>> to see that the firewall is turned off?
>
> It turns out some iptables rules (typical on our clusters) were active.
> They are now turned off for continued testing, as suggested. I have
> rerun the mpi_test code, this time using a debug-enabled build of
> openmpi/1.6.5, still with the Intel compiler.
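>
> (On a typical Red Hat-style setup, turning the rules off amounts to
>
>   service iptables stop       # flush and unload the rules immediately
>   chkconfig iptables off      # keep them from coming back after a reboot
>
> run as root on each node; the empty ACCEPT chains from "iptables --list"
> further below show that nothing is filtering now.)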
>
> As shown below, the problem is still there. I'm including some gdb
> output this time. The job succeeds using only eth0 over 1g, but hangs
> almost immediately when the eth2 10g interface is included. Any further
> suggestions would be greatly appreciated.
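>
> (A plain TCP connection over the 10g addresses, independent of Open MPI,
> should show whether the eth2 path can carry traffic at all; for example
> "ssh 10.29.4.128 hostname" from bro127, where 10.29.4.128 appears to be
> bro128's eth2 address judging by the connect() attempts in the verbose
> log below.)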
>
> [roberpj_at_bro127:~/samples/mpi_test] mpicc -g mpi_test.c
>
> o Using eth0 only:
>
> [roberpj_at_bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0 --host bro127,bro128 ./a.out
> Number of processes = 2
> Test repeated 3 times for reliability
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> P0: Received from to P1
> Run 2 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> P1: Sending to to P0
> P1: Waiting to receive from to P0
> P0: Received from to P1
> Run 3 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> P1: Sending to to P0
> P1: Waiting to receive from to P0
> P1: Sending to to P0
> P1: Done
> P0: Received from to P1
> P0: Done
>
> o Using eth0,eth2:
>
> [roberpj_at_bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 --host bro127,bro128 ./a.out
> Number of processes = 2
> Test repeated 3 times for reliability
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> ^Cmpirun: killing job...
>
> o Using eth0,eth2 with verbosity:
>
> [roberpj_at_bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 --mca btl_base_verbose 100 --host bro127,bro128 ./a.out
> [bro127:20157] mca: base: components_open: Looking for btl components
> [bro127:20157] mca: base: components_open: opening btl components
> [bro127:20157] mca: base: components_open: found loaded component self
> [bro127:20157] mca: base: components_open: component self has no register function
> [bro127:20157] mca: base: components_open: component self open function successful
> [bro127:20157] mca: base: components_open: found loaded component sm
> [bro127:20157] mca: base: components_open: component sm has no register function
> [bro128:23354] mca: base: components_open: Looking for btl components
> [bro127:20157] mca: base: components_open: component sm open function successful
> [bro127:20157] mca: base: components_open: found loaded component tcp
> [bro127:20157] mca: base: components_open: component tcp register function successful
> [bro127:20157] mca: base: components_open: component tcp open function successful
> [bro128:23354] mca: base: components_open: opening btl components
> [bro128:23354] mca: base: components_open: found loaded component self
> [bro128:23354] mca: base: components_open: component self has no register function
> [bro128:23354] mca: base: components_open: component self open function successful
> [bro128:23354] mca: base: components_open: found loaded component sm
> [bro128:23354] mca: base: components_open: component sm has no register function
> [bro128:23354] mca: base: components_open: component sm open function successful
> [bro128:23354] mca: base: components_open: found loaded component tcp
> [bro128:23354] mca: base: components_open: component tcp register function successful
> [bro128:23354] mca: base: components_open: component tcp open function successful
> [bro127:20157] select: initializing btl component self
> [bro127:20157] select: init of component self returned success
> [bro127:20157] select: initializing btl component sm
> [bro127:20157] select: init of component sm returned success
> [bro127:20157] select: initializing btl component tcp
> [bro127:20157] select: init of component tcp returned success
> [bro128:23354] select: initializing btl component self
> [bro128:23354] select: init of component self returned success
> [bro128:23354] select: initializing btl component sm
> [bro128:23354] select: init of component sm returned success
> [bro128:23354] select: initializing btl component tcp
> [bro128:23354] select: init of component tcp returned success
> [bro127:20157] btl: tcp: attempting to connect() to address 10.27.2.128 on port 4
> Number of processes = 2
> Test repeated 3 times for reliability
> [bro128:23354] btl: tcp: attempting to connect() to address 10.27.2.127 on port 4
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> [bro127:20157] btl: tcp: attempting to connect() to address 10.29.4.128 on port 4
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> [bro127][[9184,1],0][../../../../../../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> ^C mpirun: killing job...
>
> o Master node bro127 debugging info:
>
> [roberpj_at_bro127:~] gdb -p 21067
> (gdb) bt
> #0 0x00002ac7ae4a86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1 0x00002ac7acc3dedc in epoll_dispatch (base=0x3, arg=0x1916850, tv=0x20) at ../../../../openmpi-1.6.5/opal/event/epoll.c:215
> #2 0x00002ac7acc3f276 in opal_event_base_loop (base=0x3, flags=26306640) at ../../../../openmpi-1.6.5/opal/event/event.c:838
> #3 0x00002ac7acc3f122 in opal_event_loop (flags=3) at ../../../../openmpi-1.6.5/opal/event/event.c:766
> #4 0x00002ac7acc82c14 in opal_progress () at ../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
> #5 0x00002ac7b21a8c40 in mca_pml_ob1_recv (addr=0x3, count=26306640, datatype=0x20, src=-1, tag=0, comm=0x80000, status=0x7fff15ad5f38)
> at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
> #6 0x00002ac7acb830f7 in PMPI_Recv (buf=0x3, count=26306640, type=0x20, source=-1, tag=0, comm=0x80000, status=0x4026e0) at precv.c:78
> #7 0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at mpi_test.c:72
> (gdb) frame 7
> #7 0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at mpi_test.c:72
> 72 MPI_Recv(&A[0], M, MPI_DOUBLE, procs-1, msgid, MPI_COMM_WORLD, &stat);
> (gdb)
>
> confirming ...
> [root_at_bro127:~] iptables --list
> Chain INPUT (policy ACCEPT)
> target prot opt source destination
>
> Chain FORWARD (policy ACCEPT)
> target prot opt source destination
>
> Chain OUTPUT (policy ACCEPT)
> target prot opt source destination
>
> o Slave node bro128 debugging info:
>
> [roberpj_at_bro128:~] top -u roberpj
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 24334 roberpj 20 0 115m 5208 3216 R 100.0 0.0 2:32.12 a.out
>
> [roberpj_at_bro128:~] gdb -p 24334
> (gdb) bt
> #0 0x00002b7475cc86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1 0x00002b747445dedc in epoll_dispatch (base=0x3, arg=0x9b6850, tv=0x20) at ../../../../openmpi-1.6.5/opal/event/epoll.c:215
> #2 0x00002b747445f276 in opal_event_base_loop (base=0x3, flags=10184784) at ../../../../openmpi-1.6.5/opal/event/event.c:838
> #3 0x00002b747445f122 in opal_event_loop (flags=3) at ../../../../openmpi-1.6.5/opal/event/event.c:766
> #4 0x00002b74744a2c14 in opal_progress () at ../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
> #5 0x00002b74799c8c40 in mca_pml_ob1_recv (addr=0x3, count=10184784, datatype=0x20, src=-1, tag=10899040, comm=0x0, status=0x7fff1ce5e778)
> at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
> #6 0x00002b74743a30f7 in PMPI_Recv (buf=0x3, count=10184784, type=0x20, source=-1, tag=10899040, comm=0x0, status=0x4026e0) at precv.c:78
> #7 0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at mpi_test.c:76
> (gdb) frame 7
> #7 0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at mpi_test.c:76
> 76 MPI_Recv(&A[0], M, MPI_DOUBLE, myid-1, msgid, MPI_COMM_WORLD, &stat);
> (gdb)
>
> confirming ...
> [root_at_bro128:~] iptables --list
> Chain INPUT (policy ACCEPT)
> target prot opt source destination
>
> Chain FORWARD (policy ACCEPT)
> target prot opt source destination
>
> Chain OUTPUT (policy ACCEPT)
> target prot opt source destination
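>
> For reference, the communication pattern in mpi_test.c is essentially a
> two-process ring exchange repeated three times. The sketch below is
> reconstructed from the program output and the two MPI_Recv calls visible
> in the backtraces (mpi_test.c:72 and :76); it is not the verbatim source,
> M and msgid are placeholders, and the progress printfs are omitted:
>
> #include <mpi.h>
>
> #define M     25000   /* message length: placeholder, not the real value */
> #define NRUNS 3       /* "Test repeated 3 times for reliability" */
>
> int main(int argc, char *argv[])
> {
>     static double A[M];
>     MPI_Status stat;
>     int myid, procs, run, msgid = 0;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &myid);
>     MPI_Comm_size(MPI_COMM_WORLD, &procs);
>
>     for (run = 1; run <= NRUNS; run++) {
>         if (myid == 0) {
>             /* rank 0 starts the ring, then waits for it to come back around */
>             MPI_Send(&A[0], M, MPI_DOUBLE, 1, msgid, MPI_COMM_WORLD);
>             MPI_Recv(&A[0], M, MPI_DOUBLE, procs - 1, msgid, MPI_COMM_WORLD, &stat);  /* cf. mpi_test.c:72 */
>         } else {
>             /* other ranks wait for their predecessor, then pass the buffer on */
>             MPI_Recv(&A[0], M, MPI_DOUBLE, myid - 1, msgid, MPI_COMM_WORLD, &stat);   /* cf. mpi_test.c:76 */
>             MPI_Send(&A[0], M, MPI_DOUBLE, (myid + 1) % procs, msgid, MPI_COMM_WORLD);
>         }
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
> With if_include limited to eth0 the ring completes; once eth2 is included,
> both ranks end up stuck in the MPI_Recv calls above, apparently because
> the eth2 connection never delivers the data (hence the readv timeout).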
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users