Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Connection timed out with multiple nodes
From: Doug Roberts (roberpj_at_[hidden])
Date: 2014-02-26 17:33:00


o I should report that there has been an important development in this
problem, before anyone spends time on my previous post. We got the
original test program to run without hanging by directly connecting the
two test compute nodes together (thus bypassing the switch), as shown
here, where eth2 is still the 10G interface:

[roberpj_at_bro127:~/samples/openmpi/mpi_test]
/opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl
tcp,sm,self --mca btl_tcp_if_include eth2 --host bro127,bro128 ./a.out
Number of processes = 2
Test repeated 3 times for reliability
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
Run 2 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
Run 3 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
P0: Done
I am process 1 on node bro128
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Done
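
For reference, the test itself is just a simple ping-pong. A minimal
sketch with the same structure as the output above would look roughly
like the following (the actual a.out source is not included in this
thread, so this is an assumed reconstruction, not the exact program):

/* Hypothetical reconstruction of the ping-pong test; the real a.out
 * source is not shown in this thread, so names and the message size
 * are guesses. Run with: mpirun -np 2 ... ./a.out */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define RUNS   3
#define MSGLEN 1024

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    char buf[MSGLEN];
    MPI_Status status;

    memset(buf, 0, sizeof(buf));
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    if (rank == 0) {
        printf("Number of processes = %d\n", size);
        printf("Test repeated %d times for reliability\n", RUNS);
    }
    printf("I am process %d on node %s\n", rank, name);

    for (int run = 1; run <= RUNS; run++) {
        if (rank == 0) {
            /* rank 0 sends first, then waits for the echo */
            printf("Run %d of %d\n", run, RUNS);
            printf("P0: Sending to P1\n");
            MPI_Send(buf, MSGLEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P1\n");
            MPI_Recv(buf, MSGLEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            /* rank 1 waits, then echoes the message back */
            printf("P1: Waiting to receive from P0\n");
            MPI_Recv(buf, MSGLEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            printf("P1: Sending to P0\n");
            MPI_Send(buf, MSGLEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    printf("P%d: Done\n", rank);
    MPI_Finalize();
    return 0;
}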

o This now points to the Netgear XSM7224S 10G switch. Its firmware
version turns out to be slightly old at 9.0.1.14, so we will update it
to the latest 9.0.1.29 and then run the test again. I will report back
the result. In the meantime, if anyone knows of configuration
setting(s) in the switch that could block Open MPI message passing,
please reply to this thread. Tx!

---------- Forwarded message ----------
Date: Tue, 25 Feb 2014 20:07:31 -0500 (EST)
From: Doug Roberts <roberpj_at_[hidden]>
To: users_at_[hidden]
Subject: Re: [OMPI users] Connection timed out with multiple nodes

Hello again. The "oob_stress" program runs cleanly on each of the two
test nodes, bro127 and bro128, as shown below. Would you say this rules
out a problem with the network and switch, or are there other test
programs that should be run next?

o eth0 and eth2: without plm_base_verbose

[roberpj_at_bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca
oob_tcp_if_include eth0 ./oob_stress
[bro127:02020] Ring 1 message size 10 bytes
[bro127:02020] [[27318,1],0] Ring 1 completed
[bro127:02020] Ring 2 message size 100 bytes
[bro127:02020] [[27318,1],0] Ring 2 completed
[bro127:02020] Ring 3 message size 1000 bytes
[bro127:02020] [[27318,1],0] Ring 3 completed
[roberpj_at_bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca
oob_tcp_if_include eth2 ./oob_stress
[bro127:02022] Ring 1 message size 10 bytes
[bro127:02022] [[27312,1],0] Ring 1 completed
[bro127:02022] Ring 2 message size 100 bytes
[bro127:02022] [[27312,1],0] Ring 2 completed
[bro127:02022] Ring 3 message size 1000 bytes
[bro127:02022] [[27312,1],0] Ring 3 completed

[roberpj_at_bro128:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca
oob_tcp_if_include eth0 ./oob_stress
[bro128:04484] Ring 1 message size 10 bytes
[bro128:04484] [[23046,1],0] Ring 1 completed
[bro128:04484] Ring 2 message size 100 bytes
[bro128:04484] [[23046,1],0] Ring 2 completed
[bro128:04484] Ring 3 message size 1000 bytes
[bro128:04484] [[23046,1],0] Ring 3 completed
[roberpj_at_bro128:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca
oob_tcp_if_include eth2 ./oob_stress
[bro128:04486] Ring 1 message size 10 bytes
[bro128:04486] [[23040,1],0] Ring 1 completed
[bro128:04486] Ring 2 message size 100 bytes
[bro128:04486] [[23040,1],0] Ring 2 completed
[bro128:04486] Ring 3 message size 1000 bytes
[bro128:04486] [[23040,1],0] Ring 3 completed

o eth2: with plm_base_verbose on

[roberpj_at_bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca
oob_tcp_if_include eth2 -mca plm_base_verbose 5 ./oob_stress
[bro127:01936] mca:base:select:( plm) Querying component [rsh]
[bro127:01936] [[INVALID],INVALID] plm:base:rsh_lookup on agent ssh : rsh path
NULL
[bro127:01936] mca:base:select:( plm) Query of component [rsh] set priority to
10
[bro127:01936] mca:base:select:( plm) Querying component [slurm]
[bro127:01936] mca:base:select:( plm) Skipping component [slurm]. Query failed
to return a module
[bro127:01936] mca:base:select:( plm) Querying component [tm]
[bro127:01936] mca:base:select:( plm) Skipping component [tm]. Query failed to
return a module
[bro127:01936] mca:base:select:( plm) Selected component [rsh]
[bro127:01936] plm:base:set_hnp_name: initial bias 1936 nodename hash
3261509427
[bro127:01936] plm:base:set_hnp_name: final jobfam 27333
[bro127:01936] [[27333,0],0] plm:base:rsh_setup on agent ssh : rsh path NULL
[bro127:01936] [[27333,0],0] plm:base:receive start comm
[bro127:01936] released to spawn
[bro127:01936] [[27333,0],0] plm:base:setup_job for job [INVALID]
[bro127:01936] [[27333,0],0] plm:rsh: launching job [27333,1]
[bro127:01936] [[27333,0],0] plm:rsh: no new daemons to launch
[bro127:01936] [[27333,0],0] plm:base:launch_apps for job [27333,1]
[bro127:01936] [[27333,0],0] plm:base:report_launched for job [27333,1]
[bro127:01936] [[27333,0],0] plm:base:app_report_launch from daemon
[[27333,0],0]
[bro127:01936] [[27333,0],0] plm:base:app_report_launched for proc
[[27333,1],0] from daemon [[27333,0],0]: pid 1937 state 4 exit 0
[bro127:01936] [[27333,0],0] plm:base:app_report_launch completed processing
[bro127:01936] [[27333,0],0] plm:base:report_launched all apps reported
[bro127:01936] [[27333,0],0] plm:base:launch wiring up iof
[bro127:01936] [[27333,0],0] plm:base:launch completed for job [27333,1]
[bro127:01936] completed spawn for job [27333,1]
[bro127:01937] Ring 1 message size 10 bytes
[bro127:01937] [[27333,1],0] Ring 1 completed
[bro127:01937] Ring 2 message size 100 bytes
[bro127:01937] [[27333,1],0] Ring 2 completed
[bro127:01937] Ring 3 message size 1000 bytes
[bro127:01937] [[27333,1],0] Ring 3 completed
[bro127:01936] [[27333,0],0] plm:base:receive processing msg
[bro127:01936] [[27333,0],0] plm:base:receive update proc state command
[bro127:01936] [[27333,0],0] plm:base:receive got update_proc_state for job
[27333,1]
[bro127:01936] [[27333,0],0] plm:base:receive got update_proc_state for vpid 0
state 80 exit_code 0
[bro127:01936] [[27333,0],0] plm:base:receive updating state for proc
[[27333,1],0] current state 10 new state 80
[bro127:01936] [[27333,0],0] plm:base:check_job_completed for job [27333,1] -
num_terminated 1 num_procs 1
[bro127:01936] [[27333,0],0] plm:base:check_job_completed declared job
[27333,1] normally terminated - checking all jobs
[bro127:01936] [[27333,0],0] releasing procs from node bro127
[bro127:01936] [[27333,0],0] releasing proc [[27333,1],0] from node bro127
[bro127:01936] [[27333,0],0] plm:base:check_job_completed all jobs terminated -
waking up
[bro127:01936] [[27333,0],0] plm:base:orted_cmd sending orted_exit commands
[bro127:01936] [[27333,0],0] plm:base:receive stop comm
[bro127:01936] [[27333,0],0] plm:base:local:slave:finalize

[roberpj_at_bro128:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca
oob_tcp_if_include eth2 -mca plm_base_verbose 5 ./oob_stress
[bro128:04462] mca:base:select:( plm) Querying component [rsh]
[bro128:04462] [[INVALID],INVALID] plm:base:rsh_lookup on agent ssh : rsh path
NULL
[bro128:04462] mca:base:select:( plm) Query of component [rsh] set priority to
10
[bro128:04462] mca:base:select:( plm) Querying component [slurm]
[bro128:04462] mca:base:select:( plm) Skipping component [slurm]. Query failed
to return a module
[bro128:04462] mca:base:select:( plm) Querying component [tm]
[bro128:04462] mca:base:select:( plm) Skipping component [tm]. Query failed to
return a module
[bro128:04462] mca:base:select:( plm) Selected component [rsh]
[bro128:04462] plm:base:set_hnp_name: initial bias 4462 nodename hash 186663077
[bro128:04462] plm:base:set_hnp_name: final jobfam 23275
[bro128:04462] [[23275,0],0] plm:base:rsh_setup on agent ssh : rsh path NULL
[bro128:04462] [[23275,0],0] plm:base:receive start comm
[bro128:04462] released to spawn
[bro128:04462] [[23275,0],0] plm:base:setup_job for job [INVALID]
[bro128:04462] [[23275,0],0] plm:rsh: launching job [23275,1]
[bro128:04462] [[23275,0],0] plm:rsh: no new daemons to launch
[bro128:04462] [[23275,0],0] plm:base:launch_apps for job [23275,1]
[bro128:04462] [[23275,0],0] plm:base:report_launched for job [23275,1]
[bro128:04462] [[23275,0],0] plm:base:app_report_launch from daemon
[[23275,0],0]
[bro128:04462] [[23275,0],0] plm:base:app_report_launched for proc
[[23275,1],0] from daemon [[23275,0],0]: pid 4463 state 4 exit 0
[bro128:04462] [[23275,0],0] plm:base:app_report_launch completed processing
[bro128:04462] [[23275,0],0] plm:base:report_launched all apps reported
[bro128:04462] [[23275,0],0] plm:base:launch wiring up iof
[bro128:04462] [[23275,0],0] plm:base:launch completed for job [23275,1]
[bro128:04462] completed spawn for job [23275,1]
[bro128:04463] Ring 1 message size 10 bytes
[bro128:04463] [[23275,1],0] Ring 1 completed
[bro128:04463] Ring 2 message size 100 bytes
[bro128:04463] [[23275,1],0] Ring 2 completed
[bro128:04463] Ring 3 message size 1000 bytes
[bro128:04463] [[23275,1],0] Ring 3 completed
[bro128:04462] [[23275,0],0] plm:base:receive processing msg
[bro128:04462] [[23275,0],0] plm:base:receive update proc state command
[bro128:04462] [[23275,0],0] plm:base:receive got update_proc_state for job
[23275,1]
[bro128:04462] [[23275,0],0] plm:base:receive got update_proc_state for vpid 0
state 80 exit_code 0
[bro128:04462] [[23275,0],0] plm:base:receive updating state for proc
[[23275,1],0] current state 10 new state 80
[bro128:04462] [[23275,0],0] plm:base:check_job_completed for job [23275,1] -
num_terminated 1 num_procs 1
[bro128:04462] [[23275,0],0] plm:base:check_job_completed declared job
[23275,1] normally terminated - checking all jobs
[bro128:04462] [[23275,0],0] releasing procs from node bro128
[bro128:04462] [[23275,0],0] releasing proc [[23275,1],0] from node bro128
[bro128:04462] [[23275,0],0] plm:base:check_job_completed all jobs terminated -
waking up
[bro128:04462] [[23275,0],0] plm:base:orted_cmd sending orted_exit commands
[bro128:04462] [[23275,0],0] plm:base:receive stop comm
[bro128:04462] [[23275,0],0] plm:base:local:slave:finalize

---------- Forwarded message ----------
Date: Fri, 31 Jan 2014 13:55:41 -0800
From: Ralph Castain <rhc_at_[hidden]>
Reply-To: Open MPI Users <users_at_[hidden]>
To: Open MPI Users <users_at_[hidden]>
Subject: Re: [OMPI users] Connection timed out with multiple nodes

The only relevant parts are from the application procs - orterun and the orted
don't participate in this exchange and never see the BTLs anyway.

It looks like there is just something blocking data transfer across eth2 for
some reason. I'm afraid I have no idea why - can you run a standard (i.e.,
non-MPI) test across it?

For example, I have an oob_stress program in orte/test/system. Try running it:

mpirun -npernode 1 -mca oob_tcp_if_include eth2 ./oob_stress

and see if anything works. If the out-of-band can't communicate, this won't
even start - it'll just hang. If you configure OMPI with --enable-debug, you can
add -mca plm_base_verbose 5 to watch the launch operation and see if the remote
daemon is able to respond.

My guess is that the answer will be "no" and that this will hang, but that
would tell us the problem is in the network and not in the TCP BTL.
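
As a concrete form of the "standard (non-MPI) test" suggested above, a
minimal raw-TCP transfer program like the sketch below could be run
across eth2. The program name, IP addresses, and port here are
placeholders for illustration, not values from this thread:

/* Minimal raw-TCP transfer test, independent of Open MPI. Binding
 * each end to its local eth2 address forces the traffic onto the 10G
 * path. Compile with: cc -o tcptest tcptest.c (name is a placeholder). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static char buf[1 << 20];  /* 1 MB chunk, zero-initialized */

int main(int argc, char **argv)
{
    struct sockaddr_in local, peer;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    memset(&local, 0, sizeof(local));
    memset(&peer, 0, sizeof(peer));
    local.sin_family = peer.sin_family = AF_INET;

    if (argc == 4 && strcmp(argv[1], "server") == 0) {
        /* usage: ./tcptest server <local-eth2-ip> <port> */
        local.sin_addr.s_addr = inet_addr(argv[2]);
        local.sin_port = htons(atoi(argv[3]));
        if (bind(s, (struct sockaddr *)&local, sizeof(local)) != 0 ||
            listen(s, 1) != 0) { perror("server setup"); return 1; }
        int c = accept(s, NULL, NULL);
        ssize_t n;
        long long total = 0;
        while ((n = read(c, buf, sizeof(buf))) > 0)  /* drain until EOF */
            total += n;
        printf("server: received %lld bytes\n", total);
        close(c);
    } else if (argc == 5 && strcmp(argv[1], "client") == 0) {
        /* usage: ./tcptest client <local-eth2-ip> <server-eth2-ip> <port> */
        local.sin_addr.s_addr = inet_addr(argv[2]);  /* pin to eth2 */
        if (bind(s, (struct sockaddr *)&local, sizeof(local)) != 0) {
            perror("bind"); return 1;
        }
        peer.sin_addr.s_addr = inet_addr(argv[3]);
        peer.sin_port = htons(atoi(argv[4]));
        if (connect(s, (struct sockaddr *)&peer, sizeof(peer)) != 0) {
            perror("connect"); return 1;
        }
        for (int i = 0; i < 100; i++)  /* ~100 MB total */
            if (write(s, buf, sizeof(buf)) < 0) { perror("write"); return 1; }
        printf("client: done sending\n");
    } else {
        fprintf(stderr, "usage: %s server <ip> <port> | "
                        "client <local-ip> <server-ip> <port>\n", argv[0]);
        return 1;
    }
    close(s);
    return 0;
}

Run the server on one node bound to its eth2 address, then the client
on the other. If the transfer stalls when the nodes go through the
switch but completes over the direct link, that would point at the
network path rather than at Open MPI.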