Subject: Re: [OMPI users] Connection timed out with multiple nodes
From: Doug Roberts (roberpj_at_[hidden])
Date: 2014-02-25 20:07:31


Hello again. The "oob_stress" program runs cleanly on each of
the two test nodes, bro127 and bro128, as shown below. Would you
say this rules out a problem with the network and switch, or is
there another test program that should be run next? (A cross-node
variant of the test is sketched after the output below.)

o eth0 and eth2: without plm_base_verbose

[roberpj_at_bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth0 ./oob_stress
[bro127:02020] Ring 1 message size 10 bytes
[bro127:02020] [[27318,1],0] Ring 1 completed
[bro127:02020] Ring 2 message size 100 bytes
[bro127:02020] [[27318,1],0] Ring 2 completed
[bro127:02020] Ring 3 message size 1000 bytes
[bro127:02020] [[27318,1],0] Ring 3 completed
[roberpj_at_bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth2 ./oob_stress
[bro127:02022] Ring 1 message size 10 bytes
[bro127:02022] [[27312,1],0] Ring 1 completed
[bro127:02022] Ring 2 message size 100 bytes
[bro127:02022] [[27312,1],0] Ring 2 completed
[bro127:02022] Ring 3 message size 1000 bytes
[bro127:02022] [[27312,1],0] Ring 3 completed

[roberpj_at_bro128:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth0 ./oob_stress
[bro128:04484] Ring 1 message size 10 bytes
[bro128:04484] [[23046,1],0] Ring 1 completed
[bro128:04484] Ring 2 message size 100 bytes
[bro128:04484] [[23046,1],0] Ring 2 completed
[bro128:04484] Ring 3 message size 1000 bytes
[bro128:04484] [[23046,1],0] Ring 3 completed
[roberpj_at_bro128:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth2 ./oob_stress
[bro128:04486] Ring 1 message size 10 bytes
[bro128:04486] [[23040,1],0] Ring 1 completed
[bro128:04486] Ring 2 message size 100 bytes
[bro128:04486] [[23040,1],0] Ring 2 completed
[bro128:04486] Ring 3 message size 1000 bytes
[bro128:04486] [[23040,1],0] Ring 3 completed

o eth2: with plm_base_verbose on

[roberpj_at_bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth2 -mca plm_base_verbose 5 ./oob_stress
[bro127:01936] mca:base:select:( plm) Querying component [rsh]
[bro127:01936] [[INVALID],INVALID] plm:base:rsh_lookup on agent ssh : rsh path NULL
[bro127:01936] mca:base:select:( plm) Query of component [rsh] set priority to 10
[bro127:01936] mca:base:select:( plm) Querying component [slurm]
[bro127:01936] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[bro127:01936] mca:base:select:( plm) Querying component [tm]
[bro127:01936] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[bro127:01936] mca:base:select:( plm) Selected component [rsh]
[bro127:01936] plm:base:set_hnp_name: initial bias 1936 nodename hash 3261509427
[bro127:01936] plm:base:set_hnp_name: final jobfam 27333
[bro127:01936] [[27333,0],0] plm:base:rsh_setup on agent ssh : rsh path NULL
[bro127:01936] [[27333,0],0] plm:base:receive start comm
[bro127:01936] released to spawn
[bro127:01936] [[27333,0],0] plm:base:setup_job for job [INVALID]
[bro127:01936] [[27333,0],0] plm:rsh: launching job [27333,1]
[bro127:01936] [[27333,0],0] plm:rsh: no new daemons to launch
[bro127:01936] [[27333,0],0] plm:base:launch_apps for job [27333,1]
[bro127:01936] [[27333,0],0] plm:base:report_launched for job [27333,1]
[bro127:01936] [[27333,0],0] plm:base:app_report_launch from daemon [[27333,0],0]
[bro127:01936] [[27333,0],0] plm:base:app_report_launched for proc [[27333,1],0] from daemon [[27333,0],0]: pid 1937 state 4 exit 0
[bro127:01936] [[27333,0],0] plm:base:app_report_launch completed processing
[bro127:01936] [[27333,0],0] plm:base:report_launched all apps reported
[bro127:01936] [[27333,0],0] plm:base:launch wiring up iof
[bro127:01936] [[27333,0],0] plm:base:launch completed for job [27333,1]
[bro127:01936] completed spawn for job [27333,1]
[bro127:01937] Ring 1 message size 10 bytes
[bro127:01937] [[27333,1],0] Ring 1 completed
[bro127:01937] Ring 2 message size 100 bytes
[bro127:01937] [[27333,1],0] Ring 2 completed
[bro127:01937] Ring 3 message size 1000 bytes
[bro127:01937] [[27333,1],0] Ring 3 completed
[bro127:01936] [[27333,0],0] plm:base:receive processing msg
[bro127:01936] [[27333,0],0] plm:base:receive update proc state command
[bro127:01936] [[27333,0],0] plm:base:receive got update_proc_state for job [27333,1]
[bro127:01936] [[27333,0],0] plm:base:receive got update_proc_state for vpid 0 state 80 exit_code 0
[bro127:01936] [[27333,0],0] plm:base:receive updating state for proc [[27333,1],0] current state 10 new state 80
[bro127:01936] [[27333,0],0] plm:base:check_job_completed for job [27333,1] - num_terminated 1 num_procs 1
[bro127:01936] [[27333,0],0] plm:base:check_job_completed declared job [27333,1] normally terminated - checking all jobs
[bro127:01936] [[27333,0],0] releasing procs from node bro127
[bro127:01936] [[27333,0],0] releasing proc [[27333,1],0] from node bro127
[bro127:01936] [[27333,0],0] plm:base:check_job_completed all jobs terminated - waking up
[bro127:01936] [[27333,0],0] plm:base:orted_cmd sending orted_exit commands
[bro127:01936] [[27333,0],0] plm:base:receive stop comm
[bro127:01936] [[27333,0],0] plm:base:local:slave:finalize

[roberpj_at_bro128:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth2 -mca plm_base_verbose 5 ./oob_stress
[bro128:04462] mca:base:select:( plm) Querying component [rsh]
[bro128:04462] [[INVALID],INVALID] plm:base:rsh_lookup on agent ssh : rsh path NULL
[bro128:04462] mca:base:select:( plm) Query of component [rsh] set priority to 10
[bro128:04462] mca:base:select:( plm) Querying component [slurm]
[bro128:04462] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[bro128:04462] mca:base:select:( plm) Querying component [tm]
[bro128:04462] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[bro128:04462] mca:base:select:( plm) Selected component [rsh]
[bro128:04462] plm:base:set_hnp_name: initial bias 4462 nodename hash 186663077
[bro128:04462] plm:base:set_hnp_name: final jobfam 23275
[bro128:04462] [[23275,0],0] plm:base:rsh_setup on agent ssh : rsh path NULL
[bro128:04462] [[23275,0],0] plm:base:receive start comm
[bro128:04462] released to spawn
[bro128:04462] [[23275,0],0] plm:base:setup_job for job [INVALID]
[bro128:04462] [[23275,0],0] plm:rsh: launching job [23275,1]
[bro128:04462] [[23275,0],0] plm:rsh: no new daemons to launch
[bro128:04462] [[23275,0],0] plm:base:launch_apps for job [23275,1]
[bro128:04462] [[23275,0],0] plm:base:report_launched for job [23275,1]
[bro128:04462] [[23275,0],0] plm:base:app_report_launch from daemon [[23275,0],0]
[bro128:04462] [[23275,0],0] plm:base:app_report_launched for proc [[23275,1],0] from daemon [[23275,0],0]: pid 4463 state 4 exit 0
[bro128:04462] [[23275,0],0] plm:base:app_report_launch completed processing
[bro128:04462] [[23275,0],0] plm:base:report_launched all apps reported
[bro128:04462] [[23275,0],0] plm:base:launch wiring up iof
[bro128:04462] [[23275,0],0] plm:base:launch completed for job [23275,1]
[bro128:04462] completed spawn for job [23275,1]
[bro128:04463] Ring 1 message size 10 bytes
[bro128:04463] [[23275,1],0] Ring 1 completed
[bro128:04463] Ring 2 message size 100 bytes
[bro128:04463] [[23275,1],0] Ring 2 completed
[bro128:04463] Ring 3 message size 1000 bytes
[bro128:04463] [[23275,1],0] Ring 3 completed
[bro128:04462] [[23275,0],0] plm:base:receive processing msg
[bro128:04462] [[23275,0],0] plm:base:receive update proc state command
[bro128:04462] [[23275,0],0] plm:base:receive got update_proc_state for job [23275,1]
[bro128:04462] [[23275,0],0] plm:base:receive got update_proc_state for vpid 0 state 80 exit_code 0
[bro128:04462] [[23275,0],0] plm:base:receive updating state for proc [[23275,1],0] current state 10 new state 80
[bro128:04462] [[23275,0],0] plm:base:check_job_completed for job [23275,1] - num_terminated 1 num_procs 1
[bro128:04462] [[23275,0],0] plm:base:check_job_completed declared job [23275,1] normally terminated - checking all jobs
[bro128:04462] [[23275,0],0] releasing procs from node bro128
[bro128:04462] [[23275,0],0] releasing proc [[23275,1],0] from node bro128
[bro128:04462] [[23275,0],0] plm:base:check_job_completed all jobs terminated - waking up
[bro128:04462] [[23275,0],0] plm:base:orted_cmd sending orted_exit commands
[bro128:04462] [[23275,0],0] plm:base:receive stop comm
[bro128:04462] [[23275,0],0] plm:base:local:slave:finalize
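
Note that every run above keeps its single process on one node, so these
tests mainly confirm that the out-of-band channel comes up locally. To push
oob_stress traffic through the switch itself, a two-node run would be needed;
a possible invocation (a sketch only, assuming passwordless ssh between the
nodes and the same Open MPI build on both) would be:

[roberpj_at_bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -host bro127,bro128 -mca oob_tcp_if_include eth2 ./oob_stress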

---------- Forwarded message ----------
Date: Fri, 31 Jan 2014 13:55:41 -0800
From: Ralph Castain <rhc_at_[hidden]>
Reply-To: Open MPI Users <users_at_[hidden]>
To: Open MPI Users <users_at_[hidden]>
Subject: Re: [OMPI users] Connection timed out with multiple nodes

The only relevant parts are from the application procs - orterun and the orted don't participate in this exchange and never see the BTLs anyway.

It looks like there is just something blocking data transfer across eth2 for some reason. I'm afraid I have no idea why - can you run a standard (i.e., non-MPI) test across it?
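
(One simple non-MPI check, sketched here with a placeholder port and address:
run a netcat listener on one node and connect to its eth2 address from the
other, assuming netcat is available on both:

nc -l 5000                   # listener on bro128; some netcat variants need "nc -l -p 5000"
nc <bro128-eth2-addr> 5000   # on bro127; typed lines should appear on bro128

If typed lines make it across, raw TCP over eth2 is working.)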

For example, I have an oob_stress program in orte/test/system. Try running it:

mpirun -npernode 1 -mca oob_tcp_if_include eth2 ./oob_stress

and see if anything works. If the out-of-band can't communicate, this won't even start - it'll just hang. If you configure OMPI with --enable-debug, you can add -mca plm_base_verbose 5 to watch the launch operation and see if the remote daemon is able to respond.
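
Combined with the oob_tcp_if_include setting above, that becomes:

mpirun -npernode 1 -mca oob_tcp_if_include eth2 -mca plm_base_verbose 5 ./oob_stress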

My guess is that the answer will be "no" and that this will hang, but that would tell us the problem is in the network and not in the TCP BTL.