Open MPI User's Mailing List Archives

Subject: [OMPI users] Connection to lifeline lost
From: etcamargo (etcamargo_at_[hidden])
Date: 2014-01-24 12:41:38


Hi, All!

I am having trouble running a simple "hello world" program across
different hosts. The hosts are virtual machines on the same network.
The program works fine when run on a single host. ssh works between the
machines, and NFS works too, sharing the executable files between the
machines.

a) $ mpirun -hostfile machines -v -np 2 ./hello

[achel:15275] [[32727,0],0] ORTE_ERROR_LOG: Out of resource in file
base/plm_base_launch_support.c at line 482
[latrappe:16467] OPAL dss:unpack: got type 49 when expecting type 38
[latrappe:16467] [[32727,0],1] ORTE_ERROR_LOG: Pack data mismatch in
file ../../../orte/orted/orted_comm.c at line 235
[latrappe:16467] [[32727,0],1] routed:binomial: Connection to lifeline
[[32727,0],0] lost

b) $ mpirun -mca plm_base_verbose 5 -hostfile machines -v -np 2 ./hello

[achel:17020] mca:base:select:( plm) Querying component [rsh]
[achel:17020] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path
NULL
[achel:17020] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[achel:17020] mca:base:select:( plm) Querying component [slurm]
[achel:17020] mca:base:select:( plm) Skipping component [slurm]. Query
failed to return a module
[achel:17020] mca:base:select:( plm) Selected component [rsh]
[achel:17020] plm:base:set_hnp_name: initial bias 17020 nodename hash
2714559920
[achel:17020] plm:base:set_hnp_name: final jobfam 1536
[achel:17020] [[1536,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[achel:17020] [[1536,0],0] plm:base:receive start comm
[achel:17020] released to spawn
[achel:17020] [[1536,0],0] plm:base:setup_vm
[achel:17020] [[1536,0],0] plm:base:setup_vm creating map
[achel:17020] [[1536,0],0] plm:base:setup_vm add new daemon [[1536,0],1]
[achel:17020] [[1536,0],0] plm:base:setup_vm assigning new daemon
[[1536,0],1] to node latrappe.c3local
[achel:17020] [[1536,0],0] plm:rsh: launching vm
[achel:17020] [[1536,0],0] plm:rsh: local shell: 0 (bash)
[achel:17020] [[1536,0],0] plm:rsh: assuming same remote shell as local
shell
[achel:17020] [[1536,0],0] plm:rsh: remote shell: 0 (bash)
[achel:17020] [[1536,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template> orted -mca ess env -mca orte_ess_jobid
100663296 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 -mca
orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca
plm_base_verbose 5 -mca plm rsh
[achel:17020] [[1536,0],0] plm:rsh: launching on node latrappe.c3local
[achel:17020] [[1536,0],0] plm:rsh: recording launch of daemon
[[1536,0],1]
[achel:17020] [[1536,0],0] plm:base:daemon_callback
[achel:17020] [[1536,0],0] plm:rsh: executing: (//usr/bin/ssh)
[/usr/bin/ssh latrappe.c3local orted -mca ess env -mca orte_ess_jobid
100663296 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca
orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca
plm_base_verbose 5 -mca plm rsh]
[latrappe:18212] mca:base:select:( plm) Querying component [rsh]
[latrappe:18212] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[latrappe:18212] mca:base:select:( plm) Selected component [rsh]
[achel:17020] [[1536,0],0] plm:base:orted_report_launch from daemon
[[1536,0],1] via [[1536,0],1]
[achel:17020] [[1536,0],0] ORTE_ERROR_LOG: Out of resource in file
base/plm_base_launch_support.c at line 482
[achel:17020] [[1536,0],0] plm:base:orted_report_launch failed for
daemon [[1536,0],1] (via [[1536,0],1]) at contact
100663296.1;tcp://10.254.222.7:33825
[achel:17020] [[1536,0],0] plm:base:orted_cmd sending orted_exit
commands
[achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit abnormal term
ordered
[achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit sending cmd to
[[1536,0],1]
[achel:17020] [[1536,0],0] plm:base:orted_cmd message to [[1536,0],1]
sent
[achel:17020] [[1536,0],0] plm:base:orted_cmd all messages sent
[achel:17020] [[1536,0],0] plm:tm: daemon launch failed on error (null)
[latrappe:18212] OPAL dss:unpack: got type 49 when expecting type 38
[latrappe:18212] [[1536,0],1] ORTE_ERROR_LOG: Pack data mismatch in file
../../../orte/orted/orted_comm.c at line 235
[achel:17020] [[1536,0],0] plm:base:receive stop comm
[latrappe:18212] [[1536,0],1] routed:binomial: Connection to lifeline
[[1536,0],0] lost
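
In case it helps when reading the identifiers in the output above: the
numeric job id passed to orted (orte_ess_jobid 100663296) appears to be
the job family (1536) shifted left by 16 bits, and the part of
orte_hnp_uri before the ";" looks like "jobid.vpid". Below is a minimal
Python sketch of that decoding. The 16-bit split is my reading of the
log, and the helper names split_jobid and parse_uri are hypothetical,
not part of Open MPI.

```python
def split_jobid(jobid):
    """Split a 32-bit job id into (job family, local job id).

    Assumes the upper 16 bits hold the job family and the lower
    16 bits hold the local job id, as the log values suggest.
    """
    return (jobid >> 16) & 0xFFFF, jobid & 0xFFFF


def parse_uri(uri):
    """Split a 'jobid.vpid;contact' string into its three parts."""
    name, _, contact = uri.partition(";")
    jobid, _, vpid = name.partition(".")
    return int(jobid), int(vpid), contact


# Value copied from the orte_hnp_uri argument in the log above.
jobid, vpid, contact = parse_uri("100663296.0;tcp://10.254.222.5:37564")
print(split_jobid(jobid), vpid, contact)
# prints: (1536, 0) 0 tcp://10.254.222.5:37564
```

Read this way, [[1536,0],0] is the mpirun process on achel and
[[1536,0],1] is the daemon started on latrappe.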

Thanks in advance,

Edson