Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Connection to lifeline lost
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-24 13:00:17


Looks to me like you are picking up a different OMPI installation on the remote node. Check that PATH and LD_LIBRARY_PATH are being set correctly on the remote host.
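
One rough way to check this is to print what a shell actually resolves on each host. The sketch below assumes bash on both ends (the log reports bash as the remote shell); the function name describe_ompi is made up for illustration and is not part of Open MPI.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Open MPI tool): report which Open MPI
# binaries this shell would run and which library path it would use.
describe_ompi() {
  # command -v prints the resolved path, or fails if nothing is found.
  echo "mpirun: $(command -v mpirun || echo 'not found')"
  echo "orted: $(command -v orted || echo 'not found')"
  echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-<unset>}"
}
```

Run describe_ompi locally on achel, then run the same function through ssh the way mpirun launches orted, i.e. as a non-interactive command: ssh latrappe.c3local "$(declare -f describe_ompi); describe_ompi". If the two outputs point at different install prefixes, or orted is not found remotely, the environment for non-interactive logins is the likely culprit. An interactive login on latrappe may show different values, so it is not a reliable check.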
On Jan 24, 2014, at 9:41 AM, etcamargo <etcamargo_at_[hidden]> wrote:

> Hi, All!
>
> I am having trouble running a simple "hello world" program across different hosts. The hosts are virtual machines on the same network. The program works fine only when run on a single host. ssh works between the machines, and NFS is working, sharing the executable files between them.
>
> a) $ mpirun -hostfile machines -v -np 2 ./hello
>
> [achel:15275] [[32727,0],0] ORTE_ERROR_LOG: Out of resource in file base/plm_base_launch_support.c at line 482
> [latrappe:16467] OPAL dss:unpack: got type 49 when expecting type 38
> [latrappe:16467] [[32727,0],1] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/orted/orted_comm.c at line 235
> [latrappe:16467] [[32727,0],1] routed:binomial: Connection to lifeline [[32727,0],0] lost
>
>
> b) $ mpirun -mca plm_base_verbose 5 -hostfile machines -v -np 2 ./hello
>
> [achel:17020] mca:base:select:( plm) Querying component [rsh]
> [achel:17020] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
> [achel:17020] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [achel:17020] mca:base:select:( plm) Querying component [slurm]
> [achel:17020] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [achel:17020] mca:base:select:( plm) Selected component [rsh]
> [achel:17020] plm:base:set_hnp_name: initial bias 17020 nodename hash 2714559920
> [achel:17020] plm:base:set_hnp_name: final jobfam 1536
> [achel:17020] [[1536,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [achel:17020] [[1536,0],0] plm:base:receive start comm
> [achel:17020] released to spawn
> [achel:17020] [[1536,0],0] plm:base:setup_vm
> [achel:17020] [[1536,0],0] plm:base:setup_vm creating map
> [achel:17020] [[1536,0],0] plm:base:setup_vm add new daemon [[1536,0],1]
> [achel:17020] [[1536,0],0] plm:base:setup_vm assigning new daemon [[1536,0],1] to node latrappe.c3local
> [achel:17020] [[1536,0],0] plm:rsh: launching vm
> [achel:17020] [[1536,0],0] plm:rsh: local shell: 0 (bash)
> [achel:17020] [[1536,0],0] plm:rsh: assuming same remote shell as local shell
> [achel:17020] [[1536,0],0] plm:rsh: remote shell: 0 (bash)
> [achel:17020] [[1536,0],0] plm:rsh: final template argv:
> /usr/bin/ssh <template> orted -mca ess env -mca orte_ess_jobid 100663296 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca plm_base_verbose 5 -mca plm rsh
> [achel:17020] [[1536,0],0] plm:rsh: launching on node latrappe.c3local
> [achel:17020] [[1536,0],0] plm:rsh: recording launch of daemon [[1536,0],1]
> [achel:17020] [[1536,0],0] plm:base:daemon_callback
> [achel:17020] [[1536,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh latrappe.c3local orted -mca ess env -mca orte_ess_jobid 100663296 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca plm_base_verbose 5 -mca plm rsh]
> [latrappe:18212] mca:base:select:( plm) Querying component [rsh]
> [latrappe:18212] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [latrappe:18212] mca:base:select:( plm) Selected component [rsh]
> [achel:17020] [[1536,0],0] plm:base:orted_report_launch from daemon [[1536,0],1] via [[1536,0],1]
> [achel:17020] [[1536,0],0] ORTE_ERROR_LOG: Out of resource in file base/plm_base_launch_support.c at line 482
> [achel:17020] [[1536,0],0] plm:base:orted_report_launch failed for daemon [[1536,0],1] (via [[1536,0],1]) at contact 100663296.1;tcp://10.254.222.7:33825
> [achel:17020] [[1536,0],0] plm:base:orted_cmd sending orted_exit commands
> [achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit abnormal term ordered
> [achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit sending cmd to [[1536,0],1]
> [achel:17020] [[1536,0],0] plm:base:orted_cmd message to [[1536,0],1] sent
> [achel:17020] [[1536,0],0] plm:base:orted_cmd all messages sent
> [achel:17020] [[1536,0],0] plm:tm: daemon launch failed on error (null)
> [latrappe:18212] OPAL dss:unpack: got type 49 when expecting type 38
> [latrappe:18212] [[1536,0],1] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/orted/orted_comm.c at line 235
> [achel:17020] [[1536,0],0] plm:base:receive stop comm
> [latrappe:18212] [[1536,0],1] routed:binomial: Connection to lifeline [[1536,0],0] lost
>
> Thanks in advance,
>
> Edson