Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Connection to lifeline lost
From: etcamargo (etcamargo_at_[hidden])
Date: 2014-01-24 14:31:09


You are right. The problem was solved by giving the full path to a
single Open MPI installation's mpirun:

/home/myuser/openmpi-x/bin/mpirun -hostfile machines -np 2 ./hello
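
For the archives: the failure happened because the remote node was
resolving a different Open MPI installation than the head node, so the
remote orted could not unpack messages from mpirun (the "Pack data
mismatch" and "got type 49 when expecting type 38" errors in the logs
below). A quick check (a sketch, assuming the remote host is latrappe,
as in the logs below) is to compare what each side resolves:

$ which mpirun && mpirun --version
$ ssh latrappe 'which mpirun orted; mpirun --version'

If they differ, either fix PATH and LD_LIBRARY_PATH in the remote
shell's startup files, or pass mpirun's --prefix option, which
prepends <prefix>/bin to PATH and <prefix>/lib to LD_LIBRARY_PATH on
the remote nodes before launching orted:

/home/myuser/openmpi-x/bin/mpirun --prefix /home/myuser/openmpi-x -hostfile machines -np 2 ./hello

(If I read the mpirun man page correctly, invoking mpirun by its
absolute path is equivalent to passing --prefix with that
installation's directory, which would explain why the full path alone
fixed it.)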

Thanks,

Edson

On 24-01-2014 16:00, Ralph Castain wrote:
> Looks to me like you are picking up a different OMPI installation on
> the remote node - check that your PATH and LD_LIBRARY_PATH on the
> remote host are being set correctly.
>
> On Jan 24, 2014, at 9:41 AM, etcamargo <etcamargo_at_[hidden]> wrote:
>
>> Hi, All!
>>
>> I have a problem running a simple "hello world" program across
>> different hosts. The hosts are virtual machines on the same network.
>> The program runs fine only on a single host; SSH works between the
>> machines, and NFS is sharing the executable files between them.
>>
>> a) $ mpirun -hostfile machines -v -np 2 ./hello
>>
>> [achel:15275] [[32727,0],0] ORTE_ERROR_LOG: Out of resource in file
>> base/plm_base_launch_support.c at line 482
>> [latrappe:16467] OPAL dss:unpack: got type 49 when expecting type 38
>> [latrappe:16467] [[32727,0],1] ORTE_ERROR_LOG: Pack data mismatch in
>> file ../../../orte/orted/orted_comm.c at line 235
>> [latrappe:16467] [[32727,0],1] routed:binomial: Connection to lifeline
>> [[32727,0],0] lost
>>
>>
>> b) $ mpirun -mca plm_base_verbose 5 -hostfile machines -v -np 2
>> ./hello
>>
>> [achel:17020] mca:base:select:( plm) Querying component [rsh]
>> [achel:17020] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
>> path NULL
>> [achel:17020] mca:base:select:( plm) Query of component [rsh] set
>> priority to 10
>> [achel:17020] mca:base:select:( plm) Querying component [slurm]
>> [achel:17020] mca:base:select:( plm) Skipping component [slurm].
>> Query failed to return a module
>> [achel:17020] mca:base:select:( plm) Selected component [rsh]
>> [achel:17020] plm:base:set_hnp_name: initial bias 17020 nodename hash
>> 2714559920
>> [achel:17020] plm:base:set_hnp_name: final jobfam 1536
>> [achel:17020] [[1536,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> [achel:17020] [[1536,0],0] plm:base:receive start comm
>> [achel:17020] released to spawn
>> [achel:17020] [[1536,0],0] plm:base:setup_vm
>> [achel:17020] [[1536,0],0] plm:base:setup_vm creating map
>> [achel:17020] [[1536,0],0] plm:base:setup_vm add new daemon
>> [[1536,0],1]
>> [achel:17020] [[1536,0],0] plm:base:setup_vm assigning new daemon
>> [[1536,0],1] to node latrappe.c3local
>> [achel:17020] [[1536,0],0] plm:rsh: launching vm
>> [achel:17020] [[1536,0],0] plm:rsh: local shell: 0 (bash)
>> [achel:17020] [[1536,0],0] plm:rsh: assuming same remote shell as
>> local shell
>> [achel:17020] [[1536,0],0] plm:rsh: remote shell: 0 (bash)
>> [achel:17020] [[1536,0],0] plm:rsh: final template argv:
>> /usr/bin/ssh <template> orted -mca ess env -mca orte_ess_jobid
>> 100663296 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 -mca
>> orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca
>> plm_base_verbose 5 -mca plm rsh
>> [achel:17020] [[1536,0],0] plm:rsh: launching on node latrappe.c3local
>> [achel:17020] [[1536,0],0] plm:rsh: recording launch of daemon
>> [[1536,0],1]
>> [achel:17020] [[1536,0],0] plm:base:daemon_callback
>> [achel:17020] [[1536,0],0] plm:rsh: executing: (//usr/bin/ssh)
>> [/usr/bin/ssh latrappe.c3local orted -mca ess env -mca orte_ess_jobid
>> 100663296 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca
>> orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca
>> plm_base_verbose 5 -mca plm rsh]
>> [latrappe:18212] mca:base:select:( plm) Querying component [rsh]
>> [latrappe:18212] mca:base:select:( plm) Query of component [rsh] set
>> priority to 10
>> [latrappe:18212] mca:base:select:( plm) Selected component [rsh]
>> [achel:17020] [[1536,0],0] plm:base:orted_report_launch from daemon
>> [[1536,0],1] via [[1536,0],1]
>> [achel:17020] [[1536,0],0] ORTE_ERROR_LOG: Out of resource in file
>> base/plm_base_launch_support.c at line 482
>> [achel:17020] [[1536,0],0] plm:base:orted_report_launch failed for
>> daemon [[1536,0],1] (via [[1536,0],1]) at contact
>> 100663296.1;tcp://10.254.222.7:33825
>> [achel:17020] [[1536,0],0] plm:base:orted_cmd sending orted_exit
>> commands
>> [achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit abnormal term
>> ordered
>> [achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit sending cmd
>> to [[1536,0],1]
>> [achel:17020] [[1536,0],0] plm:base:orted_cmd message to [[1536,0],1]
>> sent
>> [achel:17020] [[1536,0],0] plm:base:orted_cmd all messages sent
>> [achel:17020] [[1536,0],0] plm:tm: daemon launch failed on error
>> (null)
>> [latrappe:18212] OPAL dss:unpack: got type 49 when expecting type 38
>> [latrappe:18212] [[1536,0],1] ORTE_ERROR_LOG: Pack data mismatch in
>> file ../../../orte/orted/orted_comm.c at line 235
>> [achel:17020] [[1536,0],0] plm:base:receive stop comm
>> [latrappe:18212] [[1536,0],1] routed:binomial: Connection to lifeline
>> [[1536,0],0] lost
>>
>> Thanks in advance,
>>
>> Edson
>>