Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
From: Jeffrey Squyres (jsquyres_at_[hidden])
Date: 2012-03-01 16:51:53


I see the problem.

It looks like using the app context file triggers different behavior, and that behavior discards the --prefix setting. If I replace the app context file with a complete command line, it works and --prefix is honored.

Specifically:

$mpirunfile $mcaparams --app addmpw-hostname

^^ This one seems to ignore --prefix behavior.

$mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname
$mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 -np 1 hostname

^^ These two seem to adhere to the proper --prefix behavior.
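Until this is fixed, one possible workaround is to expand the app file into the colon-separated form shown above before calling mpirun. The following is only a sketch: appfile_to_args is a hypothetical helper (not part of Open MPI), and it assumes the app file has one app context per line, with empty lines and lines starting with # to be skipped.

```shell
#!/bin/sh
# Hypothetical helper, not an Open MPI tool: print the contents of an app
# context file as one colon-separated mpirun argument string.
# Assumption: one app context per line; empty lines and lines beginning
# with '#' are skipped.
appfile_to_args() {
    args=""
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;
        esac
        if [ -z "$args" ]; then
            args="$line"
        else
            # mpirun separates app contexts on the command line with ':'
            args="$args : $line"
        fi
    done < "$1"
    printf '%s\n' "$args"
}
```

With the script from this thread, that would be used as `$mpirunfile $mcaparams $(appfile_to_args addmpw-hostname)`, leaving the substitution unquoted so the shell splits it into separate arguments. Whether this sidesteps the --prefix problem on a given installation would still need to be checked.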

Ralph -- can you have a look?

On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote:

> Hi Ralph,
>
> Thanks, here is what I did as suggested by Jeff:
>
>> What did this command line look like? Can you provide the configure line as well?
>
> As in my previous post, the script is as follows:
>
> (1) debug messages:
>>>>
> [yiguang_at_gulftown testdmp]$ ./test.bash
> [gulftown:28340] mca: base: components_open: Looking for plm components
> [gulftown:28340] mca: base: components_open: opening plm components
> [gulftown:28340] mca: base: components_open: found loaded component rsh
> [gulftown:28340] mca: base: components_open: component rsh has no register function
> [gulftown:28340] mca: base: components_open: component rsh open function successful
> [gulftown:28340] mca: base: components_open: found loaded component slurm
> [gulftown:28340] mca: base: components_open: component slurm has no register function
> [gulftown:28340] mca: base: components_open: component slurm open function successful
> [gulftown:28340] mca: base: components_open: found loaded component tm
> [gulftown:28340] mca: base: components_open: component tm has no register function
> [gulftown:28340] mca: base: components_open: component tm open function successful
> [gulftown:28340] mca:base:select: Auto-selecting plm components
> [gulftown:28340] mca:base:select:( plm) Querying component [rsh]
> [gulftown:28340] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [gulftown:28340] mca:base:select:( plm) Querying component [slurm]
> [gulftown:28340] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [gulftown:28340] mca:base:select:( plm) Querying component [tm]
> [gulftown:28340] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
> [gulftown:28340] mca:base:select:( plm) Selected component [rsh]
> [gulftown:28340] mca: base: close: component slurm closed
> [gulftown:28340] mca: base: close: unloading component slurm
> [gulftown:28340] mca: base: close: component tm closed
> [gulftown:28340] mca: base: close: unloading component tm
> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 3546479048
> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
> [gulftown:28340] [[17438,0],0] plm:base:receive start comm
> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local shell
> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
> /usr/bin/rsh <template> orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca
> orte_ess_vpid <template> -mca orte_ess_num_procs 4 --hnp-uri
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" -
> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca
> orte_tmpdir_base /tmp --mca plm_base_verbose 100
> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node gulftown
> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],1]
> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode001 orted --daemonize -mca
> ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" -
> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca
> orte_tmpdir_base /tmp --mca plm_base_verbose 100]
> bash: orted: command not found
> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode002
> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],2]
> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode002 orted --daemonize -mca
> ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 4 --hnp-uri
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" -
> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca
> orte_tmpdir_base /tmp --mca plm_base_verbose 100]
> bash: orted: command not found
> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode003
> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode003 orted --daemonize -mca
> ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 4 --hnp-uri
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" -
> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca
> orte_tmpdir_base /tmp --mca plm_base_verbose 100]
> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],3]
> bash: orted: command not found
> [gulftown:28340] [[17438,0],0] plm:base:daemon_callback
> <<<
>
> (2) test.bash script:
>>>>
> #!/bin/sh -f
> #nohup
> #
> # >-------------------------------------------------------------------------------------------<
> adinahome=/usr/adina/system8.8dmp
> mpirunfile=$adinahome/bin/mpirun
> #
> # Set envars for mpirun and orted
> #
> export PATH=$adinahome/bin:$adinahome/tools:$PATH
> export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH
> #
> #
> # run DMP problem
> #
> mcaprefix="--prefix $adinahome"
> mcarshagent="--mca plm_rsh_agent rsh:ssh"
> mcatmpdir="--mca orte_tmpdir_base /tmp"
> mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0"
> mcaenvars="-x PATH -x LD_LIBRARY_PATH"
> mcabtlconn="--mca btl openib,sm,self"
> mcaplmbase="--mca plm_base_verbose 100"
>
> mcaparams="$mcaprefix $mcaenvars $mcarshagent $mcaopenibmsg $mcabtlconn $mcatmpdir $mcaplmbase"
>
> $mpirunfile $mcaparams --app addmpw-hostname
> <<<
>
> (3) the content of the app file addmpw-hostname:
>>>>
> -n 1 -host gulftown hostname
> -n 1 -host ibnode001 hostname
> -n 1 -host ibnode002 hostname
> -n 1 -host ibnode003 hostname
> <<<
>
> Any comments?
>
> Thanks,
> Yiguang
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/