Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
From: Jeffrey Squyres (jsquyres_at_[hidden])
Date: 2012-03-01 16:53:56


Actually, I should add that I discovered a workaround: if you put --prefix on each line of the app context file, then the first case (running the app context file) works fine; it adheres to the --prefix behavior.
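
For illustration, a minimal sketch of that layout. It writes an app context file with --prefix repeated on every line, using the install path and host names from the script quoted below. The output file name addmpw-hostname.prefixed and the helper make_appfile are made-up names for this example, not anything Open MPI provides.

```shell
#!/bin/sh
# Sketch: build an app context file that repeats --prefix on each line.
# Prefix and hosts come from the quoted test.bash and app file;
# "make_appfile" and the output file name are hypothetical.
prefix=/usr/adina/system8.8dmp
appfile=addmpw-hostname.prefixed

make_appfile() {
    # $1 = install prefix, remaining arguments = host names
    pfx=$1
    shift
    for host in "$@"; do
        # One app context per line: one process of "hostname" per host
        printf '%s -n 1 -host %s hostname\n' "--prefix $pfx" "$host"
    done
}

make_appfile "$prefix" gulftown ibnode001 ibnode002 ibnode003 > "$appfile"
cat "$appfile"
```

The generated file would then be passed the same way as before, e.g. $mpirunfile $mcaparams --app addmpw-hostname.prefixed.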

Ralph: is this intended behavior? (I don't know if I have an opinion either way)

On Mar 1, 2012, at 4:51 PM, Jeffrey Squyres wrote:

> I see the problem.
>
> It looks like the use of the app context file is triggering different behavior, and that behavior is erasing the use of --prefix. If I replace the app context file with a complete command line, it works and the --prefix behavior is observed.
>
> Specifically:
>
> $mpirunfile $mcaparams --app addmpw-hostname
>
> ^^ This one seems to ignore --prefix behavior.
>
> $mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname
> $mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 -np 1 hostname
>
> ^^ These two seem to adhere to the proper --prefix behavior.
>
> Ralph -- can you have a look?
>
> On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote:
>
>> Hi Ralph,
>>
>> Thanks, here is what I did as suggested by Jeff:
>>
>>> What did this command line look like? Can you provide the configure line as well?
>>
>> As in my previous post, the script is as follows:
>>
>> (1) debug messages:
>> >>>
>> [yiguang_at_gulftown testdmp]$ ./test.bash
>> [gulftown:28340] mca: base: components_open: Looking for plm components
>> [gulftown:28340] mca: base: components_open: opening plm components
>> [gulftown:28340] mca: base: components_open: found loaded component rsh
>> [gulftown:28340] mca: base: components_open: component rsh has no register function
>> [gulftown:28340] mca: base: components_open: component rsh open function successful
>> [gulftown:28340] mca: base: components_open: found loaded component slurm
>> [gulftown:28340] mca: base: components_open: component slurm has no register function
>> [gulftown:28340] mca: base: components_open: component slurm open function successful
>> [gulftown:28340] mca: base: components_open: found loaded component tm
>> [gulftown:28340] mca: base: components_open: component tm has no register function
>> [gulftown:28340] mca: base: components_open: component tm open function successful
>> [gulftown:28340] mca:base:select: Auto-selecting plm components
>> [gulftown:28340] mca:base:select:( plm) Querying component [rsh]
>> [gulftown:28340] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [gulftown:28340] mca:base:select:( plm) Querying component [slurm]
>> [gulftown:28340] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>> [gulftown:28340] mca:base:select:( plm) Querying component [tm]
>> [gulftown:28340] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
>> [gulftown:28340] mca:base:select:( plm) Selected component [rsh]
>> [gulftown:28340] mca: base: close: component slurm closed
>> [gulftown:28340] mca: base: close: unloading component slurm
>> [gulftown:28340] mca: base: close: component tm closed
>> [gulftown:28340] mca: base: close: unloading component tm
>> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 3546479048
>> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
>> [gulftown:28340] [[17438,0],0] plm:base:receive start comm
>> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
>> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
>> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
>> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local shell
>> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
>> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
>> /usr/bin/rsh <template> orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100
>> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node gulftown
>> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
>> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],1]
>> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode001 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
>> bash: orted: command not found
>> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode002
>> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],2]
>> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode002 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
>> bash: orted: command not found
>> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode003
>> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode003 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
>> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],3]
>> bash: orted: command not found
>> [gulftown:28340] [[17438,0],0] plm:base:daemon_callback
>> <<<
>>
>> (2) test.bash script:
>> >>>
>> #!/bin/sh -f
>> #nohup
>> #
>> # >-------------------------------------------------------------------------------------------<
>> adinahome=/usr/adina/system8.8dmp
>> mpirunfile=$adinahome/bin/mpirun
>> #
>> # Set envars for mpirun and orted
>> #
>> export PATH=$adinahome/bin:$adinahome/tools:$PATH
>> export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH
>> #
>> #
>> # run DMP problem
>> #
>> mcaprefix="--prefix $adinahome"
>> mcarshagent="--mca plm_rsh_agent rsh:ssh"
>> mcatmpdir="--mca orte_tmpdir_base /tmp"
>> mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0"
>> mcaenvars="-x PATH -x LD_LIBRARY_PATH"
>> mcabtlconn="--mca btl openib,sm,self"
>> mcaplmbase="--mca plm_base_verbose 100"
>>
>> mcaparams="$mcaprefix $mcaenvars $mcarshagent $mcaopenibmsg $mcabtlconn $mcatmpdir $mcaplmbase"
>>
>> $mpirunfile $mcaparams --app addmpw-hostname
>> <<<
>>
>> (3) the content of the app file addmpw-hostname:
>> >>>
>> -n 1 -host gulftown hostname
>> -n 1 -host ibnode001 hostname
>> -n 1 -host ibnode002 hostname
>> -n 1 -host ibnode003 hostname
>> <<<
>>
>> Any comments?
>>
>> Thanks,
>> Yiguang
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/