Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] orted daemon no found! --- environment not passed to slave nodes
From: Jeffrey Squyres (jsquyres_at_[hidden])
Date: 2012-02-29 14:33:21


Gah. I didn't realize that my 1.4.x build was a *developer* build. *Developer* builds give a *lot* more detail with plm_base_verbose=100 (including the specific rsh command being used). You obviously didn't get that output because you don't have a developer build. :-\

Just for reference, here's what plm_base_verbose=100 tells me for running an orted on a remote node, when I use the --prefix option to mpirun (I'm a tcsh user, so the syntax below will be a little different than what is running in your environment):

-----
[svbu-mpi:28527] [[20181,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh svbu-mpi001 set path = ( /home/jsquyres/bogus/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib:$LD_LIBRARY_PATH ; /home/jsquyres/bogus/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 1322582016 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "1322582016.0;tcp://172.29.218.140:34815;tcp://10.148.255.1:34815" --mca plm_base_verbose 100]
-----
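
For readers on a Bourne-style shell, the environment setup in that tcsh command line can be sketched roughly as follows. This is only an illustration of the logic (put the prefix's bin directory first in PATH, and prepend its lib directory to LD_LIBRARY_PATH whether or not that variable already has a value); it is not the text Open MPI generates. The helper name prepend_path and the variable ompi_prefix are hypothetical.

```shell
#!/bin/sh
# Sketch: sh version of the PATH / LD_LIBRARY_PATH setup shown in the tcsh
# command above. prepend_path and ompi_prefix are illustrative names only.

# prepend_path DIR CURRENT: print DIR alone if CURRENT is empty,
# otherwise print DIR:CURRENT.
prepend_path() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# The --prefix value from the example above.
ompi_prefix=/home/jsquyres/bogus

# Put the prefix's bin directory first so the matching orted is found.
PATH=$(prepend_path "$ompi_prefix/bin" "$PATH")
export PATH

# Prepend the prefix's lib directory, keeping any existing value.
# Note: unlike the tcsh test, an empty-but-set variable is treated as unset.
LD_LIBRARY_PATH=$(prepend_path "$ompi_prefix/lib" "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH

echo "PATH=$PATH"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

If the remote shell ends up with these two variables set this way, both orted and its shared libraries should be resolvable from the prefix.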

Ok, a few options here:

1. You can get a developer build if you use the --enable-debug option to configure. Then plm_base_verbose=100 will give a lot more info. Remember, the goal here is to see what's going wrong -- not to depend on having a developer build around.

2. If that isn't workable, put a short script named "orted" somewhere in your default path:

-----
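#!/bin/sh
# Stand-in "orted": dump the environment it was launched with.
echo ===========environment===========
env | sort
echo ===========environment end===========
# Stay alive long enough to be inspected with "ps".
sleep 10000000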
-----

Then when you run "mpirun", do a "ps" on both the node where mpirun was invoked and the node where orted is supposed to be running, to see exactly what was executed. It's not quite as descriptive as seeing the plm_base_verbose output because we run multiple shell commands, but it's something. You'll also see the stdout from the local node. To see the output from the remote nodes, you'll need to use the --leave-session-attached option to mpirun.
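
The "ps" check can be sketched as a small filter. This is a minimal example, assuming a POSIX "ps" that accepts -e and -o; list_orted is a hypothetical helper name. It reads ps-style text on stdin, so it works on saved output as well as on a live node.

```shell
#!/bin/sh
# Sketch: show only the processes whose command line runs something
# named "orted". list_orted is a hypothetical helper, not an Open MPI tool.
list_orted() {
    # Match "orted" as a bare word or as the last component of a path,
    # so unrelated words such as "sorted" are not reported.
    grep -E '(^|[ /])orted( |$)'
}

# Run this on each node while mpirun is still running.
ps -eo pid,args | list_orted || echo "no orted process found"
```

Run it on the node where mpirun was invoked and on the node where orted is supposed to start; the full argument list in the output shows exactly which command was executed.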

On Feb 29, 2012, at 9:43 AM, Yiguang Yan wrote:

> Hi Jeff,
>
> Thanks.
>
> I tried what you suggested. Here is the output:
>
> >>>
> [yiguang@gulftown testdmp]$ ./test.bash
> [gulftown:25052] mca: base: components_open: Looking for plm components
> [gulftown:25052] mca: base: components_open: opening plm components
> [gulftown:25052] mca: base: components_open: found loaded component rsh
> [gulftown:25052] mca: base: components_open: component rsh has no register function
> [gulftown:25052] mca: base: components_open: component rsh open function successful
> [gulftown:25052] mca: base: components_open: found loaded component slurm
> [gulftown:25052] mca: base: components_open: component slurm has no register function
> [gulftown:25052] mca: base: components_open: component slurm open function successful
> [gulftown:25052] mca: base: components_open: found loaded component tm
> [gulftown:25052] mca: base: components_open: component tm has no register function
> [gulftown:25052] mca: base: components_open: component tm open function successful
> [gulftown:25052] mca:base:select: Auto-selecting plm components
> [gulftown:25052] mca:base:select:( plm) Querying component [rsh]
> [gulftown:25052] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [gulftown:25052] mca:base:select:( plm) Querying component [slurm]
> [gulftown:25052] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [gulftown:25052] mca:base:select:( plm) Querying component [tm]
> [gulftown:25052] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
> [gulftown:25052] mca:base:select:( plm) Selected component [rsh]
> [gulftown:25052] mca: base: close: component slurm closed
> [gulftown:25052] mca: base: close: unloading component slurm
> [gulftown:25052] mca: base: close: component tm closed
> [gulftown:25052] mca: base: close: unloading component tm
> bash: orted: command not found
> bash: orted: command not found
> bash: orted: command not found
> <<<
>
>
> The following is the content of test.bash:
> >>>
> [yiguang@gulftown testdmp]$ ./test.bash
> #!/bin/sh -f
> #nohup
> #
> # >-------------------------------------------------------------------------------------------<
> adinahome=/usr/adina/system8.8dmp
> mpirunfile=$adinahome/bin/mpirun
> #
> # Set envars for mpirun and orted
> #
> export PATH=$adinahome/bin:$adinahome/tools:$PATH
> export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH
> #
> #
> # run DMP problem
> #
> mcaprefix="--prefix $adinahome"
> mcarshagent="--mca plm_rsh_agent rsh:ssh"
> mcatmpdir="--mca orte_tmpdir_base /tmp"
> mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0"
> mcaenvars="-x PATH -x LD_LIBRARY_PATH"
> mcabtlconn="--mca btl openib,sm,self"
> mcaplmbase="--mca plm_base_verbose 100"
>
> mcaparams="$mcaprefix $mcaenvars $mcarshagent $mcaopenibmsg $mcabtlconn $mcatmpdir $mcaplmbase"
>
> $mpirunfile $mcaparams --app addmpw-hostname
> <<<
>
> The content of addmpw-hostname is:
> >>>
> -n 1 -host gulftown hostname
> -n 1 -host ibnode001 hostname
> -n 1 -host ibnode002 hostname
> -n 1 -host ibnode003 hostname
> <<<
>
> After this, I also tried to specify the orted through:
>
> --mca orte_launch_agent $adinahome/bin/orted
>
> then, orted could be found on slave nodes, but now the shared libs
> in $adinahome/lib are not on the LD_LIBRARY_PATH.
>
> Any comments?
>
> Thanks,
> Yiguang

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/