Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mca:base:select:( ess) No component selected!
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-09-24 15:40:19


Yes - you don't want to use orte_launch_agent at all for that purpose.
What you need to set is the "ompi_prefix" info key in the MPI_Info you
pass to your comm_spawn call, with the value set to the Open MPI install
path. The ssh launcher will assemble the launch cmd using that info.

Ralph

On Sep 24, 2008, at 1:28 PM, Will Portnoy wrote:

> Yes, your first sentence is correct. I intend to use the unmodified
> orted, but I need to set up the unix environment after the ssh has
> completed but before orted is executed.
>
> In particular, one of the more important tasks for me to do after ssh
> connects is to set LD_LIBRARY_PATH and PATH to include the paths of
> the openmpi's install lib and bin directories, respectively.
> Otherwise, orted will not be on the PATH, and its dependent libraries
> will not be in LD_LIBRARY_PATH.
>
> Is there a recommended method to set LD_LIBRARY_PATH and PATH when ssh
> is used to connect to other hosts when running an mpi job?
>
> thank you,
>
> Will
>
> On Wed, Sep 24, 2008 at 2:36 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> So this is a singleton comm_spawn scenario that requires you to specify
>> a launch_agent to execute? Just trying to ensure I understand.
>>
>> First, let me ensure we have a common understanding of what
>> orte_launch_agent does. Basically, that param stipulates the command to
>> be used in place of "orted" - it doesn't substitute for "ssh". So if you
>> set -mca orte_launch_agent foo, what will happen is: "ssh nodename foo"
>> instead of "ssh nodename orted".
>>
>> The intent was to provide a way to do things like run valgrind on the
>> orted itself. So you could do -mca orte_launch_agent "valgrind orted",
>> and we would dutifully run "ssh nodename valgrind orted".
>>
>> Or if you wanted to write your own orted (e.g., bar-orted), you could
>> substitute it for our "orted".
>>
>> Or if you wanted to set mca params solely to be seen on the backend
>> nodes/procs, you could set -mca orte_launch_agent "orted -mca foo bar",
>> and we would launch "ssh nodename orted -mca foo bar". This allows us to
>> set mca params without having mpirun see them - helps us to look at
>> debug output, for example, from only the backend procs.
>>
>> If what you need to do is set something in the environment for the
>> orted, there are certain cmd line options that will do that for you -
>> orte_launch_agent may or may not be a good method.
>>
>> Perhaps it would help if you could tell me exactly what you wanted to
>> have orte_launch_agent actually do?
>>
>> Thanks
>> Ralph
>>
>> On Sep 24, 2008, at 12:22 PM, Will Portnoy wrote:
>>
>>> Sorry for the miscommunication: The processes are started by my
>>> program with MPI_Comm_spawn, so there was no mpirun involved.
>>>
>>> If you can suggest a test program I can use with mpirun to validate my
>>> openmpi environment and install, that would probably produce the
>>> output you would like to see.
>>>
>>> But I'm not sure that will make it clear how the file pointed to by
>>> "orte_launch_agent" in "mca-params.conf" should be written to set up
>>> an environment and start orted.
>>>
>>> Will
>>>
>>> On Wed, Sep 24, 2008 at 2:17 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>> Afraid I am confused. This was the entire output from the job?? If so,
>>>> then that means mpirun itself wasn't able to find a launch environment
>>>> it could use, so you never got to the point of actually launching an
>>>> orted.
>>>>
>>>> Do you have ssh in your path? My best immediate guess is that you
>>>> don't, and that mpirun therefore doesn't see anything it can use to
>>>> launch a job. We have discussed internally that we need to improve
>>>> that error message - could be this is another case emphasizing that
>>>> point.
>>>>
>>>> 1.3 is fine to use - still patching some bugs, but nothing that should
>>>> impact this issue.
>>>>
>>>> Ralph
>>>>
>>>> On Sep 24, 2008, at 12:11 PM, Will Portnoy wrote:
>>>>
>>>>> That was the output with plm_base_verbose set to 99 - it's the same
>>>>> output with 1.
>>>>>
>>>>> Yes, I'd like to use ssh.
>>>>>
>>>>> orted wasn't starting properly with orte_launch_agent (which was
>>>>> needed because my environment on the target machine wasn't set up),
>>>>> so that's why I thought I would try it directly on the command line
>>>>> on localhost. I thought this was a simpler case: to verify that orted
>>>>> could find all of its necessary components without the complexity of
>>>>> everything else I'm doing.
>>>>>
>>>>> If I needed to use orte_launch_agent, how should I pass the necessary
>>>>> parameters to start orted after I set up my environment?
>>>>>
>>>>> Am I better off using trunk over 1.3?
>>>>>
>>>>> thank you,
>>>>>
>>>>> Will
>>>>>
>>>>> On Wed, Sep 24, 2008 at 2:01 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>
>>>>>> Could you rerun that with -mca plm_base_verbose 1? What environment
>>>>>> are you in - I assume rsh/ssh?
>>>>>>
>>>>>> I would like to see the cmd line being used to launch the orted.
>>>>>> What this indicates is that we are not getting the cmd line correct.
>>>>>> Could just be that some patch in the trunk didn't get completely
>>>>>> applied to the 1.3 branch.
>>>>>>
>>>>>> BTW: you probably can't run orted directly off of the cmd line. It
>>>>>> likely needs some cmd line params to get critical info.
>>>>>>
>>>>>> Ralph
>>>>>>
>>>>>> On Sep 24, 2008, at 9:47 AM, Will Portnoy wrote:
>>>>>>
>>>>>>> I'm trying to use MPI_Comm_spawn with MPI_Info's host key to spawn
>>>>>>> processes from a process not started with mpirun. This works with
>>>>>>> the host key set to the localhost's hostname, but it does not work
>>>>>>> when I use other hosts.
>>>>>>>
>>>>>>> I'm using version 1.3a1r19602. I need to use orte_launch_agent to
>>>>>>> set up my environment a bit before orted is started, but it fails
>>>>>>> with the errors listed below.
>>>>>>>
>>>>>>> When I try to run orted directly on the command line with some of
>>>>>>> the verbosity flags turned to "11", I receive the same messages.
>>>>>>>
>>>>>>> Does anybody have any suggestions?
>>>>>>>
>>>>>>> thank you,
>>>>>>>
>>>>>>> Will
>>>>>>>
>>>>>>>
>>>>>>> [fqdn:24761] mca: base: components_open: Looking for ess components
>>>>>>> [fqdn:24761] mca: base: components_open: opening ess components
>>>>>>> [fqdn:24761] mca: base: components_open: found loaded component env
>>>>>>> [fqdn:24761] mca: base: components_open: component env has no register function
>>>>>>> [fqdn:24761] mca: base: components_open: component env open function successful
>>>>>>> [fqdn:24761] mca: base: components_open: found loaded component hnp
>>>>>>> [fqdn:24761] mca: base: components_open: component hnp has no register function
>>>>>>> [fqdn:24761] mca: base: components_open: component hnp open function successful
>>>>>>> [fqdn:24761] mca: base: components_open: found loaded component singleton
>>>>>>> [fqdn:24761] mca: base: components_open: component singleton has no register function
>>>>>>> [fqdn:24761] mca: base: components_open: component singleton open function successful
>>>>>>> [fqdn:24761] mca: base: components_open: found loaded component slurm
>>>>>>> [fqdn:24761] mca: base: components_open: component slurm has no register function
>>>>>>> [fqdn:24761] mca: base: components_open: component slurm open function successful
>>>>>>> [fqdn:24761] mca: base: components_open: found loaded component tool
>>>>>>> [fqdn:24761] mca: base: components_open: component tool has no register function
>>>>>>> [fqdn:24761] mca: base: components_open: component tool open function successful
>>>>>>> [fqdn:24761] mca:base:select: Auto-selecting ess components
>>>>>>> [fqdn:24761] mca:base:select:( ess) Querying component [env]
>>>>>>> [fqdn:24761] mca:base:select:( ess) Skipping component [env]. Query failed to return a module
>>>>>>> [fqdn:24761] mca:base:select:( ess) Querying component [hnp]
>>>>>>> [fqdn:24761] mca:base:select:( ess) Skipping component [hnp]. Query failed to return a module
>>>>>>> [fqdn:24761] mca:base:select:( ess) Querying component [singleton]
>>>>>>> [fqdn:24761] mca:base:select:( ess) Skipping component [singleton]. Query failed to return a module
>>>>>>> [fqdn:24761] mca:base:select:( ess) Querying component [slurm]
>>>>>>> [fqdn:24761] mca:base:select:( ess) Skipping component [slurm]. Query failed to return a module
>>>>>>> [fqdn:24761] mca:base:select:( ess) Querying component [tool]
>>>>>>> [fqdn:24761] mca:base:select:( ess) Skipping component [tool]. Query failed to return a module
>>>>>>> [fqdn:24761] mca:base:select:( ess) No component selected!
>>>>>>> [fqdn:24761] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> It looks like orte_init failed for some reason; your parallel process
>>>>>>> is likely to abort. There are many reasons that a parallel process
>>>>>>> can fail during orte_init; some of which are due to configuration or
>>>>>>> environment problems. This failure appears to be an internal failure;
>>>>>>> here's some additional information (which may only be relevant to an
>>>>>>> Open MPI developer):
>>>>>>>
>>>>>>> orte_ess_base_select failed
>>>>>>> --> Returned value Not found (-13) instead of ORTE_SUCCESS
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> [fqdn:24761] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 315
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users