Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.2.5 and globus-4.0.5
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-03-15 03:41:20


My apologies - I should have read the note more closely. Still trying to
leaf through the umpteen emails that arrived during my vacation. ;-)

I take it that "globus-job-run" is executing some script that eventually
calls our mpirun? Your script must be doing some command-line parsing as I
do not recognize some of those options - it would help to see the eventual
command line being given to mpirun.

The problem here is that mpirun looks at all its available launchers to see
what can work. In this case, it duly noted that launchers for the managed
environments (e.g., Torque and SLURM) will not work.

This leaves only the rsh launcher. The rsh launcher looks for "ssh" or (if
that isn't found) "rsh" to be in the path by default. In your case, it
clearly didn't find either one, so the rsh launcher indicated that it also
would not work.
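
The lookup described above can be approximated from a shell to see what the launcher would find. This is only a minimal sketch, assuming a POSIX shell; find_launch_agent is a hypothetical helper name, not something Open MPI provides, and it only imitates the default "ssh first, then rsh" search order.

```shell
# Sketch only: imitate the rsh launcher's default search (ssh, then rsh).
# find_launch_agent is a hypothetical helper, not part of Open MPI.
find_launch_agent() {
    for agent in ssh rsh; do
        # command -v succeeds only if the name resolves in the current PATH
        if command -v "$agent" >/dev/null 2>&1; then
            echo "$agent"
            return 0
        fi
    done
    echo "neither ssh nor rsh found in PATH" >&2
    return 1
}

# Print the agent that would be picked, or show the PATH that was searched.
find_launch_agent || echo "PATH seen by this shell: $PATH"
```

It may be worth running this from inside whatever script globus-job-run ends up executing, since the PATH there is not necessarily the same as in your interactive shell.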

As a result, mpirun aborts because no mechanism to launch processes could be
found.

Your choices remain the same as what I previously described, however. IIRC,
launching on the grid requires communication with the Globus daemons on each
of the target machines, possibly interaction with the Globus security
manager, etc. ORTE doesn't know how to do any of these things, so you will
either have to tell it how to do so, or use the "standalone" launch method.

Alternatively, if you believe you can use some ssh-like variant, then you
can provide that command to ORTE in place of the default "ssh". The
parameter would be -mca pls_rsh_agent my_ssh_replacement. Be sure this
replacement command is in your path, or provide the absolute pathname of it.
Note that the replacement command -must- accept command line options similar
to those of ssh - ORTE will replace "ssh" with whatever you give it, but the
rest of the command line will be built as if the command was "ssh".
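
As a rough illustration of passing a replacement agent, here is a minimal sketch, assuming a POSIX shell. The name run_with_agent is a hypothetical helper, and my_ssh_replacement is just the placeholder used above; only the -mca pls_rsh_agent parameter itself comes from Open MPI. The helper checks that the replacement resolves (through the path or as an absolute pathname) before handing it to mpirun.

```shell
# Sketch only: run mpirun with a replacement for the default "ssh" agent.
# run_with_agent is a hypothetical helper; the first argument is the agent,
# the remaining arguments are passed to mpirun unchanged.
run_with_agent() {
    agent=$1
    shift
    # The agent must be in PATH or be given as an absolute pathname.
    if ! command -v "$agent" >/dev/null 2>&1; then
        echo "$agent not found; put it in PATH or give its absolute pathname" >&2
        return 1
    fi
    mpirun -mca pls_rsh_agent "$agent" "$@"
}

# Example call (not executed here):
#   run_with_agent my_ssh_replacement -np 2 ./hypercube 0
```

The check only confirms that the command can be found. Whether the replacement actually accepts ssh-style options is something you would still have to verify yourself.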

FWIW, there are people working on integrating a Globus-aware launcher into
ORTE. I'm not entirely sure when that will be completed (it will not be
back-ported to the 1.2.x series), nor if/when that code would become part of
the OMPI distribution.

Hope that helps.
Ralph

On 3/14/08 9:01 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:

> The problem here is that you are attempting to start the application
> processes without using our mpirun. We call this a "standalone" launch.
>
> Unfortunately, OMPI doesn't currently understand how to do a standalone
> launch - ORTE will get confused and abort, as you experienced. There are two
> ways to fix this:
>
> 1. Someone could write a Globus launcher for ORTE. I don't think this would
> be terribly hard. You would then use our mpirun to start the job after
> getting an allocation via some grid-compatible resource manager.
>
> 2. Once we get standalone operations working, you could do what you tried.
> You will likely have to write an ESS component for Globus so the processes
> can figure out their rank.
>
> I have done some prototyping for standalone launch, and expect to have at
> least one working example in our development trunk later this month.
> However, we currently don't plan to release standalone support until
> probably 1.3.2, which likely won't come out for a few months.
>
> Hope that helps
> Ralph
>
>
> On 3/14/08 5:40 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>
>> I don't know if anyone has tried to run Open MPI with globus before.
>>
>> One requirement that Open MPI currently has is that all nodes must be
>> reachable to each other via TCP. Is that true in your globus
>> environment?
>>
>>
>>
>> On Mar 10, 2008, at 11:01 AM, Christoph Spielmann wrote:
>>
>>> Hi everybody!
>>>
>>> I am trying to get Open MPI and Globus to cooperate. These are the steps I
>>> executed in order to get Open MPI working:
>>>
>>> * export PATH=/opt/openmpi/bin/:$PATH
>>> * /opt/globus/setup/globus/setup-globus-job-manager-fork
>>> checking for mpiexec... /opt/openmpi/bin//mpiexec
>>> checking for mpirun... /opt/openmpi/bin//mpirun
>>> find-fork-tools: creating ./config.status
>>> config.status: creating fork.pm
>>> * restart VDT (includes GRAM, WSGRAM, mysql, rls...)
>>>
>>> As you can see, the necessary Open MPI executables are recognized
>>> correctly by setup-globus-job-manager-fork. But when I actually try
>>> to execute a simple MPI program using globus-job-run, I get this:
>>>
>>> globus-job-run localhost -x '(jobType=mpi)' -np 2 -s ./hypercube 0
>>> [hydra:10168] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init_stage1.c at line 312
>>> --------------------------------------------------------------------------
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems. This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>>
>>> orte_pls_base_select failed
>>> --> Returned value -1 instead of ORTE_SUCCESS
>>>
>>> --------------------------------------------------------------------------
>>> [hydra:10168] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_system_init.c at line 42
>>> [hydra:10168] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 52
>>> --------------------------------------------------------------------------
>>> Open RTE was unable to initialize properly. The error occured while
>>> attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.
>>> --------------------------------------------------------------------------
>>>
>>> The MPI program itself is okay:
>>>
>>> which mpirun && mpirun -np 2 hypercube 0
>>> /opt/openmpi/bin/mpirun
>>> Process 0 received broadcast message 'MPI_Broadcast with hypercube topology' from Process 0
>>> Process 1 received broadcast message 'MPI_Broadcast with hypercube topology' from Process 0
>>>
>>>
>>> From what I have read on the mailing list, I think that something is
>>> wrong with the pls and Globus. But I have no idea what could be
>>> wrong, let alone how it could be fixed ;). So if someone has an
>>> idea how this could be fixed, I'd be glad to hear it.
>>>
>>> Regards,
>>>
>>> Christoph