Open MPI User's Mailing List Archives

From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-07-18 13:59:11


Tim has proposed a clever fix that I had not thought of - just be aware that
it could cause unexpected behavior at some point. Still, for what you are
trying to do, that might meet your needs.

Ralph

On 7/18/07 11:44 AM, "Tim Prins" <tprins_at_[hidden]> wrote:

> Adam C Powell IV wrote:
>> As mentioned, I'm running in a chroot environment, so rsh and ssh won't
>> work: "rsh localhost" will rsh into the primary local host environment,
>> not the chroot, which will fail.
>>
>> [The purpose is to be able to build and test MPI programs in the Debian
>> unstable distribution, without upgrading the whole machine to unstable.
>> Though most machines I use for this purpose run Debian stable or
>> testing, the machine I'm currently using runs a very old Fedora, for
>> which I don't think OpenMPI is available.]
>
> Alright, I understand what you are trying to do now. To be honest, I
> don't think we have ever really thought about this use case. We always
> figured that to test Open MPI people would simply install it in a
> different directory and use it from there.
>
>>
>> With MPICH, mpirun -np 1 just runs the new process in the current
>> context, without rsh/ssh, so it works in a chroot. Does OpenMPI not
>> support this functionality?
>
> Open MPI does support this functionality. First, a bit of explanation:
>
> We use 'pls' (process launching system) components to handle the
> launching of processes. There are components for slurm, gridengine, rsh,
> and others. At runtime we open each of these components and query them
> as to whether they can be used. The original error you posted says that
> none of the 'pls' components can be used because each of them detected
> that it could not run in your setup. The slurm one excluded itself because
> no environment variables were set indicating it was running under
> SLURM. Similarly, the gridengine pls said it could not run. The
> 'rsh' pls said it could not run because neither 'ssh' nor 'rsh' is
> available (I assume this is the case, though you did not explicitly say
> they were not available).
>
> But in this case, you do want the 'rsh' pls to be used. It will
> automatically fork any local processes, and will use rsh/ssh to launch
> any remote processes. Again, I don't think we ever imagined the use case
> of a UNIX-like system where no launchers like SLURM are
> available and rsh/ssh isn't available either (Open MPI is, after all,
> primarily concerned with multi-node operation).
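>
> As an aside, ompi_info can show which pls components were built and what
> parameters they accept (a sketch, assuming the 1.2-era ompi_info syntax;
> the exact output depends on your build):
>
> $ ompi_info | grep pls            # list the pls components that were built
> $ ompi_info --param pls rsh       # show the rsh component's parameters,
>                                   # including pls_rsh_agent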
>
> So, there are several ways around this:
>
> 1. Make rsh or ssh available, even though they will not be used.
>
> 2. Tell the 'rsh' pls component to use a dummy program such as
> /bin/false by adding the following to the command line:
> -mca pls_rsh_agent /bin/false
>
> 3. Create a dummy 'rsh' executable that is available in your path.
>
> For instance:
>
> [tprins_at_odin ~]$ which ssh
> /usr/bin/which: no ssh in
> (/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
> [tprins_at_odin ~]$ which rsh
> /usr/bin/which: no rsh in
> (/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
> [tprins_at_odin ~]$ mpirun -np 1 hostname
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init_stage1.c at line 317
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_pls_base_select failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
>
> --------------------------------------------------------------------------
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_system_init.c at line 46
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 52
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
> orterun.c at line 399
>
> [tprins_at_odin ~]$ mpirun -np 1 -mca pls_rsh_agent /bin/false hostname
> odin.cs.indiana.edu
>
> [tprins_at_odin ~]$ touch usr/bin/rsh
> [tprins_at_odin ~]$ chmod +x usr/bin/rsh
> [tprins_at_odin ~]$ mpirun -np 1 hostname
> odin.cs.indiana.edu
> [tprins_at_odin ~]$
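>
> If you would rather not add the -mca option to every command line, the
> same setting can go in the per-user MCA parameter file instead (a sketch,
> assuming the standard ~/.openmpi/mca-params.conf location):
>
> $ mkdir -p ~/.openmpi
> $ echo "pls_rsh_agent = /bin/false" >> ~/.openmpi/mca-params.conf
> $ mpirun -np 1 hostname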
>
>
> I hope this helps,
>
> Tim
>
>>
>> Thanks,
>> Adam
>>
>> On Wed, 2007-07-18 at 11:09 -0400, Tim Prins wrote:
>>> This is strange. I assume that you want to use rsh or ssh to launch the
>>> processes?
>>>
>>> If you want to use ssh, does "which ssh" find ssh? Similarly, if you
>>> want to use rsh, does "which rsh" find rsh?
>>>
>>> Thanks,
>>>
>>> Tim
>>>
>>> Adam C Powell IV wrote:
>>>> On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:
>>>>> Adam C Powell IV wrote:
>>>>>> Greetings,
>>>>>>
>>>>>> I'm running the Debian package of OpenMPI in a chroot (with /proc
>>>>>> mounted properly), and orte_init is failing as follows:
>>>>>> [snip]
>>>>>> What could be wrong? Does orterun not run in a chroot environment?
>>>>>> What more can I do to investigate further?
>>>>> Try running mpirun with the added options:
>>>>> -mca orte_debug 1 -mca pls_base_verbose 20
>>>>>
>>>>> Then send the output to the list.
>>>> Thanks! Here's the output:
>>>>
>>>> $ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
>>>> [new-host-3:19201] mca: base: components_open: Looking for pls components
>>>> [new-host-3:19201] mca: base: components_open: distilling pls components
>>>> [new-host-3:19201] mca: base: components_open: accepting all pls components
>>>> [new-host-3:19201] mca: base: components_open: opening pls components
>>>> [new-host-3:19201] mca: base: components_open: found loaded component
>>>> gridengine
>>>> [new-host-3:19201] mca: base: components_open: component gridengine
>>>> open function successful
>>>> [new-host-3:19201] mca: base: components_open: found loaded component proxy
>>>> [new-host-3:19201] mca: base: components_open: component proxy open
>>>> function successful
>>>> [new-host-3:19201] mca: base: components_open: found loaded component rsh
>>>> [new-host-3:19201] mca: base: components_open: component rsh open function
>>>> successful
>>>> [new-host-3:19201] mca: base: components_open: found loaded component slurm
>>>> [new-host-3:19201] mca: base: components_open: component slurm open
>>>> function successful
>>>> [new-host-3:19201] orte:base:select: querying component gridengine
>>>> [new-host-3:19201] pls:gridengine: NOT available for selection
>>>> [new-host-3:19201] orte:base:select: querying component proxy
>>>> [new-host-3:19201] orte:base:select: querying component rsh
>>>> [new-host-3:19201] orte:base:select: querying component slurm
>>>> [new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file
>>>> runtime/orte_init_stage1.c at line 312
>>>> --------------------------------------------------------------------------
>>>> It looks like orte_init failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during orte_init; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>> orte_pls_base_select failed
>>>> --> Returned value -1 instead of ORTE_SUCCESS
>>>>
>>>> --------------------------------------------------------------------------
>>>> [new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file
>>>> runtime/orte_system_init.c at line 42
>>>> [new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file
>>>> runtime/orte_init.c at line 52
>>>> --------------------------------------------------------------------------
>>>> Open RTE was unable to initialize properly. The error occured while
>>>> attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.
>>>> --------------------------------------------------------------------------
>>>>
>>>> -Adam
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users