Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem calling mpirun from script invoked with mpirun
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-10-28 15:48:10


I'm afraid we have never really supported this kind of nested invocation of
mpirun. If it works with any version of OMPI, it is purely a fluke: it
might work one time and then fail the next.

The problem is that we pass environment variables (envars) to the launched
processes to control their behavior, and these conflict with what mpirun
needs. We have tried various scrubbing mechanisms (i.e., having mpirun start
out by scrubbing the environment of envars that would have come from the
initial mpirun), but they all have the unfortunate possibility of removing
parameters provided by the user, and that can cause its own problems.
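
To make the trade-off concrete, here is a minimal sketch of what such a
scrubbing step could look like if done by hand inside the wrapper script.
It is an illustration, not a supported workaround. The function name
scrub_ompi_env is hypothetical, and the sketch assumes that the envars set
by the outer mpirun all begin with the OMPI_ prefix. Note that it also
removes any OMPI_MCA_* parameters the user set deliberately, which is
exactly the drawback described above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Open MPI): unset every environment
# variable whose name starts with OMPI_.
# Assumption: everything the outer mpirun injects uses this prefix.
# Caveat: user-supplied OMPI_MCA_* settings are removed as well.
scrub_ompi_env() {
    for name in $(env | sed -n 's/^\(OMPI_[A-Za-z0-9_]*\)=.*/\1/p'); do
        unset "$name"
    done
}

# Intended use inside script.sh (nested mpirun is still unsupported
# and may fail even after scrubbing):
#   scrub_ompi_env
#   mpirun -np 2 ./mpiprogram
```

Even with the environment cleaned this way, nothing guarantees that the
inner mpirun will behave, so treat it as an experiment only.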

I don't know if we will ever support nested operations - occasionally, I do
give it some thought, but have yet to find a foolproof solution.

Ralph

On Wed, Oct 28, 2009 at 1:11 PM, Luke Shulenburger
<lshulenburger_at_[hidden]> wrote:

> Hello,
> I am having trouble with a script that calls mpi. Basically my
> problem distills to wanting to call a script with:
>
> mpirun -np # ./script.sh
>
> where script.sh looks like:
> #!/bin/bash
> mpirun -np 2 ./mpiprogram
>
> Whenever I invoke script.sh normally (as ./script.sh for instance) it
> works fine, but if I do mpirun -np 2 ./script.sh I get the following
> error:
>
> [ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is
> unknown in file rml_oob_send.c at line 105
> [ppv.stanford.edu:08814] [[27860,1],0] could not get route to
> [[INVALID],INVALID]
> [ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is
> unknown in file base/plm_base_proxy.c at line 86
>
> I have also tried running with mpirun -d to get some debugging info
> and it appears that the proctable is not being created for the second
> mpirun. The command hangs like so:
>
> [ppv.stanford.edu:08823] procdir:
> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/0/0
> [ppv.stanford.edu:08823] jobdir:
> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/0
> [ppv.stanford.edu:08823] top: openmpi-sessions-sluke_at_[hidden]_0
> [ppv.stanford.edu:08823] tmp: /tmp
> [ppv.stanford.edu:08823] [[27855,0],0] node[0].name ppv daemon 0 arch
> ffc91200
> [ppv.stanford.edu:08823] Info: Setting up debugger process table for
> applications
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 1
> MPIR_proctable:
> (i, host, exe, pid) = (0, ppv.stanford.edu,
> /home/sluke/maintenance/openmpi-1.3.3/examples/./shell.sh, 8824)
> [ppv.stanford.edu:08825] procdir:
> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/1/0
> [ppv.stanford.edu:08825] jobdir:
> /tmp/openmpi-sessions-sluke_at_[hidden]_0/27855/1
> [ppv.stanford.edu:08825] top: openmpi-sessions-sluke_at_[hidden]_0
> [ppv.stanford.edu:08825] tmp: /tmp
> [ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is
> unknown in file rml_oob_send.c at line 105
> [ppv.stanford.edu:08825] [[27855,1],0] could not get route to
> [[INVALID],INVALID]
> [ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is
> unknown in file base/plm_base_proxy.c at line 86
> [ppv.stanford.edu:08825] Info: Setting up debugger process table for
> applications
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 0
> MPIR_proctable:
>
>
> It does not matter which mpiprogram I ultimately try to run; the
> shell script fails in the same way regardless (I've tried the
> hello_f90 executable from the openmpi examples directory). Here
> are some details of my setup:
>
> I have built openmpi 1.3.3 with the Intel Fortran and C compilers
> (version 11.1). The machine uses Rocks with the SGE scheduler, so I
> ran configure as ./configure --prefix=/home/sluke --with-sge.
> However, this problem persists even if I am running on the head node
> outside of the scheduler. I am attaching the resulting config.log to
> this email, as well as the output of ompi_info --all and ifconfig. I
> hope this gives the experts on the list enough to go on, but I will be
> happy to provide any more information that might be helpful.
>
> Luke Shulenburger
> Geophysical Laboratory
> Carnegie Institution of Washington
>
>
> PS I have tried this on a machine with openmpi-1.2.6 and cannot
> reproduce the error there. However, on a second machine with
> openmpi-1.3.2 I have the same problem.