I'm afraid we have never really supported this kind of nested invocation of mpirun. If it works with any version of OMPI, it is purely a fluke: it might work one time and then fail the next.

The problem is that we pass envars to the launched processes to control their behavior, and these conflict with what mpirun needs. We have tried various scrubbing mechanisms (i.e., having mpirun start out by scrubbing the environment of envars that would have come from the initial mpirun), but they all have the unfortunate possibility of removing parameters provided by the user, and that can cause its own problems.
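
To make the scrubbing idea and its drawback concrete, here is a minimal sketch of what such a scrub could look like if done by hand in a wrapper script. It assumes the variables in question all carry the OMPI_ prefix (for example OMPI_MCA_* parameters and OMPI_COMM_WORLD_RANK); the function name scrub_ompi_env is made up for illustration. It is not a supported workaround and may not make a nested launch work at all.

```shell
#!/bin/sh
# Sketch only: remove every environment variable whose name starts with
# OMPI_ from the current shell.
# Assumption: the envars set by the outer mpirun all use this prefix.
# Caveat: this also removes OMPI_MCA_* settings the user exported on
# purpose, which is exactly the drawback described above.
scrub_ompi_env() {
    # List variable names from the environment and keep only OMPI_* ones.
    for name in $(env | sed -n 's/^\(OMPI_[A-Za-z0-9_]*\)=.*/\1/p'); do
        unset "$name"
    done
}

# Hypothetical use inside script.sh, before the inner launch:
#   scrub_ompi_env
#   mpirun -np 2 ./mpiprogram
```

Note that the scrub cannot tell an MCA parameter the user set deliberately from one inherited from the outer mpirun, so anything the user wanted passed to the inner run has to be set again after the scrub.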

I don't know if we will ever support nested operations - occasionally, I do give it some thought, but have yet to find a foolproof solution.

Ralph


On Wed, Oct 28, 2009 at 1:11 PM, Luke Shulenburger <lshulenburger@gmail.com> wrote:
Hello,
I am having trouble with a script that calls mpi.  Basically my
problem distills to wanting to call a script with:

mpirun -np # ./script.sh

where script.sh looks like:
#!/bin/bash
mpirun -np 2 ./mpiprogram

Whenever I invoke script.sh normally (as ./script.sh for instance) it
works fine, but if I do mpirun -np 2 ./script.sh I get the following
error:

[ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
attempting to be sent to a process whose contact information is
unknown in file rml_oob_send.c at line 105
[ppv.stanford.edu:08814] [[27860,1],0] could not get route to
[[INVALID],INVALID]
[ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
attempting to be sent to a process whose contact information is
unknown in file base/plm_base_proxy.c at line 86

I have also tried running with mpirun -d to get some debugging info
and it appears that the proctable is not being created for the second
mpirun.  The command hangs like so:

[ppv.stanford.edu:08823] procdir:
/tmp/openmpi-sessions-sluke@ppv.stanford.edu_0/27855/0/0
[ppv.stanford.edu:08823] jobdir:
/tmp/openmpi-sessions-sluke@ppv.stanford.edu_0/27855/0
[ppv.stanford.edu:08823] top: openmpi-sessions-sluke@ppv.stanford.edu_0
[ppv.stanford.edu:08823] tmp: /tmp
[ppv.stanford.edu:08823] [[27855,0],0] node[0].name ppv daemon 0 arch ffc91200
[ppv.stanford.edu:08823] Info: Setting up debugger process table for
applications
 MPIR_being_debugged = 0
 MPIR_debug_state = 1
 MPIR_partial_attach_ok = 1
 MPIR_i_am_starter = 0
 MPIR_proctable_size = 1
 MPIR_proctable:
   (i, host, exe, pid) = (0, ppv.stanford.edu,
/home/sluke/maintenance/openmpi-1.3.3/examples/./shell.sh, 8824)
[ppv.stanford.edu:08825] procdir:
/tmp/openmpi-sessions-sluke@ppv.stanford.edu_0/27855/1/0
[ppv.stanford.edu:08825] jobdir:
/tmp/openmpi-sessions-sluke@ppv.stanford.edu_0/27855/1
[ppv.stanford.edu:08825] top: openmpi-sessions-sluke@ppv.stanford.edu_0
[ppv.stanford.edu:08825] tmp: /tmp
[ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
attempting to be sent to a process whose contact information is
unknown in file rml_oob_send.c at line 105
[ppv.stanford.edu:08825] [[27855,1],0] could not get route to
[[INVALID],INVALID]
[ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
attempting to be sent to a process whose contact information is
unknown in file base/plm_base_proxy.c at line 86
[ppv.stanford.edu:08825] Info: Setting up debugger process table for
applications
 MPIR_being_debugged = 0
 MPIR_debug_state = 1
 MPIR_partial_attach_ok = 1
 MPIR_i_am_starter = 0
 MPIR_proctable_size = 0
 MPIR_proctable:


In this case, it does not matter what the ultimate mpiprogram I try to
run is, the shell script fails in the same way regardless (I've tried
the hello_f90 executable from the openmpi examples directory).  Here
are some details of my setup:

I have built openmpi 1.3.3 with the Intel Fortran and C compilers
(version 11.1).  The machine uses Rocks with the SGE scheduler, so I
ran configure as ./configure --prefix=/home/sluke --with-sge;
however, this problem persists even if I am running on the head node
outside of the scheduler.  I am attaching the resulting config.log to
this email as well as the output of ompi_info --all and ifconfig.  I hope
this gives the experts on the list enough to go on, but I will be
happy to provide any more information that might be helpful.

Luke Shulenburger
Geophysical Laboratory
Carnegie Institution of Washington


PS I have tried this on a machine with openmpi-1.2.6 and cannot
reproduce the error, however on a second machine with openmpi-1.3.2 I
have the same problem.
