I am in the process of setting up a grid engine (SGE) cluster for running
Open MPI applications. I'll detail the set up below, but my current problem
is that this call to Spawn_multiple never seems to return:
// Spawn all of the child processes.
_intercomm = MPI::COMM_WORLD.Spawn_multiple( _nProc,
    const_cast<const char **>(_command),
    const_cast<const char ***>(_arg),
    _maxProc, _info, 0, errCode );
I'm new to both SGE and MPI, which is making this problem difficult for me.
I can schedule simple (non-MPI) jobs on the SGE grid with qsub.
I can use qsub to schedule multiple copies of a simple Hello World type of
application, using mpirun to spawn the processes in a script like this:
#$ -S /bin/sh
#$ -pe orte 4
#$ -j yes
mpirun -np 4 ./mpihello $*
That seems to work. The processes report the hostname where they were run,
and they appear to be scheduled on different machines in my SGE grid.
The problem is with a program, mpitest, that tries to use Spawn_multiple to
launch multiple child processes. The script that I submit to the SGE grid
looks like this:
#$ -S /bin/sh
#$ -pe orte 1-
#$ -j yes
The mpitest program is the one that calls Spawn_multiple. In this case, it
just tries to run multiple copies of itself. If I restrict my SGE
configuration so that the orte parallel environment has to run all jobs on a
single host, then mpitest runs to completion, spawning 4 "child" processes
that are scheduled via SGE to run on the same host as the root process. The
processes Send and Recv some messages, and the program exits.
If I permit SGE to schedule jobs on multiple hosts, then the child processes
appear to be scheduled and launched. (That is, I can see them as children
of the sge_execd and sge_shepherd processes on various machines.) But the
original call to Spawn_multiple doesn't appear to return in the root
mpitest. I assume that there's some problem setting up the communications
channel among the different processes, but it's possible that my mpitest
code is just buggy. I already tried disabling the firewall on all of the
machines. I'm not sure how else to get useful debug information at this
stage of the troubleshooting.
It would be great if someone could look at the attached code and just let me
know whether what I'm doing is horribly incorrect. If it should work, then
I can focus on systems and SGE configuration issues. If the code is broken
and really shouldn't work, then I'd like to fix that first, of course.
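
For reference, here is a minimal sketch of how I understand the argument
arrays for Spawn_multiple need to be laid out: one NULL-terminated list of
arguments per command (without the command name itself), gathered into a
table with one entry per command. The helper names makeArgv and
makeArgvTable are hypothetical, chosen for illustration only, and are not
taken from the attached code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: build the NULL-terminated argument list for one
// spawned command. The pointers refer into `args`, so `args` must outlive
// the returned vector and must not be modified while it is in use.
std::vector<char*> makeArgv(std::vector<std::string>& args) {
    std::vector<char*> argv;
    for (std::string& arg : args) {
        argv.push_back(&arg[0]);
    }
    argv.push_back(nullptr);  // each argument list must end with NULL
    return argv;
}

// Hypothetical helper: collect one argv pointer per command into the table
// that would be passed as the array-of-argv argument (table.data() has
// type char***).
std::vector<char**> makeArgvTable(std::vector<std::vector<char*>>& argvs) {
    std::vector<char**> table;
    for (std::vector<char*>& argv : argvs) {
        // Every per-command list must be non-empty and NULL-terminated.
        assert(!argv.empty() && argv.back() == nullptr);
        table.push_back(argv.data());
    }
    return table;
}
```

If the arrays are built this way, table.data() would be the value that gets
cast and passed in the position of _arg in the call above. A missing NULL
terminator in any one list is the kind of mistake this layout is meant to
rule out.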