Received from Brian Barrett on Tue, Aug 28, 2007 at 05:07:51PM EDT:
> On Aug 28, 2007, at 10:59 AM, Lev Givon wrote:
> > Received from Brian Barrett on Tue, Aug 28, 2007 at 12:22:29PM EDT:
> >> On Aug 27, 2007, at 3:14 PM, Lev Givon wrote:
> >>> I have Open MPI 1.2.3 installed on an Xgrid cluster and a separate
> >>> Mac client that I am using to submit jobs to the head (controller)
> >>> node of the cluster. The cluster's compute nodes are all connected
> >>> to the head node via a private network and are not running any
> >>> firewalls. When I run jobs with mpirun directly on the cluster's
> >>> head node, they execute successfully; if I attempt to submit the
> >>> jobs from the client (which can run jobs on the cluster using the
> >>> xgrid command line tool) with mpirun, however, they appear to hang
> >>> indefinitely (i.e., a job ID is created, but mpirun itself never
> >>> returns or terminates). Is it necessary to configure the firewall
> >>> on the submission client to grant access to the cluster head node
> >>> in order to remotely submit jobs to the cluster's head node?
> >> Currently, every node on which an MPI process is launched must be
> >> able to open a connection to a random port on the machine running
> >> mpirun. So in your case, you'd have to configure the network on the
> >> cluster to be able to connect back to your workstation (and the
> >> workstation would have to allow connections from all your cluster
> >> nodes). Far from ideal, but it is what it is.
> >> Brian
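For anyone else hitting this: on a Mac OS X client of that vintage, the
firewall change Brian describes might look roughly like the following with
ipfw. The subnet and rule number below are placeholders for illustration,
not values from this thread; substitute whatever network your cluster
nodes actually originate from.

```shell
# On the submitting Mac client: allow inbound TCP connections from the
# cluster's nodes so they can reach mpirun's randomly chosen port.
# 192.168.1.0/24 and rule number 100 are placeholders -- adjust both
# to match your cluster's network and your existing ipfw rule set.
sudo ipfw add 100 allow tcp from 192.168.1.0/24 to me
```

Note that this only helps if the compute nodes can route to the client at
all; if they sit behind the head node on a NATed private network, the
connection back to the workstation may never arrive regardless of the
firewall.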
> > Can this be avoided by submitting the "mpirun -n 10 myProg" command
> > directly to the controller node with the xgrid command line tool? For
> > some reason, sending the above command to the cluster results in a
> > "task: failed with status 255" error even though I can successfully
> > run other programs or commands on the cluster with the xgrid tool. I
> > know that Open MPI on the cluster is working properly because I can run
> > programs with mpirun successfully when logged into the controller node
> > itself.
> Open MPI was designed to be the one calling XGrid's scheduling
> algorithm, so I'm pretty sure that you can't submit a job that just
> runs Open MPI's mpirun. That wasn't really an option in our original
> design space.
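For reference, the kind of submission I had attempted (and which fails
with status 255) looked roughly like the following; the controller
hostname, password, and mpirun path are placeholders, not the actual
values from my setup.

```shell
# Attempted (unsupported) approach: submit mpirun itself as an Xgrid
# task from the client. Hostname, password, and install path are
# placeholders for illustration.
xgrid -h controller.example.com -p secret -job run \
    /usr/local/bin/mpirun -n 10 myProg
```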
Apart from employing a grid package with more features than Xgrid
(e.g., Sun Grid Engine), is anyone aware of a mechanism that would
allow MPI jobs to be submitted to a cluster's head node from remote
submit hosts without having to provide every user with an actual Unix
account on the head node?