Open MPI User's Mailing List Archives

From: Brian Barrett (bbarrett_at_[hidden])
Date: 2007-08-28 17:07:51


On Aug 28, 2007, at 10:59 AM, Lev Givon wrote:

> Received from Brian Barrett on Tue, Aug 28, 2007 at 12:22:29PM EDT:
>> On Aug 27, 2007, at 3:14 PM, Lev Givon wrote:
>>
>>> I have Open MPI 1.2.3 installed on an XGrid cluster and a separate Mac
>>> client that I am using to submit jobs to the head (controller) node of
>>> the cluster. The cluster's compute nodes are all connected to the head
>>> node via a private network and are not running any firewalls. When I
>>> try running jobs with mpirun directly on the cluster's head node, they
>>> execute successfully; if I attempt to submit the jobs from the client
>>> (which can run jobs on the cluster using the xgrid command line tool)
>>> with mpirun, however, they appear to hang indefinitely (i.e., a job ID
>>> is created, but the mpirun itself never returns or terminates). Is it
>>> necessary to configure the firewall on the submission client to grant
>>> access to the cluster head node in order to remotely submit jobs to
>>> the cluster's head node?
>>
>> Currently, every node on which an MPI process is launched must be
>> able to open a connection to a random port on the machine running
>> mpirun. So in your case, you'd have to configure the network on the
>> cluster to be able to connect back to your workstation (and the
>> workstation would have to allow connections from all your cluster
>> nodes). Far from ideal, but it's what it is.
>>
>> Brian
>
> Can this be avoided by submitting the "mpirun -n 10 myProg" command
> directly to the controller node with the xgrid command line tool? For
> some reason, sending the above command to the cluster results in a
> "task: failed with status 255" error even though I can successfully
> submit other programs or commands to the cluster with the xgrid tool. I
> know that Open MPI on the cluster is working properly because I can run
> programs with mpirun successfully when logged into the controller node
> itself.
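
(On the firewall point quoted further up: purely as an illustration of the
workstation side, a minimal sketch using the ipfw firewall that ships with
Mac OS X might look like the following. The 192.168.2.0/24 subnet is only a
placeholder for the cluster's private network, and the cluster would also
need to be able to route packets back to the client in the first place.)

    # Hypothetical sketch, run on the submitting Mac client: allow inbound
    # TCP connections from the cluster's private subnet so the launched
    # processes can connect back to the random port mpirun listens on.
    sudo ipfw add 2000 allow tcp from 192.168.2.0/24 to me in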

Open MPI was designed to be the one calling into XGrid's scheduler, so
I'm pretty sure that you can't submit a job that just runs Open MPI's
mpirun. That wasn't really an option we considered in the original
design.
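
For reference, the intended workflow is the other way around: you run
mpirun yourself on a machine that has XGrid credentials for the
controller, and Open MPI submits the job to XGrid for you. A minimal
sketch, assuming the 1.2-series XGrid support and placeholder
hostname/password values:

    # Hypothetical sketch, run on the submitting client. Open MPI's XGrid
    # support reads the controller's hostname and password from these
    # environment variables and does the XGrid submission itself.
    export XGRID_CONTROLLER_HOSTNAME=headnode.example.com
    export XGRID_CONTROLLER_PASSWORD=secret
    mpirun -n 10 myProg

Note that the connect-back requirement from my earlier mail still
applies, so something like the firewall configuration sketched above
would still be needed to get past the hang you were seeing.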

Brian