
Open MPI User's Mailing List Archives


From: Robert Bicknell (robbicknell_at_[hidden])
Date: 2007-01-19 13:50:21


Thanks for your response. The program that I have been using for testing
purposes is a simple hello:

#include <stdio.h>
#include <unistd.h>

#include <mpi.h>

int main(int argc, char **argv)
{
  char name[BUFSIZ];
  int length;
  int rank;

  MPI_Init(&argc, &argv);

  /* Report which host this process landed on, and its rank. */
  MPI_Get_processor_name(name, &length);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

// while(1) {
  printf("%s: hello world from rank %d\n", name, rank);
  sleep(1);
// }

  MPI_Finalize();
  return 0;
}
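
For reference, the program is compiled with the Open MPI wrapper compiler (assuming a standard install; adjust paths as needed):

mpicc -o hello hello.c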

If I run this program outside of a slurm environment, I get the following:

mpirun -np 4 -mca btl tcp,self -host wolf1,master ./hello

master: hello world from rank 1
wolf1: hello world from rank 0
wolf1: hello world from rank 2
master: hello world from rank 3

This is exactly what I expect. Now, if I create a slurm allocation using the
following:

srun -n 4 -A

The output of printenv | grep SLURM gives me:

SLURM_NODELIST=master,wolf1
SLURM_SRUN_COMM_PORT=58929
SLURM_MEM_BIND_TYPE=
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_MEM_BIND_LIST=
SLURM_CPU_BIND_LIST=
SLURM_NNODES=2
SLURM_JOBID=66135
SLURM_TASKS_PER_NODE=2(x2)
SLURM_SRUN_COMM_HOST=master
SLURM_CPU_BIND_TYPE=
SLURM_MEM_BIND_VERBOSE=quiet
SLURM_NPROCS=4

This seems to indicate that both master and wolf1 have been allocated
(SLURM_NNODES=2), and that each node should run 2 tasks
(SLURM_TASKS_PER_NODE=2(x2)), which is correct since both master and wolf1
are dual-processor machines.
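
For what it's worth, the same allocation should also be visible via squeue (a
standard SLURM command; the exact output columns vary by version):

squeue -j 66135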

Now if I run:

mpirun -np 4 -mca btl tcp,self ./hello

The output is:

master: hello world from rank 1
master: hello world from rank 2
master: hello world from rank 3
master: hello world from rank 0

All four processes are running on master and none on wolf1.

If I try the following and specify the hosts explicitly, I get the following
error message:

mpirun -np 4 -host wolf1,master -mca btl tcp,self ./hello

--------------------------------------------------------------------------
Some of the requested hosts are not included in the current allocation for
the
application:
  ./hello
The requested hosts were:
  wolf1,master

Verify that you have mapped the allocated resources properly using the
--host specification.
--------------------------------------------------------------------------
[master:28022] [0,0,0] ORTE_ERROR_LOG: Out of resource in file rmgr_urm.c at
line 377
[master:28022] mpirun: spawn failed with errno=-2

I'm at a loss to figure out how to get this working correctly. Any help
would be greatly appreciated.

Bob

On 1/19/07, Ralph Castain <rhc_at_[hidden]> wrote:
>
> Open MPI and SLURM should work together just fine right out-of-the-box.
> The
> typical command progression is:
>
> srun -n x -A
> mpirun -n y .....
>
>
> If you are doing those commands and still see everything running on the
> head
> node, then two things could be happening:
>
> (a) you really aren't getting an allocation from slurm. Perhaps you don't
> have slurm set up correctly and aren't actually seeing the allocation in
> your
> environment. Do a "printenv | grep SLURM" and see if you find the
> following
> variables:
> SLURM_NPROCS=8
> SLURM_CPU_BIND_VERBOSE=quiet
> SLURM_CPU_BIND_TYPE=
> SLURM_CPU_BIND_LIST=
> SLURM_MEM_BIND_VERBOSE=quiet
> SLURM_MEM_BIND_TYPE=
> SLURM_MEM_BIND_LIST=
> SLURM_JOBID=47225
> SLURM_NNODES=2
> SLURM_NODELIST=odin[013-014]
> SLURM_TASKS_PER_NODE=4(x2)
> SLURM_SRUN_COMM_PORT=43206
> SLURM_SRUN_COMM_HOST=odin
>
> Obviously, the values will be different, but we really need the
> TASKS_PER_NODE and NODELIST ones to be there
>
> (b) the master node is being included in your nodelist and you aren't
> running enough mpi processes to need more nodes (i.e., the number of slots
> on the master node is greater than or equal to the num procs you
> requested).
> You can force Open MPI to not run on your master node by including
> "--nolocal" on your command line.
>
> Of course, if the master node is the only thing on the nodelist, this will
> cause mpirun to abort as there is nothing else for us to use.
>
> Hope that helps
> Ralph
>
>
> On 1/18/07 11:03 PM, "Robert Bicknell" <robbicknell_at_[hidden]> wrote:
>
> > I'm trying to get slurm and openmpi to work together on a debian, two
> > node cluster. Slurm and openmpi seem to work fine separately, but when
> > I try to run a mpi program in a slurm allocation, all the processes get
> > run on the master node, and not distributed to the second node. What am
> > I doing wrong?
> >
> > Bob