Thanks for your response. The program that I have been using for testing
purposes is a simple hello world:
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int length;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(name, &length);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Report which host each rank landed on. */
    printf("%s: hello world from rank %d\n", name, rank);
    sleep(1);

    MPI_Finalize();
    return 0;
}
If I run this program outside a slurm environment, I get the following:
mpirun -np 4 -mca btl tcp,self -host wolf1,master ./hello
master: hello world from rank 1
wolf1: hello world from rank 0
wolf1: hello world from rank 2
master: hello world from rank 3
This is exactly what I expect. Now if I create a slurm environment using
the following:
srun -n 4 -A
The output of "printenv | grep SLURM" gives me:
SLURM_NODELIST=master,wolf1
SLURM_SRUN_COMM_PORT=58929
SLURM_MEM_BIND_TYPE=
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_MEM_BIND_LIST=
SLURM_CPU_BIND_LIST=
SLURM_NNODES=2
SLURM_JOBID=66135
SLURM_TASKS_PER_NODE=2(x2)
SLURM_SRUN_COMM_HOST=master
SLURM_CPU_BIND_TYPE=
SLURM_MEM_BIND_VERBOSE=quiet
SLURM_NPROCS=4
This seems to indicate that both master and wolf1 have been allocated and
that each node should run 2 tasks, which is correct since both master and
wolf1 are dual processor machines.
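To double-check that reading of SLURM_TASKS_PER_NODE, here is a minimal
sketch in C that totals the tasks in a value such as "2(x2)". The function
name total_tasks is hypothetical, and the sketch assumes the value is a
comma-separated list of counts, each optionally followed by a "(xN)" repeat,
which is the form shown above.

```c
#include <stdlib.h>

/* Sum the task counts in a SLURM_TASKS_PER_NODE-style string such as
 * "2(x2)" or "4(x2),1". Returns -1 if the string is malformed.
 * Assumption: entries are "count" or "count(xrepeat)", comma-separated. */
long total_tasks(const char *spec)
{
    long total = 0;
    const char *p = spec;

    while (*p != '\0') {
        char *end;
        long repeat = 1;
        long count = strtol(p, &end, 10);

        if (end == p || count < 0)
            return -1;              /* no digits, or a negative count */
        p = end;

        if (p[0] == '(' && p[1] == 'x') {
            repeat = strtol(p + 2, &end, 10);
            if (end == p + 2 || *end != ')' || repeat < 1)
                return -1;          /* broken "(xN)" suffix */
            p = end + 1;
        }

        total += count * repeat;

        if (*p == ',') {
            p++;
            if (*p == '\0')
                return -1;          /* trailing comma */
        } else if (*p != '\0') {
            return -1;              /* unexpected character */
        }
    }
    return total;
}
```

For the allocation above, total_tasks("2(x2)") gives 4, which matches
SLURM_NPROCS=4.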
Now if I run:
mpirun -np 4 -mca btl tcp,self ./hello
The output is:
master: hello world from rank 1
master: hello world from rank 2
master: hello world from rank 3
master: hello world from rank 0
All four processes are running on master and none on wolf1.
If I specify the hosts explicitly, I get the following error message:
mpirun -np 4 -host wolf1,master -mca btl tcp,self ./hello
--------------------------------------------------------------------------
Some of the requested hosts are not included in the current allocation for the
application:
./hello
The requested hosts were:
wolf1,master
Verify that you have mapped the allocated resources properly using the
--host specification.
--------------------------------------------------------------------------
[master:28022] [0,0,0] ORTE_ERROR_LOG: Out of resource in file rmgr_urm.c at line 377
[master:28022] mpirun: spawn failed with errno=-2
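The message is puzzling because both names appear in SLURM_NODELIST. As a
sanity check, a small sketch like the following can confirm that, assuming a
plain comma-separated list (bracketed ranges such as odin[013-014] are not
expanded). The helper name host_in_nodelist is hypothetical, and this is not
how Open MPI itself performs the check.

```c
#include <string.h>

/* Return 1 if host appears as a whole entry in a plain comma-separated
 * node list such as "master,wolf1", otherwise 0.
 * Assumption: the list contains no bracketed ranges. */
int host_in_nodelist(const char *nodelist, const char *host)
{
    size_t hlen = strlen(host);
    const char *p = nodelist;

    if (hlen == 0)
        return 0;

    while (*p != '\0') {
        /* Length of the current entry, up to the next comma or the end. */
        size_t len = strcspn(p, ",");

        if (len == hlen && strncmp(p, host, hlen) == 0)
            return 1;

        p += len;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

With the value shown earlier, host_in_nodelist("master,wolf1", "wolf1") and
host_in_nodelist("master,wolf1", "master") both return 1, so the environment
itself lists both hosts.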
I'm at a loss to figure out how to get this working correctly. Any help
would be greatly appreciated.
Bob
On 1/19/07, Ralph Castain <rhc_at_[hidden]> wrote:
>
> Open MPI and SLURM should work together just fine right out-of-the-box.
> The typical command progression is:
>
> srun -n x -A
> mpirun -n y .....
>
>
> If you are doing those commands and still see everything running on the
> head node, then two things could be happening:
>
> (a) you really aren't getting an allocation from slurm. Perhaps you don't
> have slurm setup correctly and aren't actually seeing the allocation in
> your environment. Do a "printenv | grep SLURM" and see if you find the
> following variables:
> SLURM_NPROCS=8
> SLURM_CPU_BIND_VERBOSE=quiet
> SLURM_CPU_BIND_TYPE=
> SLURM_CPU_BIND_LIST=
> SLURM_MEM_BIND_VERBOSE=quiet
> SLURM_MEM_BIND_TYPE=
> SLURM_MEM_BIND_LIST=
> SLURM_JOBID=47225
> SLURM_NNODES=2
> SLURM_NODELIST=odin[013-014]
> SLURM_TASKS_PER_NODE=4(x2)
> SLURM_SRUN_COMM_PORT=43206
> SLURM_SRUN_COMM_HOST=odin
>
> Obviously, the values will be different, but we really need the
> TASKS_PER_NODE and NODELIST ones to be there
>
> (b) the master node is being included in your nodelist and you aren't
> running enough mpi processes to need more nodes (i.e., the number of slots
> on the master node is greater than or equal to the num procs you
> requested). You can force Open MPI to not run on your master node by
> including "--nolocal" on your command line.
>
> Of course, if the master node is the only thing on the nodelist, this will
> cause mpirun to abort as there is nothing else for us to use.
>
> Hope that helps
> Ralph
>
>
> On 1/18/07 11:03 PM, "Robert Bicknell" <robbicknell_at_[hidden]> wrote:
>
> > I'm trying to get slurm and openmpi to work together on a debian, two
> > node cluster. Slurm and openmpi seem to work fine separately, but when
> > I try to run a mpi program in a slurm allocation, all the processes get
> > run on the master node, and not distributed to the second node. What am
> > I doing wrong?
> >
> > Bob
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>