Subject: Re: [OMPI users] mpirun not working on more than one node
From: Laurin Müller (laurin.mueller_at_[hidden])
Date: 2009-11-17 10:45:32

>>> Ralph Castain 11/17/09 4:04 PM >>>
>Your cmd line is telling OMPI to run 17 processes. Since your hostfile
>indicates that only 16 of them are to run on (which I
>assume is your PS3 node?), 1 process is going to be run on (I
>assume this is node1?).
node1 has 16 cores (4 x AMD quad-core processors).

node2 is the PS3 with two processors (slots).

>I would guess that the executable is compiled to run on the PS3 given
>your specified path, so I would expect it to bomb on node1 - which is
>exactly what appears to be happening.
The executable is compiled separately on each node and sits in the same
directory path on each node.
Different directories are mounted on each node, so each node has its own
separately compiled executable at that path.

In the end I want to run R on this cluster with Rmpi. As I get a
similar problem there, I first wanted to try with a C program.

The same thing happens with R: it works when I start it on each node
separately, but if I want to start more than 16 processes on node one, it exits.

On Nov 17, 2009, at 1:59 AM, Laurin Müller wrote:

I want to build a cluster with Open MPI.
2 nodes:
node 1: 4 x AMD Quad Core, Ubuntu 9.04, Open MPI 1.3.2
node 2: Sony PS3, Ubuntu 9.04, Open MPI 1.3
Both can connect via ssh to each other and to themselves without a password.
I can run the sample program pi.c on both nodes separately (see below).
But if I try to start it on node1 with the --hostfile option to use node 2
"remotely", I get this error:
cluster_at_bioclust:~$ mpirun --hostfile
/etc/openmpi/openmpi-default-hostfile -np 17
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

my hostfile:
cluster_at_bioclust:~$ cat /etc/openmpi/openmpi-default-hostfile
slots=16
slots=2

I can see with top that the processors of node2 begin to work briefly,
then it aborts on node1.
I use this sample/test program:
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
      int i, n;
      double h, pi, x;
      int me, nprocs;
      double piece;
/* --------------------------------------------------- */
      MPI_Init (&argc, &argv);
      MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
      MPI_Comm_rank (MPI_COMM_WORLD, &me);
/* --------------------------------------------------- */
      /* Only rank 0 reads the interval count from stdin. */
      if (me == 0)
      {
         printf("%s", "Input number of intervals:\n");
         scanf ("%d", &n);
      }
/* --------------------------------------------------- */
      /* Share n with every rank. */
      MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* --------------------------------------------------- */
      h = 1. / (double) n;
      piece = 0.;
      /* Each rank integrates every nprocs-th interval (trapezoid rule). */
      for (i=me+1; i <= n; i+=nprocs)
      {
           x = (i-1)*h;
           piece = piece + ( 4/(1+(x)*(x)) + 4/(1+(x+h)*(x+h))) / 2 * h;
      }
      printf("%d: pi = %25.15f\n", me, piece);
/* --------------------------------------------------- */
      /* Sum the partial results into pi on rank 0. */
      MPI_Reduce (&piece, &pi, 1, MPI_DOUBLE,
                  MPI_SUM, 0, MPI_COMM_WORLD);
/* --------------------------------------------------- */
      if (me == 0)
         printf("pi = %25.15f\n", pi);
/* --------------------------------------------------- */
      MPI_Finalize ();
      return 0;
}

it works on each node.
cluster_at_bioclust:~$ mpirun -np 4 /mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
0: pi = 0.822248040052981
2: pi = 0.773339953424083
3: pi = 0.747089984650041
1: pi = 0.798498008827023
pi = 3.141175986954128
cluster_at_kasimir:~$ mpirun -np 2 /mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
1: pi = 1.267463056905495
0: pi = 1.867463056905495
pi = 3.13
Thx in advance,


