Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun not working on more than one node
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2009-11-17 10:52:17


I noticed that you also have different versions of OMPI: 1.3.2 on
node1 and 1.3 on node2.
Can you try installing the same version of OMPI on both nodes?
Can you also try running with -np 16 on node1 when you run it separately?
Lenny.
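
The version check suggested above can be sketched in a few lines of C. This is only an illustration: `parse_version` and `same_version` are hypothetical helper names, the version strings are assumed to have been collected by hand on each node (for example from the output of `ompi_info`), and treating a missing release number as 0 is an assumption made here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse "major.minor[.release]" into three ints.
 * Missing components default to 0, so "1.3" is read as 1.3.0 (assumption).
 * Returns 1 if at least the major number could be read, else 0. */
int parse_version(const char *s, int v[3])
{
    int n;
    assert(s != NULL);
    v[0] = v[1] = v[2] = 0;
    n = sscanf(s, "%d.%d.%d", &v[0], &v[1], &v[2]);
    return n >= 1;
}

/* Returns 1 if both strings parse and denote the same version, else 0. */
int same_version(const char *a, const char *b)
{
    int va[3], vb[3];
    if (!parse_version(a, va) || !parse_version(b, vb))
        return 0;
    return va[0] == vb[0] && va[1] == vb[1] && va[2] == vb[2];
}
```

With the versions reported in this thread, `same_version("1.3.2", "1.3")` returns 0, which is the mismatch pointed out above.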

On Tue, Nov 17, 2009 at 5:45 PM, Laurin Müller <laurin.mueller_at_[hidden]> wrote:

>
>
> >>> Ralph Castain 11/17/09 4:04 PM >>>
>
> >Your cmd line is telling OMPI to run 17 processes. Since your hostfile
> >indicates that only 16 of them are to run on 10.4.23.107 (which I assume is
> >your PS3 node?), 1 process is going to be run on 10.4.1.23 (I assume this
> >is node1?).
>
> node1 has 16 cores (4 x AMD Quad Core processors)
>
> node2 is the ps3 with two processors (slots)
>
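
The placement described in the quoted reply (16 processes on the first hostfile entry, the 17th on the second) can be illustrated with a small sketch. It assumes a simplified fill-in-hostfile-order rule; `map_by_slot` is a hypothetical name, not an Open MPI function, and real launchers apply further rules (such as oversubscription handling) that are ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified by-slot placement: fill each host's slots in hostfile order.
 * counts[i] receives the number of processes placed on host i.
 * Returns the number of processes that did not fit (0 if all were placed). */
int map_by_slot(const int *slots, int nhosts, int np, int *counts)
{
    int i, take;
    assert(slots != NULL && counts != NULL && nhosts >= 0 && np >= 0);
    for (i = 0; i < nhosts; i++) {
        take = np < slots[i] ? np : slots[i];  /* as many as fit on this host */
        counts[i] = take;
        np -= take;
    }
    return np;
}
```

For the hostfile in this thread (slots 16 and 2) and -np 17, the sketch yields counts of 16 and 1 with nothing left over; with -np 16 the second host receives no process at all.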
>
> >I would guess that the executable is compiled to run on the PS3 given your
> >specified path, so I would expect it to bomb on node1 - which is exactly
> >what appears to be happening.
>
> The executable is compiled separately on each node and sits in the same
> directory on each node:
>
> /mnt/projects/PS3Cluster/Benchmark/pi
>
> Different directories are mounted on each node, so there is a separate
> executable file compiled on each node.
>
> In the end I want to run R on this cluster with Rmpi. Since I get a similar
> problem there, I first wanted to try with a C program.
>
> The same thing happens with R: it works when I start it on each node, but
> if I want to start more than 16 processes on node one, it exits.
>
>
> On Nov 17, 2009, at 1:59 AM, Laurin Müller wrote:
>
> Hi,
>
> I want to build a cluster with openmpi.
>
> 2 nodes:
> node 1: 4 x AMD Quad Core, ubuntu 9.04, openmpi 1.3.2
> node 2: Sony PS3, ubuntu 9.04, openmpi 1.3
>
> Both can connect via ssh to each other and to themselves without a password.
>
> I can run the sample program pi.c on both nodes separately (see below). But
> if I try to start it on node1 with the --hostfile option to use node 2
> "remotely", I get this error:
>
> cluster_at_bioclust:~$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 17 /mnt/projects/PS3Cluster/Benchmark/pi
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> My hostfile:
>
> cluster_at_bioclust:~$ cat /etc/openmpi/openmpi-default-hostfile
> 10.4.23.107 slots=16
> 10.4.1.23 slots=2
>
> I can see with top that the processors of node2 start working briefly, then
> it aborts on node1.
>
> I use this sample/test program:
> #include <stdio.h>
> #include <stdlib.h>
> #include "mpi.h"
>
> int main(int argc, char *argv[])
> {
>     int i, n;
>     double h, pi, x;
>     int me, nprocs;
>     double piece;
>
>     /* Start MPI; get the number of processes and this process's rank. */
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>
>     /* Only rank 0 reads the number of intervals from stdin. */
>     if (me == 0)
>     {
>         printf("%s", "Input number of intervals:\n");
>         scanf("%d", &n);
>     }
>
>     /* Send n from rank 0 to all other ranks. */
>     MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
>
>     /* Trapezoidal rule for 4/(1+x^2) on [0,1]; rank me handles
>        intervals me+1, me+1+nprocs, ... */
>     h = 1. / (double) n;
>     piece = 0.;
>     for (i = me + 1; i <= n; i += nprocs)
>     {
>         x = (i - 1) * h;
>         piece = piece + (4 / (1 + (x) * (x)) + 4 / (1 + (x + h) * (x + h))) / 2 * h;
>     }
>     /* Each rank prints its partial sum. */
>     printf("%d: pi = %25.15f\n", me, piece);
>
>     /* Sum the partial results on rank 0. */
>     MPI_Reduce(&piece, &pi, 1, MPI_DOUBLE,
>                MPI_SUM, 0, MPI_COMM_WORLD);
>
>     if (me == 0)
>     {
>         printf("pi = %25.15f\n", pi);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
> It works on each node.
>
> node1:
> cluster_at_bioclust:~$ mpirun -np 4 /mnt/projects/PS3Cluster/Benchmark/pi
> Input number of intervals:
> 20
> 0: pi = 0.822248040052981
> 2: pi = 0.773339953424083
> 3: pi = 0.747089984650041
> 1: pi = 0.798498008827023
> pi = 3.141175986954128
>
> node2:
> cluster_at_kasimir:~$ mpirun -np 2 /mnt/projects/PS3Cluster/Benchmark/pi
> Input number of intervals:
> 5
> 1: pi = 1.267463056905495
> 0: pi = 1.867463056905495
> pi = 3.134926113810990
> cluster_at_kasimir:~$
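
As a cross-check of the per-rank output above, the work split used by the quoted pi.c can be reproduced without MPI. The following is a serial sketch under that assumption: `partial_pi` and `reduce_pi` are hypothetical names, and the loop over ranks merely stands in for the `MPI_Reduce` call with `MPI_SUM`.

```c
#include <assert.h>

/* Partial trapezoid sum that rank `me` of `nprocs` would compute for
 * n intervals on [0, 1], using the same cyclic split as the quoted pi.c:
 * rank me handles intervals me+1, me+1+nprocs, ... */
double partial_pi(int me, int nprocs, int n)
{
    double h, x, piece = 0.0;
    int i;
    assert(n > 0 && nprocs > 0 && me >= 0 && me < nprocs);
    h = 1.0 / (double) n;
    for (i = me + 1; i <= n; i += nprocs) {
        x = (i - 1) * h;
        piece += (4.0 / (1.0 + x * x) + 4.0 / (1.0 + (x + h) * (x + h))) / 2.0 * h;
    }
    return piece;
}

/* Serial stand-in for the sum reduction: add up every rank's piece. */
double reduce_pi(int nprocs, int n)
{
    double pi = 0.0;
    int me;
    for (me = 0; me < nprocs; me++)
        pi += partial_pi(me, nprocs, n);
    return pi;
}
```

Calling `reduce_pi(2, 5)` corresponds to the node2 run quoted above, and `reduce_pi(4, 20)` to the node1 run; the results should agree with the printed totals up to floating-point rounding.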
>
> Thx in advance,
> Laurin
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users