Thanks, that's it!

It would have been straightforward, but there are a lot of things to consider when setting up a cluster for the first time - a lot that is easy to overlook.

Anyway, thanks for your help.

>>> Ralph Castain <rhc@open-mpi.org> 18.11.2009 15:57 >>>
Bingo! This is why we ask for info on how you configure OMPI :-)

You need to rebuild OMPI with --enable-heterogeneous. Because there is additional overhead associated with running hetero configurations, and so few people do so, it is disabled by default.
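For anyone else hitting this, a minimal rebuild sketch (the install prefix and job count here are examples, not from this thread - adjust to your layout):

```shell
# Check whether the installed Open MPI was built with heterogeneous support;
# a default build should report "Heterogeneous support: no"
ompi_info | grep -i hetero

# From the Open MPI source tree, reconfigure and rebuild with hetero support
./configure --prefix=/opt/openmpi-1.3.2 --enable-heterogeneous
make -j4
make install
```

Remember to do this on every node, and to keep the versions identical.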


On Nov 18, 2009, at 2:55 AM, Laurin Müller wrote:

Now I have the same Open MPI version on both nodes: 1.3.2.

Recalculated on both nodes, and it works again on each node separately:
 
node1:
cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --version
mpirun (Open MPI) 1.3.2
cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 4 /mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
20
1: pi =         0.798498008827023
2: pi =         0.773339953424083
3: pi =         0.747089984650041
0: pi =         0.822248040052981
pi =         3.141175986954128
node2 (PS3):
root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun --version
mpirun (Open MPI) 1.3.2
[...]
root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun -np 2 pi
Input number of intervals:
20
0: pi =         1.595587993477064
1: pi =         1.545587993477064
pi =         3.141175986954128
BUT when I start it on node1 with more than 16 processes and the hostfile, I get these errors:
cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 17 /mnt/projects/PS3Cluster/Benchmark/pi
--------------------------------------------------------------------------
This installation of Open MPI was configured without support for
heterogeneous architectures, but at least one node in the allocation
was detected to have a different architecture. The detected node was:
 
Node: bioclust
 
In order to operate in a heterogeneous environment, please reconfigure
Open MPI with --enable-heterogeneous.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  ompi_proc_set_arch failed
  --> Returned "Not supported" (-8) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1239] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[... the same MPI_Init abort message repeated for each remaining rank on bioclust and kasimir ...]
--------------------------------------------------------------------------
mpirun has exited due to process rank 16 with PID 12678 on
node 10.4.1.23 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[bioclust:01236] 16 more processes have sent help message help-mpi-runtime / heterogeneous-support-unavailable
[bioclust:01236] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[bioclust:01236] 16 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
 
 
 


>>> Lenny Verkhovsky <lenny.verkhovsky@gmail.com> 17.11.2009 16:52 >>>
I noticed that you also have different versions of OMPI: 1.3.2 on node1 and 1.3 on node2.
Can you try to put the same version of OMPI on both nodes?
Can you also try running -np 16 on node1 when you run separately?
Lenny.

On Tue, Nov 17, 2009 at 5:45 PM, Laurin Müller <laurin.mueller@umit.at> wrote:


>>> Ralph Castain 11/17/09 4:04 PM >>>

>Your cmd line is telling OMPI to run 17 processes. Since your hostfile indicates that only 16 of them are to run on 10.4.23.107 (which I assume is your PS3 node?), 1 process is going to be run on 10.4.1.23 (I assume this is node1?).
node1 has 16 Cores (4 x AMD Quad Core Processors)

node2 is the ps3 with two processors (slots)


>I would guess that the executable is compiled to run on the PS3 given your specified path, so I would expect it to bomb on node1 - which is exactly what appears to be happening.
The executable is compiled on each node separately and resides at the same path on each node:

/mnt/projects/PS3Cluster/Benchmark/pi

Different directories are mounted on each node, so there is a separate executable, compiled locally, on each node.
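One quick way to confirm that each node really has its own native binary (the path is the one from this thread) is to check the file type on each node:

```shell
# On node1 this should report an x86-64 ELF executable;
# on the PS3 it should report a PowerPC (Cell) ELF executable.
file /mnt/projects/PS3Cluster/Benchmark/pi
```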

In the end I want to run R on this cluster with Rmpi. As I get a similar problem there, I first wanted to try it with a C program.

With R the same thing happens: it works when I start it on each node, but if I want to start more than 16 processes from node1, it exits.


On Nov 17, 2009, at 1:59 AM, Laurin Müller wrote:

Hi,
I want to build a cluster with Open MPI.
2 nodes:
node 1: 4 x AMD Quad Core, Ubuntu 9.04, Open MPI 1.3.2
node 2: Sony PS3, Ubuntu 9.04, Open MPI 1.3
Both can connect via ssh to each other and to themselves without a password.
I can run the sample program pi.c on both nodes separately (see below). But if I try to start it on node1 with the --hostfile option to use node2 remotely, I get this error:
cluster@bioclust:~$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 17 /mnt/projects/PS3Cluster/Benchmark/pi
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
My hostfile:
cluster@bioclust:~$ cat /etc/openmpi/openmpi-default-hostfile
10.4.23.107 slots=16
10.4.1.23 slots=2
I can see with top that the processors on node2 briefly start working, then it aborts on node1.
I use this sample/test program:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int i, n;
    double h, pi, x;
    int me, nprocs;
    double piece;

    /* --------------------------------------------------- */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    /* --------------------------------------------------- */
    if (me == 0)
    {
        printf("%s", "Input number of intervals:\n");
        scanf("%d", &n);
    }
    /* --------------------------------------------------- */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    /* --------------------------------------------------- */
    h = 1. / (double) n;
    piece = 0.;
    for (i = me + 1; i <= n; i += nprocs)
    {
        x = (i - 1) * h;
        piece = piece + (4 / (1 + (x) * (x)) + 4 / (1 + (x + h) * (x + h))) / 2 * h;
    }
    printf("%d: pi = %25.15f\n", me, piece);
    /* --------------------------------------------------- */
    MPI_Reduce(&piece, &pi, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    /* --------------------------------------------------- */
    if (me == 0)
    {
        printf("pi = %25.15f\n", pi);
    }
    /* --------------------------------------------------- */
    MPI_Finalize();
    return 0;
}
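For completeness, the program would be built on each node with the Open MPI C compiler wrapper, so that the binary matches the local architecture (the exact command is my assumption; the thread does not show the build step):

```shell
# Build the test program separately on each node with the Open MPI wrapper,
# so each node gets a binary for its own architecture (x86-64 vs. Cell/PPC)
mpicc pi.c -o pi
```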
It works on each node.
node1:
cluster@bioclust:~$ mpirun -np 4 /mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
20
0: pi = 0.822248040052981
2: pi = 0.773339953424083
3: pi = 0.747089984650041
1: pi = 0.798498008827023
pi = 3.141175986954128
node2:
cluster@kasimir:~$ mpirun -np 2 /mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
5
1: pi = 1.267463056905495
0: pi = 1.867463056905495
pi = 3.134926113810990
cluster@kasimir:~$
Thx in advance,
Laurin

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

