
Open MPI User's Mailing List Archives


Subject: [OMPI users] Antw: Re: mpirun not working on more than one node
From: Laurin Müller (laurin.mueller_at_[hidden])
Date: 2009-11-18 04:55:45


Now I have the same Open MPI version, 1.3.2, on both nodes.
 
Recalculated on both nodes, and it works again on each node separately:
 
node1:
cluster_at_bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --version
mpirun (Open MPI) 1.3.2
cluster_at_bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile
/etc/openmpi/openmpi-default-hostfile -np 4
/mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
20
1: pi = 0.798498008827023
2: pi = 0.773339953424083
3: pi = 0.747089984650041
0: pi = 0.822248040052981
pi = 3.141175986954128
node2 (PS3):
root_at_kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun --version
mpirun (Open MPI) 1.3.2
[...]
root_at_kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun -np 2 pi
Input number of intervals:
20
0: pi = 1.595587993477064
1: pi = 1.545587993477064
pi = 3.141175986954128
BUT when I start it on node1 with more than 16 processes and the hostfile,
I get these errors:
cluster_at_bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile
/etc/openmpi/openmpi-default-hostfile -np 17
/mnt/projects/PS3Cluster/Benchmark/pi
--------------------------------------------------------------------------
This installation of Open MPI was configured without support for
heterogeneous architectures, but at least one node in the allocation
was detected to have a different architecture. The detected node was:
 
Node: bioclust
 
In order to operate in a heterogeneous environment, please reconfigure
Open MPI with --enable-heterogeneous.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  ompi_proc_set_arch failed
  --> Returned "Not supported" (-8) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1239] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1240] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1241] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1242] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1244] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1245] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1246] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1247] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1248] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1250] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1251] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1238] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[kasimir:12678] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1243] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1249] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1252] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1253] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 16 with PID 12678 on
node 10.4.1.23 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[bioclust:01236] 16 more processes have sent help message
help-mpi-runtime / heterogeneous-support-unavailable
[bioclust:01236] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages
[bioclust:01236] 16 more processes have sent help message
help-mpi-runtime / mpi_init:startup:internal-failure
 
 
 

>>> Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]> 17.11.2009 16:52 >>>
I noticed that you also have different versions of OMPI. You have 1.3.2
on node1 and 1.3 on node2.
Can you try to put the same version of OMPI on both nodes?
Can you also try running -np 16 on node1 when you run the nodes
separately?
Lenny.

On Tue, Nov 17, 2009 at 5:45 PM, Laurin Müller <laurin.mueller_at_[hidden]>
wrote:

>>> Ralph Castain 11/17/09 4:04 PM >>>
> Your cmd line is telling OMPI to run 17 processes. Since your hostfile
> indicates that only 16 of them are to run on 10.4.23.107 (which I
> assume is your PS3 node?), 1 process is going to be run on 10.4.1.23 (I
> assume this is node1?).

node1 has 16 cores (4 x AMD quad-core processors).

node2 is the PS3 with two processors (slots).

> I would guess that the executable is compiled to run on the PS3 given
> your specified path, so I would expect it to bomb on node1 - which is
> exactly what appears to be happening.

The executable is compiled on each node separately and lies on each
node in the same directory:
/mnt/projects/PS3Cluster/Benchmark/pi
Different directories are mounted on each node, so there is a
separate executable file compiled on each node.

In the end I want to run R on this cluster with Rmpi. As I get a
similar problem there, I first wanted to try with a C program.

With R the same thing happens: it works when I start it on each node, but
if I want to start more than 16 processes on node1 it exits.

On Nov 17, 2009, at 1:59 AM, Laurin Müller wrote:

Hi,
I want to build a cluster with Open MPI.
2 nodes:
node 1: 4 x AMD quad-core, Ubuntu 9.04, Open MPI 1.3.2
node 2: Sony PS3, Ubuntu 9.04, Open MPI 1.3
Both can connect to each other and to themselves via ssh without a password.
I can run the sample program pi.c on both nodes separately (see below).
But if I try to start it on node1 with the --hostfile option to use node 2
"remotely", I get this error:
cluster_at_bioclust:~$ mpirun --hostfile
/etc/openmpi/openmpi-default-hostfile -np 17
/mnt/projects/PS3Cluster/Benchmark/pi
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
my hostfile:
cluster_at_bioclust:~$ cat
/etc/openmpi/openmpi-default-hostfile
10.4.23.107 slots=16
10.4.1.23 slots=2
I can see with top that the processors of node2 start working briefly,
then the job aborts on node1.
I use this sample/test program:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int i, n;
    double h, pi, x;
    int me, nprocs;
    double piece;

    /* Start MPI; get the number of processes and this process's rank. */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    /* Only rank 0 reads the number of intervals from stdin. */
    if (me == 0)
    {
        printf("%s", "Input number of intervals:\n");
        scanf("%d", &n);
    }

    /* Send n from rank 0 to all other ranks. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Trapezoid rule for 4/(1+x^2) on [0,1]; each rank takes every
       nprocs-th interval, starting at interval me+1. */
    h = 1. / (double) n;
    piece = 0.;
    for (i = me + 1; i <= n; i += nprocs)
    {
        x = (i - 1) * h;
        piece = piece + (4 / (1 + (x) * (x)) + 4 / (1 + (x + h) * (x + h))) / 2 * h;
    }
    printf("%d: pi = %25.15f\n", me, piece);

    /* Add up the partial sums on rank 0. */
    MPI_Reduce(&piece, &pi, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (me == 0)
    {
        printf("pi = %25.15f\n", pi);
    }

    MPI_Finalize();
    return 0;
}
it works on each node.
node1:
cluster_at_bioclust:~$ mpirun -np 4
/mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
20
0: pi = 0.822248040052981
2: pi = 0.773339953424083
3: pi = 0.747089984650041
1: pi = 0.798498008827023
pi = 3.141175986954128
node2:
cluster_at_kasimir:~$ mpirun -np 2
/mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
5
1: pi = 1.267463056905495
0: pi = 1.867463056905495
pi = 3.134926113810990
cluster_at_kasimir:~$
Thx in advance,
Laurin

_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users


