Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Antw: Re: mpirun not working on more than one node
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-11-18 09:57:37


Bingo! This is why we ask for info on how you configure OMPI :-)

You need to rebuild OMPI with --enable-heterogeneous. Heterogeneous support is disabled by default because it adds overhead and very few people run mixed-architecture configurations.

On Nov 18, 2009, at 2:55 AM, Laurin Müller wrote:

> Now I have the same Open MPI version on both nodes: 1.3.2.
>
> Recalculated on both nodes, and it works again on each node separately:
>
> node1:
> cluster_at_bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --version
> mpirun (Open MPI) 1.3.2
> cluster_at_bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 4 /mnt/projects/PS3Cluster/Benchmark/pi
> Input number of intervals:
> 20
> 1: pi = 0.798498008827023
> 2: pi = 0.773339953424083
> 3: pi = 0.747089984650041
> 0: pi = 0.822248040052981
> pi = 3.141175986954128
> node2 (PS3):
> root_at_kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun --version
> mpirun (Open MPI) 1.3.2
> [...]
> root_at_kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun -np 2 pi
> Input number of intervals:
> 20
> 0: pi = 1.595587993477064
> 1: pi = 1.545587993477064
> pi = 3.141175986954128
> BUT when I start it on node1 with more than 16 processes and the hostfile, I get these errors:
> cluster_at_bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 17 /mnt/projects/PS3Cluster/Benchmark/pi
> --------------------------------------------------------------------------
> This installation of Open MPI was configured without support for
> heterogeneous architectures, but at least one node in the allocation
> was detected to have a different architecture. The detected node was:
>
> Node: bioclust
>
> In order to operate in a heterogeneous environment, please reconfigure
> Open MPI with --enable-heterogeneous.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_proc_set_arch failed
> --> Returned "Not supported" (-8) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1239] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1240] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1241] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1242] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1244] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1245] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1246] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1247] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1248] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1250] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1251] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1238] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [kasimir:12678] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1243] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1249] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1252] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1253] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 16 with PID 12678 on
> node 10.4.1.23 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [bioclust:01236] 16 more processes have sent help message help-mpi-runtime / heterogeneous-support-unavailable
> [bioclust:01236] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [bioclust:01236] 16 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>
>
>
>
>
> >>> Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]> 17.11.2009 16:52 >>>
> I noticed that you also have different versions of OMPI: 1.3.2 on node1 and 1.3 on node2.
> Can you try putting the same version of OMPI on both nodes?
> Can you also try running with -np 16 on node1 when you run on each node separately?
> Lenny.
>
> On Tue, Nov 17, 2009 at 5:45 PM, Laurin Müller <laurin.mueller_at_[hidden]> wrote:
>
>
> >>> Ralph Castain 11/17/09 4:04 PM >>>
>
> >Your cmd line is telling OMPI to run 17 processes. Since your hostfile indicates that only 16 of them are to run on 10.4.23.107 (which I assume is your PS3 node?), 1 process is going to be run on 10.4.1.23 (I assume this is node1?).
> node1 has 16 cores (4 x AMD quad-core processors).
>
> node2 is the PS3 with two processors (slots).
>
>
> >I would guess that the executable is compiled to run on the PS3 given your specified path, so I would expect it to bomb on node1 - which is exactly what appears to be happening.
> The executable is compiled on each node separately and sits in the same directory on each node:
>
> /mnt/projects/PS3Cluster/Benchmark/pi
> Different directories are mounted on each node, so each node has its own executable, compiled on that node.
>
> In the end I want to run R on this cluster with Rmpi. Since I get a similar problem there, I first wanted to try with a C program.
>
> The same thing happens with R: it works when I start it on each node, but if I try to start more than 16 processes on node1, it exits.
>
>
> On Nov 17, 2009, at 1:59 AM, Laurin Müller wrote:
>
>> Hi,
>> I want to build a cluster with Open MPI.
>> 2 nodes:
>> node 1: 4 x AMD Quad Core, ubuntu 9.04, openmpi 1.3.2
>> node 2: Sony PS3, ubuntu 9.04, openmpi 1.3
>> Both can connect via ssh to each other and to themselves without a password.
>> I can run the sample program pi.c on both nodes separately (see below). But if I try to start it on node1 with the --hostfile option to use node 2 "remote", I get this error:
>> cluster_at_bioclust:~$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 17 /mnt/projects/PS3Cluster/Benchmark/pi
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> my hostfile:
>> cluster_at_bioclust:~$ cat /etc/openmpi/openmpi-default-hostfile
>> 10.4.23.107 slots=16
>> 10.4.1.23 slots=2
>> I can see with top that the processors of node2 begin to work briefly, then it aborts on node1.
>> I use this sample/test program:
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include "mpi.h"
>> int main(int argc, char *argv[])
>> {
>>     int i, n;
>>     double h, pi, x;
>>     int me, nprocs;
>>     double piece;
>>     /* --------------------------------------------------- */
>>     MPI_Init (&argc, &argv);
>>     MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
>>     MPI_Comm_rank (MPI_COMM_WORLD, &me);
>>     /* --------------------------------------------------- */
>>     if (me == 0)
>>     {
>>         printf("%s", "Input number of intervals:\n");
>>         scanf ("%d", &n);
>>     }
>>     /* --------------------------------------------------- */
>>     MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>     /* --------------------------------------------------- */
>>     h = 1. / (double) n;
>>     piece = 0.;
>>     for (i=me+1; i <= n; i+=nprocs)
>>     {
>>         x = (i-1)*h;
>>         piece = piece + ( 4/(1+(x)*(x)) + 4/(1+(x+h)*(x+h))) / 2 * h;
>>     }
>>     printf("%d: pi = %25.15f\n", me, piece);
>>     /* --------------------------------------------------- */
>>     MPI_Reduce (&piece, &pi, 1, MPI_DOUBLE,
>>                 MPI_SUM, 0, MPI_COMM_WORLD);
>>     /* --------------------------------------------------- */
>>     if (me == 0)
>>     {
>>         printf("pi = %25.15f\n", pi);
>>     }
>>     /* --------------------------------------------------- */
>>     MPI_Finalize();
>>     return 0;
>> }
>> it works on each node.
>> node1:
>> cluster_at_bioclust:~$ mpirun -np 4 /mnt/projects/PS3Cluster/Benchmark/pi
>> Input number of intervals:
>> 20
>> 0: pi = 0.822248040052981
>> 2: pi = 0.773339953424083
>> 3: pi = 0.747089984650041
>> 1: pi = 0.798498008827023
>> pi = 3.141175986954128
>> node2:
>> cluster_at_kasimir:~$ mpirun -np 2 /mnt/projects/PS3Cluster/Benchmark/pi
>> Input number of intervals:
>> 5
>> 1: pi = 1.267463056905495
>> 0: pi = 1.867463056905495
>> pi = 3.134926113810990
>> cluster_at_kasimir:~$
>> Thx in advance,
>> Laurin
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users