
Open MPI User's Mailing List Archives

Subject: [OMPI users] collective communications broken on more than 4 cores
From: Vincent Loechner (loechner_at_[hidden])
Date: 2009-10-29 09:57:08


Hello to the list,

I ran into a problem running a simple program with collective
communications on a 6-core processor (6 local MPI processes).
Calls to collective communications do not return for some MPI
processes when the number of processes is greater than or equal
to 5. It is reproducible on two different architectures, with two
different versions of Open MPI (1.3.2 and 1.3.3). It worked
correctly with Open MPI version 1.2.7.

I wrote a very simple test that makes 1000 calls to MPI_Barrier().
Running on an Istanbul processor (6-core AMD Opteron):
$ uname -a
Linux istanbool 2.6.31-14-generic #46-Ubuntu SMP Tue Oct 13 16:47:28 UTC 2009
x86_64 GNU/Linux
with the Ubuntu Open MPI package, version 1.3.2.
Running with 5 or 6 MPI processes, it hangs after a random number
of iterations (ranging from 3 to 600); sometimes it finishes
correctly (about 1 time out of 8). I simply ran:
'mpirun -n 6 ./testmpi'
The same behavior occurs with more MPI processes.

I tried the '--mca coll_basic_priority 50' option; the program then
has a better chance of finishing (about one time out of 2), but it
still deadlocks the rest of the time, again after a random number
of iterations.
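For completeness, the command I used when testing that option was of this form (the priority value 50 is just the one I happened to pick; any value above the tuned component's priority should have the same effect):

```shell
# Give the basic (non-tuned) collective component a higher priority,
# so it is selected instead of the coll_tuned component.
mpirun --mca coll_basic_priority 50 -n 6 ./testmpi
```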

Without setting the coll_basic_priority option, I ran a debugger and
found that some processes are blocked in:
#0 0x00007f858f272f7a in opal_progress () from /usr/lib/libopen-pal.so.0
#1 0x00007f858f7524f5 in ?? () from /usr/lib/libmpi.so.0
#2 0x00007f8589e74c5a in ?? ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#3 0x00007f8589e7cefa in ?? ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4 0x00007f858f767b32 in PMPI_Barrier () from /usr/lib/libmpi.so.0
#5 0x0000000000400c10 in main (argc=1, argv=0x7fff9d59acf8) at testmpi.c:24

and the others in:
#0 0x00007f05799e933a in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so
#1 0x00007f057dd22fba in opal_progress () from /usr/lib/libopen-pal.so.0
#2 0x00007f057e2024f5 in ?? () from /usr/lib/libmpi.so.0
#3 0x00007f0578924c5a in ?? ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4 0x00007f057892cefa in ?? ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#5 0x00007f057e217b32 in PMPI_Barrier () from /usr/lib/libmpi.so.0
#6 0x0000000000400c10 in main (argc=1, argv=0x7fff1b55b4a8) at testmpi.c:24
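For reference, the backtraces above were obtained roughly as follows, by attaching gdb to the hung processes (the PID is whatever 'ps' reports for a hung rank; 10466 is just an example from one of the runs below):

```shell
# Attach to one of the hung MPI processes and dump its stack,
# then detach without disturbing the run.
gdb -p 10466 -batch -ex bt
```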

It seems that other collective communications are broken as well:
my original program blocked after a call to MPI_Allreduce.
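A minimal variant of the test that goes through MPI_Allreduce instead of MPI_Barrier would look something like this (a sketch with the same structure as the testmpi.c listed below; I have not isolated the MPI_Allreduce hang in exactly this form, it is just the obvious reduction of my original program):

```c
/* testallreduce.c -- same loop as testmpi.c, but each iteration
 * performs an MPI_Allreduce summing the ranks instead of a barrier. */
#include <stdio.h>
#include <mpi.h>
#define MCW MPI_COMM_WORLD

int main( int argc, char **argv )
{
        int n, r; /* number of processes, process rank */
        int i, sum;

        MPI_Init( &argc, &argv );
        MPI_Comm_size( MCW, &n );
        MPI_Comm_rank( MCW, &r );

        for( i=0 ; i<1000 ; i++ )
        {
                /* every process contributes its rank; all receive the sum */
                MPI_Allreduce( &r, &sum, 1, MPI_INT, MPI_SUM, MCW );
                printf( "(%d) %d sum=%d\n", r, i, sum ); fflush(stdout);
        }

        MPI_Finalize();
        return( 0 );
}
```

Compiled and run the same way as testmpi.c ('mpicc -O2 -Wall -g testallreduce.c -o testallreduce', then 'mpirun -n 6 ./testallreduce').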

I also ran tests on a 4-core Intel Core i7 with Open MPI version
1.3.3, with exactly the same problem: calls to collective
communications do not return for some MPI processes when the number
of processes is greater than or equal to 5.

Below are some technical details on my configuration, the input
file, and example outputs. The output of 'ompi_info --all' is
attached to this mail.

Best regards,
--------------------------------------------------------------------------
 Vincent LOECHNER, PhD       | ICPS, LSIIT (UMR 7005),
 Phone: +33 (0)368 85 45 37  | Equipe INRIA CAMUS,
 Fax:   +33 (0)368 85 45 47  | Université de Strasbourg,
 loechner_at_[hidden]        | Pôle API, Bd. Sébastien Brant,
 http://icps.u-strasbg.fr    | F-67412 ILLKIRCH Cedex
--------------------------------------------------------------------------

Input program:
//-------------------- testmpi.c -----------------------------------
#include <stdio.h>
#include <mpi.h>
#define MCW MPI_COMM_WORLD

int main( int argc, char **argv )
{
        int n, r; /* number of processes, process rank */
        int i;

        MPI_Init( &argc, &argv );
        MPI_Comm_size( MCW, &n );
        MPI_Comm_rank( MCW, &r );

        for( i=0 ; i<1000 ; i++ )
        {
                printf( "(%d) %d\n", r, i ); fflush(stdout);
                MPI_Barrier( MCW );
        }

        MPI_Finalize();
        return( 0 );
}
//-------------------- testmpi.c -----------------------------------

Compilation line:
$ mpicc -O2 -Wall -g testmpi.c -o testmpi

GCC version:
$ mpicc --version
gcc (Ubuntu 4.4.1-4ubuntu7) 4.4.1

Open MPI version: 1.3.2
$ ompi_info -v ompi full
                 Package: Open MPI buildd_at_crested Distribution
                Open MPI: 1.3.2
   Open MPI SVN revision: r21054
   Open MPI release date: Apr 21, 2009
                Open RTE: 1.3.2
   Open RTE SVN revision: r21054
   Open RTE release date: Apr 21, 2009
                    OPAL: 1.3.2
       OPAL SVN revision: r21054
       OPAL release date: Apr 21, 2009
            Ident string: 1.3.2

--------------- example run (I hit ^C after a while)--------------------
$ mpirun -n 6 ./testmpi
(0) 0
(0) 1
(0) 2
(0) 3
(1) 0
(1) 1
(1) 2
(2) 0
(2) 1
(2) 2
(2) 3
(3) 0
(3) 1
(3) 2
(4) 0
(4) 1
(4) 2
(4) 3
(5) 0
(5) 1
(5) 2
(5) 3
^Cmpirun: killing job...

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10466 on node istanbool exited on
signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
6 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished

$ mpirun -n 6 ./testmpi
(0) 0
(0) 1
(0) 2
(0) 3
(0) 4
(0) 5
(0) 6
(0) 7
(0) 8
(0) 9
(1) 0
(1) 1
(1) 2
(1) 3
(1) 4
(1) 5
(1) 6
(1) 7
(1) 8
(1) 9
(2) 0
(2) 1
(2) 2
(2) 3
(2) 4
(2) 5
(2) 6
(2) 7
(2) 8
(2) 9
(3) 0
(3) 1
(3) 2
(3) 3
(3) 4
(3) 5
(3) 6
(3) 7
(3) 8
(4) 0
(4) 1
(4) 2
(4) 3
(4) 4
(4) 5
(4) 6
(4) 7
(4) 8
(4) 9
(5) 0
(5) 1
(5) 2
(5) 3
(5) 4
(5) 5
(5) 6
(5) 7
(5) 8
^Cmpirun: killing job...

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10473 on node istanbool exited on
signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
6 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished

--------------- end example run --------------------