
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] possible bug exercised by mpi4py
From: Lisandro Dalcin (dalcinl_at_[hidden])
Date: 2012-05-23 18:22:47


On 23 May 2012 19:04, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> Thanks for all the info!
>
> But still, can we get a copy of the test in C?  That would make it significantly easier for us to tell if there is a problem with Open MPI -- mainly because we don't know anything about the internals of mpi4py.
>

FYI, this test ran fine with previous (but recent, say 1.5.4)
Open MPI versions, but fails with 1.6. The test also runs fine with
MPICH2.

Sorry for the delay, but writing the test in C takes some time
compared to Python. It is also a bit tiring for me to recode my tests
in C every time a new issue shows up in code I'm confident about, but
I understand you really need something reproducible, so here it is.

Find attached a C version of the test. See the output below: the test
runs fine and shows the expected output for np=2,3,4,6,7, but something
funny happens for np=5.

[dalcinl_at_trantor tmp]$ mpicc allgather.c
[dalcinl_at_trantor tmp]$ mpiexec -n 2 ./a.out
[0] - [0] a
[1] - [0] a
[dalcinl_at_trantor tmp]$ mpiexec -n 3 ./a.out
[0] - [0] ab
[1] - [0] a
[2] - [1] a
[dalcinl_at_trantor tmp]$ mpiexec -n 4 ./a.out
[3] - [1] ab
[0] - [0] ab
[1] - [1] ab
[2] - [0] ab
[dalcinl_at_trantor tmp]$ mpiexec -n 6 ./a.out
[4] - [1] abc
[5] - [2] abc
[0] - [0] abc
[1] - [1] abc
[2] - [2] abc
[3] - [0] abc
[dalcinl_at_trantor tmp]$ mpiexec -n 7 ./a.out
[5] - [2] abc
[6] - [3] abc
[0] - [0] abcd
[1] - [1] abcd
[2] - [2] abcd
[3] - [0] abc
[4] - [1] abc
[dalcinl_at_trantor tmp]$ mpiexec -n 5 ./a.out
[trantor:13791] *** An error occurred in MPI_Allgatherv
[trantor:13791] *** on communicator
[trantor:13791] *** MPI_ERR_COUNT: invalid count argument
[trantor:13791] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpiexec has exited due to process rank 2 with PID 13789 on
node trantor exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[trantor:13786] 2 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[trantor:13786] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages

-- 
Lisandro Dalcin
---------------
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169