
Subject: Re: [OMPI users] possible bug exercised by mpi4py
From: Bennet Fauber (bennet_at_[hidden])
Date: 2012-05-23 21:12:16


On Wed, 23 May 2012, Lisandro Dalcin wrote:

> On 23 May 2012 19:04, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>> Thanks for all the info!
>>
>> But still, can we get a copy of the test in C?  That would make it significantly easier for us to tell if there is a problem with Open MPI -- mainly because we don't know anything about the internals of mpi4py.
>
> FYI, this test ran fine with previous (but recent, let's say 1.5.4)
> Open MPI versions, but fails with 1.6. The test also runs fine with
> MPICH2.

I compiled the C example Lisandro provided using openmpi/1.4.3 built
against the Intel 11.0 compilers, and it ran without error on the first
try. I then recompiled using gcc 4.6.2 and openmpi 1.4.4, and it produced
the following errors:

$ mpirun -np 5 a.out
[hostname:6601] *** An error occurred in MPI_Allgatherv
[hostname:6601] *** on communicator
[hostname:6601] *** MPI_ERR_COUNT: invalid count argument
[hostname:6601] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 6601 on
node hostname exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

I then recompiled with the Intel 11.0/openmpi 1.4.3 combination, and it
ran without error 10 out of 10 times.

I then recompiled with the gcc 4.6.2/openmpi 1.4.4 combination, and it
failed consistently.

On the second and subsequent tries, it produced the following additional
errors:

$ mpirun -np 5 a.out
[hostname:7168] *** An error occurred in MPI_Allgatherv
[hostname:7168] *** on communicator
[hostname:7168] *** MPI_ERR_COUNT: invalid count argument
[hostname:7168] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 7168 on
node hostname exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[hostname:07163] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[hostname:07163] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
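
(The last two lines are just the aggregation notice; rerunning with the
suggested MCA parameter set to 0, e.g.

$ mpirun -mca orte_base_help_aggregate 0 -np 5 a.out

shows all of the help/error messages rather than the aggregated summary.)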

Not sure if that information is helpful or not.

I am still completely puzzled as to why the number 5 is magic....
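
For anyone following along without the test handy, here is a rough sketch
of the kind of MPI_Allgatherv call I believe is involved. This is my own
reconstruction (each rank contributes a different number of ints, so some
of the counts are small or zero), not Lisandro's actual code, so the
details may well differ:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank sends `rank` ints, so rank 0 contributes nothing. */
    int sendcount = rank;
    int *sendbuf = malloc((sendcount > 0 ? sendcount : 1) * sizeof(int));
    for (i = 0; i < sendcount; i++)
        sendbuf[i] = rank;

    /* recvcounts[i] = i; displacements are the prefix sums. */
    int *recvcounts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (i = 0; i < size; i++) {
        recvcounts[i] = i;
        displs[i] = total;
        total += i;
    }
    int *recvbuf = malloc((total > 0 ? total : 1) * sizeof(int));

    MPI_Allgatherv(sendbuf, sendcount, MPI_INT,
                   recvbuf, recvcounts, displs, MPI_INT,
                   MPI_COMM_WORLD);

    if (rank == 0)
        printf("Allgatherv completed, %d elements gathered\n", total);

    free(sendbuf);
    free(recvbuf);
    free(recvcounts);
    free(displs);
    MPI_Finalize();
    return 0;
}

That sketch builds with mpicc and runs the same way as above
(mpirun -np 5 a.out).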

                         -- bennet