Hello to all,
I have also encountered a similar bug with MPI-IO
with Open MPI 1.3.1, reading a Code_Saturne preprocessed mesh file
(www.code-saturne.org). Reading the file can be done using 2 MPI-IO
modes, or one non-MPI-IO mode.
The first MPI-IO mode uses individual file pointers, and involves a
series of MPI_File_read_all calls with all ranks using the same view (for
record headers), interleaved with MPI_File_read_all calls with ranks using
different views (for record data, successive blocks being read by each
rank).
The second MPI-IO mode uses explicit file offsets, with
MPI_File_read_at_all instead of MPI_File_read_all.
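To make the access pattern above more concrete, here is a minimal sketch in C of how each rank's share of one record's data could be located in the file. The names block_t and rank_block, and the near-equal contiguous split, are assumptions made for illustration only; they are not taken from the Code_Saturne sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one rank's share of a record's data. */
typedef struct {
    size_t first_item;   /* index of the first item read by this rank */
    size_t n_items;      /* number of items read by this rank */
    size_t byte_offset;  /* absolute file offset of first_item, in bytes */
} block_t;

/*
 * Split n_items into n_ranks successive blocks of near-equal size
 * (assumed distribution): the first (n_items % n_ranks) ranks each
 * get one extra item. data_start is the byte offset at which the
 * record's data begins, i.e. just after its header.
 */
block_t rank_block(size_t data_start, size_t item_size,
                   size_t n_items, int n_ranks, int rank)
{
    assert(n_ranks > 0 && rank >= 0 && rank < n_ranks);

    size_t base  = n_items / (size_t)n_ranks;
    size_t extra = n_items % (size_t)n_ranks;
    size_t r     = (size_t)rank;

    block_t b;
    b.n_items     = base + (r < extra ? 1 : 0);
    b.first_item  = r * base + (r < extra ? r : extra);
    b.byte_offset = data_start + b.first_item * item_size;
    return b;
}
```

With individual file pointers, byte_offset could be passed as the displacement to MPI_File_set_view before a collective MPI_File_read_all of n_items elements. With explicit offsets and the default byte view, the same value could be passed directly as the offset to MPI_File_read_at_all.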
Both MPI-IO modes seem to work fine with Open MPI 1.2, MPICH2,
and variants on IBM Blue Gene/L and P, as well as Bull Novascale,
but with Open MPI 1.3.1, the data read seems to be corrupted on at least
one file using the individual file pointers approach (though it
works well using explicit offsets).
The bug does not appear in unit tests, and it only appears after several
records are read on the case that does fail (on 2 ranks), so to
reproduce it with a simple program, I would have to extract the exact
file access patterns from the exact case which fails, which would
require a few extra hours of work.
If the bug is not reproduced in a simpler manner first, I will try
to build a simple program reproducing the bug within a week or two,
but in the meantime, I just want to confirm Scott's observation
(hoping it is the same bug).
On Mon, 2009-04-06 at 16:03 -0400, users-request_at_[hidden] wrote:
> Date: Mon, 06 Apr 2009 12:16:18 -0600
> From: Scott Collis <sscollis_at_[hidden]>
> Subject: [OMPI users] Incorrect results with MPI-IO under OpenMPI
> To: users_at_[hidden]
> Message-ID: <B20E6603-EB8C-408F-83EF-B018D8A73660_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
> I have been a user of MPI-IO for 4+ years and have a code that has run
> correctly with MPICH, MPICH2, and OpenMPI 1.2.*
> I recently upgraded to OpenMPI 1.3.1 and immediately noticed that my
> MPI-IO generated output files are corrupted. I have not yet had a
> chance to debug this in detail, but it appears that
> MPI_File_write_all() commands are not placing information correctly on
> their file_view when running with more than 1 processor (everything is
> okay with -np 1).
> Note that I have observed the same incorrect behavior on both Linux
> and OS-X. I have also gone back and made sure that the same code
> works with MPICH, MPICH2, and OpenMPI 1.2.* so I'm fairly confident
> that something has been changed or broken as of OpenMPI 1.3.*. Just
> today, I checked out the SVN repository version of OpenMPI and built
> and tested my code with that and the results are incorrect just as for
> the 1.3.1 tarball.
> While I plan to continue to debug this and will try to put together a
> small test that demonstrates the issue, I thought that I would first
> send out this message to see if this might trigger a thought within
> the OpenMPI development team as to where this issue might be.
> Please let me know if you have any ideas as I would very much
> appreciate it!
> Thanks in advance,
> Scott Collis