Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Incorrect results with MPI-IO with OpenMPI v1.3.0 and beyond
From: E.T.A.vanderWeide_at_[hidden]
Date: 2010-04-20 04:16:22


Hi Scott,

I see the same behavior with the test program I posted a couple of days ago: it works fine in combination with OpenMPI v1.2, but it produces incorrect results with v1.3 and v1.4. I also agree with your suggestion that something is wrong with the offsets, because in my test program both processors 0 and 1 read the same data, while processor 1 should read the data stored after the data read by processor 0.
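
To make the offset expectation concrete, the per-rank read I have in mind is roughly the following sketch (the file name and the block size of 12 doubles are illustrative only, not my actual test program):

  // Sketch only: each rank reads `count' doubles starting right after the
  // blocks belonging to the lower-ranked processes.
  #include <mpi.h>
  #include <cstdio>
  #include <vector>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 12;                      // doubles per rank, illustrative
    std::vector<double> buf(count);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "mpi.out", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    // Rank r should see the block stored after the r blocks of ranks 0..r-1.
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_read_at_all(fh, offset, &buf[0], count,
                         MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    std::printf("rank %d read %g ... %g\n", rank, buf[0], buf[count - 1]);

    MPI_Finalize();
    return 0;
  }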

Regards,

Edwin van der Weide

-----Original Message-----
From: users-bounces_at_[hidden] on behalf of Samuel Collis
Sent: Mon 4/19/2010 10:06 PM
To: users_at_[hidden]
Subject: [OMPI users] Incorrect results with MPI-IO with OpenMPI v1.3.0 and beyond
 
Hi all,

Around a year ago, I posted the attached note regarding apparently incorrect file output when using OpenMPI >= 1.3.0. I was asked to generate a small, self-contained piece of code that demonstrates the issue; that code is attached to this posting (mpiio.cpp).

You can build this with:

  mpicxx mpiio.cpp -o mpiio
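
The attached mpiio.cpp is not reproduced in this archive; a rough sketch of the kind of test described here -- each rank shifts its file view to its own offset and writes its block of doubles with a collective MPI_File_write_all() -- could look like the following. The variable names, the factor of 12 doubles per element (inferred from the file sizes shown below), and the way elements are split across ranks are guesses, not the actual attachment:

  // Sketch only, not the attached mpiio.cpp: each rank fills a block of
  // globally increasing doubles and writes it through a file view shifted
  // to that rank's slot in mpi.out.
  #include <mpi.h>
  #include <cstdlib>
  #include <vector>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Optional element count on the command line, default 2; 12 doubles per
    // element is an assumption made to match the file sizes reported below.
    const int nelem = (argc > 1) ? std::atoi(argv[1]) : 2;
    const int count = nelem * 12 / size;       // doubles owned by this rank

    std::vector<double> buf(count);
    for (int i = 0; i < count; ++i)
      buf[i] = double(rank * count + i);       // 0, 1, 2, ... across all ranks

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "mpi.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Shift the view so this rank's block lands after the lower ranks' blocks.
    MPI_Offset disp = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    MPI_File_write_all(fh, &buf[0], count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
  }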

I then execute it with:

sh-3.2$ mpiexec -n 1 ~/dgm/src/mpiio; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

sh-3.2$ mpiexec -n 2 ~/dgm/src/mpiio; od -e mpi.out
0000000 1.200000000000000e+01 1.300000000000000e+01
0000020 1.400000000000000e+01 1.500000000000000e+01
0000040 1.600000000000000e+01 1.700000000000000e+01
0000060 1.800000000000000e+01 1.900000000000000e+01
0000100 2.000000000000000e+01 2.100000000000000e+01
0000120 2.200000000000000e+01 2.300000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

Note that the program should write out doubles 0-23, and on one processor this is
the case. However, for n=2 it incorrectly writes the second rank's data (doubles
12-23) over the first rank's data.

For larger problems it sometimes also drops information, i.e. one rank
doesn't write its data at all. I suspect the two problems are closely
related. To see this behavior, use 100 elements (instead of the default 2):

mpiexec -n 4 ~/dgm/src/mpiio 100; ls -l mpi.out
-rw-r----- 1 user user 2400 Apr 19 12:19 mpi.out

mpiexec -n 1 ~/dgm/src/mpiio 100; ls -l mpi.out
-rw-r----- 1 user user 9600 Apr 19 12:19 mpi.out

Note how the -n 4 file is too small: 2400 bytes is only a quarter of the
expected 9600 bytes, as if only one rank's data actually reached the file.

Note that with OpenMPI 1.2.7, I have verified that we get the correct
results:

$ mpiexec -n 1 mpiio; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

$ mpiexec -n 2 mpiio; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

Finally, just to confirm that this is OpenMPI related, I built the latest MPICH2,
with the following results:

$ ~/local/mpich2/bin/mpiexec -n 1 mpiio-mpich2; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

$ ~/local/mpich2/bin/mpiexec -n 2 mpiio-mpich2; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

Clearly something is wrong (perhaps with the file pointers/offsets). I hope this helps,

Scott

Subject: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1
From: Scott Collis (sscollis_at_[hidden])
Date: 2009-04-06 14:16:18

I have been a user of MPI-IO for 4+ years and have a code that has run
correctly with MPICH, MPICH2, and OpenMPI 1.2.*.

I recently upgraded to OpenMPI 1.3.1 and immediately noticed that my
MPI-IO-generated output files are corrupted. I have not yet had a
chance to debug this in detail, but it appears that MPI_File_write_all()
calls are not placing data correctly according to their file view when
running with more than one processor (everything is okay with -np 1).
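
For reference, the same per-rank placement can also be expressed with an explicit offset and MPI_File_write_at_all() instead of a file view; the sketch below only illustrates that pattern, with a made-up file name, block size, and buffer contents rather than code from my application:

  // Sketch only: the same per-rank placement with an explicit offset
  // instead of MPI_File_set_view().
  #include <mpi.h>
  #include <vector>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 12;                      // doubles per rank, made up
    std::vector<double> buf(count, double(rank));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "explicit.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank writes its block immediately after the lower ranks' blocks.
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, &buf[0], count,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
  }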

Note that I have observed the same incorrect behavior on both Linux
and OS X. I have also gone back and made sure that the same code
works with MPICH, MPICH2, and OpenMPI 1.2.*, so I'm fairly confident
that something has changed or broken as of OpenMPI 1.3.*. Just
today, I checked out the SVN repository version of OpenMPI, built it,
and tested my code with it; the results are incorrect, just as with
the 1.3.1 tarball.

While I plan to continue debugging this and will try to put together a
small test that demonstrates the issue, I thought I would first send
out this message to see whether it triggers any thoughts within the
OpenMPI development team as to where the issue might lie.

Please let me know if you have any ideas as I would very much
appreciate it!

Thanks in advance,

Scott

-- 
Scott Collis 
sscollis_at_[hidden]