
Subject: [OMPI users] Incorrect results with MPI-IO with OpenMPI v1.3.0 and beyond
From: Samuel Collis (sscollis_at_[hidden])
Date: 2010-04-19 16:06:05


Hi all,

About a year ago I posted the note included below, regarding apparently incorrect file output when using OpenMPI >= 1.3.0. I was asked to put together a small, self-contained piece of code that demonstrates the issue; that code (mpiio.cpp) is attached to this posting.
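
In case the attachment does not come through cleanly, the heart of the test is essentially the following pattern (a simplified sketch only, not the attached file itself; the real code has more structure, and the sizes and names here are illustrative). Every rank owns a contiguous block of doubles, places a file view at its own byte displacement, and writes its block with a collective MPI_File_write_all:

  #include <mpi.h>
  #include <vector>

  // Simplified sketch of what the test exercises (illustrative only): each
  // rank fills a contiguous block of doubles and writes it into a shared
  // file through a file view placed at the rank's byte displacement.
  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int total = 24;             // total doubles; default case writes 0-23
    const int chunk = total / size;   // doubles owned by this rank

    std::vector<double> buf(chunk);
    for (int i = 0; i < chunk; ++i)
      buf[i] = double(rank * chunk + i);   // rank 0: 0..11, rank 1: 12..23, ...

    char fname[]   = "mpi.out";
    char datarep[] = "native";

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank's view starts at its own byte displacement into the file.
    MPI_Offset disp = static_cast<MPI_Offset>(rank) * chunk * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, datarep, MPI_INFO_NULL);

    // Collective write of this rank's block, relative to its view.
    MPI_File_write_all(fh, &buf[0], chunk, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
  }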

You can build this with:

  mpicxx mpiio.cpp -o mpiio

I run it and dump the output file with:

sh-3.2$ mpiexec -n 1 ~/dgm/src/mpiio; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

sh-3.2$ mpiexec -n 2 ~/dgm/src/mpiio; od -e mpi.out
0000000 1.200000000000000e+01 1.300000000000000e+01
0000020 1.400000000000000e+01 1.500000000000000e+01
0000040 1.600000000000000e+01 1.700000000000000e+01
0000060 1.800000000000000e+01 1.900000000000000e+01
0000100 2.000000000000000e+01 2.100000000000000e+01
0000120 2.200000000000000e+01 2.300000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

Note that the program should write out the doubles 0-23, and on one processor it does. With -n 2, however, the second rank's data is written over top of the first rank's: the values 12-23 appear in both halves of the file, while 0-11 never make it to disk.

For larger problems it sometimes also drops information, i.e. some ranks' data never gets written at all. I suspect the two problems are closely related. To see this behavior, use 100 elements instead of the default 2:

mpiexec -n 4 ~/dgm/src/mpiio 100; ls -l mpi.out
-rw-r----- 1 user user 2400 Apr 19 12:19 mpi.out

mpiexec -n 1 ~/dgm/src/mpiio 100; ls -l mpi.out
-rw-r----- 1 user user 9600 Apr 19 12:19 mpi.out

Note how the -n 4 file is too small: 2400 bytes instead of the expected 9600, i.e. only a quarter of the data made it into the file.

Note that with OpenMPI 1.2.7, I have verified that we get the correct
results:

$ mpiexec -n 1 mpiio; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

$ mpiexec -n 2 mpiio; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

Finally, just to confirm that this is OpenMPI-related, I built the test against the latest MPICH2, with the following results:

$ ~/local/mpich2/bin/mpiexec -n 1 mpiio-mpich2; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300

$ ~/local/mpich2/bin/mpiexec -n 2 mpiio-mpich2; od -e mpi.out
0000000 0.000000000000000e+00 1.000000000000000e+00
0000020 2.000000000000000e+00 3.000000000000000e+00
0000040 4.000000000000000e+00 5.000000000000000e+00
0000060 6.000000000000000e+00 7.000000000000000e+00
0000100 8.000000000000000e+00 9.000000000000000e+00
0000120 1.000000000000000e+01 1.100000000000000e+01
0000140 1.200000000000000e+01 1.300000000000000e+01
0000160 1.400000000000000e+01 1.500000000000000e+01
0000200 1.600000000000000e+01 1.700000000000000e+01
0000220 1.800000000000000e+01 1.900000000000000e+01
0000240 2.000000000000000e+01 2.100000000000000e+01
0000260 2.200000000000000e+01 2.300000000000000e+01
0000300
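
One additional cross-check that I have not tried yet, but that might help localize the problem: bypass the file view entirely and write each rank's block at an explicit offset, for example by replacing the MPI_File_set_view / MPI_File_write_all pair in the sketch above with something like the following (again just a sketch; fh, buf, and chunk are as in the sketch above):

  // Cross-check idea: skip the file view and use an explicit byte offset.
  // If this path writes a correct file while set_view + write_all does not,
  // the problem is presumably in how the collective write applies the view.
  MPI_Offset offset = static_cast<MPI_Offset>(rank) * chunk * sizeof(double);
  MPI_File_write_at_all(fh, offset, &buf[0], chunk, MPI_DOUBLE,
                        MPI_STATUS_IGNORE);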

Clearly something is wrong, perhaps with the file pointers/offsets. I hope this helps,

Scott

Subject: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1
From: Scott Collis (sscollis_at_[hidden])
Date: 2009-04-06 14:16:18

I have been a user of MPI-IO for 4+ years and have a code that has run correctly with MPICH, MPICH2, and OpenMPI 1.2.*.

I recently upgraded to OpenMPI 1.3.1 and immediately noticed that my MPI-IO-generated output files are corrupted. I have not yet had a chance to debug this in detail, but it appears that MPI_File_write_all() calls are not placing data correctly according to their file view when running with more than one process (everything is okay with -np 1).

Note that I have observed the same incorrect behavior on both Linux and OS X. I have also gone back and confirmed that the same code works with MPICH, MPICH2, and OpenMPI 1.2.*, so I am fairly confident that something changed or broke as of OpenMPI 1.3.*. Just today I checked out the SVN repository version of OpenMPI, built and tested my code against it, and the results are incorrect just as with the 1.3.1 tarball.

While I plan to continue debugging this and will try to put together a small test that demonstrates the issue, I thought I would first send out this message in case it triggers a thought within the OpenMPI development team as to where the problem might lie.

Please let me know if you have any ideas; I would very much appreciate it!

Thanks in advance,

Scott

-- 
Scott Collis 
sscollis_at_[hidden] 


  • application/octet-stream attachment: mpiio.cpp