On Thu, Aug 18, 2011 at 08:46:46AM -0700, Tom Rosmond wrote:
> We have a large Fortran application that can do its I/O with either
> MPI-IO or Fortran direct access. On a Linux workstation (16 AMD cores)
> running openmpi 1.5.3 and Intel fortran 12.0 we are having trouble with
> random failures with the mpi_io option which do not occur with
> conventional fortran direct access. We are using ext3 file systems, and
> I have seen some references hinting of similar problems with the
> ext3/mpiio combination. The application with the mpi_io option runs
> flawlessly on Cray architectures with Lustre file systems, so we are
> also suspicious of the ext3/mpiio combination. Does anyone else have
> experience with this combination that could shed some light on the
> problem, and hopefully some suggested solutions?
I'm glad to hear you're having success with mpi-io on Cray/Lustre.
That platform was a bit touchy for a while, but has gotten better over
the last two years.
My first guess would be that the file system on your Linux workstation
does not implement "strict enough" file locking. ROMIO relies on
"fcntl" locks to provide exclusive access to files at certain points
in the code.
Does your application use collective I/O? If you can swap Fortran
direct access and MPI-IO so easily, it sounds like perhaps you do not.
If there's a way to make collective MPI-IO calls, that will eliminate
many of the fcntl lock calls.
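As a sketch of what "collective" means here: the independent call
MPI_File_write_at becomes MPI_File_write_at_all, which every rank in the
communicator must call, letting ROMIO aggregate the accesses instead of
locking around each one. The file name, buffer size, and offsets below
are illustrative, not from your application:

    program collective_sketch
      use mpi
      implicit none
      integer :: fh, ierr, rank
      integer(kind=MPI_OFFSET_KIND) :: offset
      double precision :: buf(1000)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      buf = rank

      call MPI_File_open(MPI_COMM_WORLD, 'data.out', &
                         MPI_MODE_CREATE + MPI_MODE_WRONLY, &
                         MPI_INFO_NULL, fh, ierr)

      ! one contiguous block of doubles per rank
      offset = int(rank, MPI_OFFSET_KIND) * 8 * size(buf)

      ! collective: all ranks participate, so ROMIO can coordinate
      ! and aggregate the accesses rather than serialize with locks
      call MPI_File_write_at_all(fh, offset, buf, size(buf), &
                                 MPI_DOUBLE_PRECISION, &
                                 MPI_STATUS_IGNORE, ierr)

      call MPI_File_close(fh, ierr)
      call MPI_Finalize(ierr)
    end program collective_sketch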
Do you use MPI datatypes to describe either a file view or the
application data? Such noncontiguous-in-memory and/or
noncontiguous-in-file access patterns will also trigger fcntl lock
calls. You can use an MPI-IO hint to disable data sieving, though at a
potentially disastrous performance cost.
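Concretely, the data-sieving hints are the ROMIO-specific info keys
romio_ds_read and romio_ds_write; passing them at open time with the
value "disable" turns that optimization (and its locking) off. A
minimal sketch, with an illustrative file name:

    integer :: info, fh, ierr
    call MPI_Info_create(info, ierr)
    call MPI_Info_set(info, 'romio_ds_read',  'disable', ierr)
    call MPI_Info_set(info, 'romio_ds_write', 'disable', ierr)
    call MPI_File_open(MPI_COMM_WORLD, 'data.out', &
                       MPI_MODE_CREATE + MPI_MODE_RDWR, info, fh, ierr)
    call MPI_Info_free(info, ierr)

With data sieving disabled, noncontiguous requests fall back to many
small reads/writes, which avoids the fcntl locks but can be far slower.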
Mathematics and Computer Science Division
Argonne National Lab, IL USA