On Wed, Dec 1, 2010 at 8:28 AM, Rob Latham <robl_at_[hidden]> wrote:
> On Mon, Nov 22, 2010 at 04:40:14PM -0700, James Overfelt wrote:
>> I have a small test case where a file created with MPI_File_open
>> is still open at the time MPI_Finalize is called. In the actual
>> program there are lots of open files and it would be nice to avoid the
>> resulting "Your MPI job will now abort." by either having MPI_Finalize
>> close the files or honor the error handler and return an error code
>> without an abort.
>> I've tried this with OpenMPI 1.4.3 and 1.5 with the same results.
>> Attached are the configure, compile and source files and the whole
>> program follows.
> under MPICH2, this simple test program does not abort. You leak a lot
> of resources (e.g. the allocated info structure is never freed), but it
> sounds like you are well aware of that.
> under openmpi, this test program fails because openmpi is trying to
> help you out. I'm going to need some help from the openmpi folks
> here, but the backtrace makes it look like MPI_Finalize sets the
> "no more mpi calls allowed" flag and then goes and calls some mpi
> routines to clean up the opened files:
> Breakpoint 1, 0xb7f7c346 in PMPI_Barrier () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> (gdb) where
> #0 0xb7f7c346 in PMPI_Barrier () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #1 0xb78a4c25 in mca_io_romio_dist_MPI_File_close () from /home/robl/work/soft/openmpi-1.4/lib/openmpi/mca_io_romio.so
> #2 0xb787e8b3 in mca_io_romio_file_close () from /home/robl/work/soft/openmpi-1.4/lib/openmpi/mca_io_romio.so
> #3 0xb7f591b1 in file_destructor () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #4 0xb7f58f28 in ompi_file_finalize () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #5 0xb7f67eb3 in ompi_mpi_finalize () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #6 0xb7f82828 in PMPI_Finalize () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #7 0x0804f9c2 in main (argc=1, argv=0xbfffed94) at file_error.cc:17
> Why is there an MPI_Barrier in the close path? It has to do with our
> implementation of shared file pointers. If you run this test on a file system
> that does not support shared file pointers ( PVFS, for example), you might get
> a little further.
> So, I think the ball is back in the OpenMPI court: they have to
> re-jigger the order of the destructors so that closing files comes a
> little earlier in the shutdown process.
Thank you, that is the answer I was hoping for: I'm not crazy and
it should be an easy fix. I'll look through the OpenMPI source code
and maybe suggest a fix.