
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Strange IO issues in MPI_Finalize
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-03-18 16:16:01


Brian --

While I was on a plane today, I took a whack at making OMPI behave better when you forget to MPI_File_close() a file. Can you try this patch (should apply cleanly to OMPI trunk, v1.6, or v1.7):

    https://svn.open-mpi.org/trac/ompi/changeset/28177
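
To make the failure mode concrete, here is a minimal illustrative program (nothing in it comes from Brian's code -- the filename and flags are made up) that opens a file and never closes it before MPI_Finalize:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;

        MPI_Init(&argc, &argv);

        /* Open a file collectively... */
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* ...but forget the matching MPI_File_close(&fh).  Open MPI then
           tears the still-open file down inside MPI_Finalize, which is
           where the misleading MPI_File_set_errhandler abort comes from. */

        MPI_Finalize();
        return 0;
    }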

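On the pnetCDF side, the "missed a wait() or close()" suspicion in Brian's mail below would look roughly like the sketch here.  The ncmpi_* calls are real pnetCDF API, but the file name, variable, and single nonblocking put are purely illustrative (and error checking is omitted):

    #include <mpi.h>
    #include <pnetcdf.h>

    /* Each rank writes its slice of a 1-D variable with a nonblocking put,
       then completes and closes the file -- the two calls that are easy to
       forget.  Skip either one and the write can be deferred and the file
       left open when MPI_Finalize runs, which is exactly the symptom
       described in the quoted mail. */
    void write_field(MPI_Comm comm, const double *data,
                     MPI_Offset offset, MPI_Offset nlocal, MPI_Offset nglobal)
    {
        int ncid, dimid, varid, req, status;

        ncmpi_create(comm, "field.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "n", nglobal, &dimid);
        ncmpi_def_var(ncid, "field", NC_DOUBLE, 1, &dimid, &varid);
        ncmpi_enddef(ncid);

        ncmpi_iput_vara_double(ncid, varid, &offset, &nlocal, data, &req);

        ncmpi_wait_all(ncid, 1, &req, &status);  /* flush pending nonblocking I/O */
        ncmpi_close(ncid);                       /* must happen before MPI_Finalize */
    }
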
On Mar 18, 2013, at 12:42 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:

> I *believe* that this means that you didn't MPI_File_close a file.
>
> We're not giving a very helpful error message here (it's downright misleading, actually), but I'm pretty sure that this is the case.
>
>
> On Mar 6, 2013, at 10:28 AM, "Smith, Brian E." <smithbe_at_[hidden]> wrote:
>
>> Hi all,
>>
>> I have some code that uses parallel netCDF. It runs successfully on Titan (using the Cray MPICH derivative) and on my laptop (also running MPICH). However, when I run on one of our clusters running OMPI, the code barfs in MPI_Finalize() and doesn't write the complete/expected output files:
>>
>> [:17472] *** An error occurred in MPI_File_set_errhandler
>> [:17472] *** on a NULL communicator
>> [:17472] *** Unknown error
>> [:17472] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>> --------------------------------------------------------------------------
>> An MPI process is aborting at a time when it cannot guarantee that all
>> of its peer processes in the job will be killed properly. You should
>> double check that everything has shut down cleanly.
>>
>> Reason: After MPI_FINALIZE was invoked
>> Local host:
>> PID: 17472
>> --------------------------------------------------------------------------
>>
>> The stacks are:
>> PMPI_Finalize (pfinalize.c:46)
>> ompi_mpi_finalize (ompi_mpi_finalize.c:272)
>> ompi_file_finalize (file.c:196)
>> opal_obj_run_destructors (opal_object.h:448)
>> file_destructor (file.c:273)
>> mca_io_romio_file_close (io_romio_file_open.c:59)
>> PMPI_File_set_errhandler (pfile_set_errhandler.c:47)
>> ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:52)
>>
>> This is with OMPI 1.6.2. It is pnetCDF 1.3.1 on all 3 platforms.
>>
>> The code appears to have the right participants opening/closing the right files on the right communicators (a mixture of rank 0s on subcomms opening across their subcomms, and some nodes opening on MPI_COMM_SELF). It looks to me like some IO is getting delayed until MPI_Finalize(), which suggests that perhaps I missed a wait() or close() pnetCDF call.
>>
>> I don't necessarily think this is a bug in OMPI; I just don't know where to start looking in my code, since it works fine on the two different versions of MPICH.
>>
>> Thanks.
>>
>>
>>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/