
Open MPI Development Mailing List Archives


From: Aurelien Bouteiller (bouteill_at_[hidden])
Date: 2007-09-08 14:33:25

On 6 Sept. 07, at 09:27, Terry D. Dontje wrote:

> Gleb Natapov wrote:
>> On Thu, Sep 06, 2007 at 06:50:43AM -0600, Ralph H Castain wrote:
>>> WHAT: Decide upon how to handle MPI applications where one or more
>>> processes exit without calling MPI_Finalize
>>> WHY: Some applications can abort via an exit call instead of
>>> calling MPI_Abort when a library (or something else) calls
>>> exit. This situation is outside a user's control, so they
>>> cannot fix it.
>>> WHERE: Refer to ticket #1144 - code changes are TBD
>>> WHEN: Up to the group
>> [snip]
>>> Does the general community feel we should do anything here, or is this a
>>> "bug" that should be fixed by the entity calling "exit"? I should note that
>>> it actually is bad behavior (IMHO) for any library to call "exit" - but
>>> then, we do that in some situations too, so perhaps we shouldn't cast
>>> stones!
>>> Any suggested solutions or comments on whether or not we should do anything
>>> would be appreciated.
>> IMO (a) should be implemented.
> I don't think (b) should be implemented. However, one could register an
> atexit handler that calls MPI_Finalize. The exiting process would then
> be stuck until everyone else reaches their exit or their own finalize.
> That being said, I think (a) probably makes more sense and adheres to the
> MPI standard.
I agree (b) is not a good idea. However, I am not very pleased with (a)
either. It would rule out any process fault tolerance mechanism if we
go that way. If we plan to add failure detection and failure management
to the RTE (to keep Finalize from hanging), we should add the ability
to plug in FT-specific error handlers. The default error handler should
do exactly what Ralph proposes, but nowhere outside that handler should
the RTE code assume that the application is aborting when a failure
occurs. An FT application might not abort at all and recover instead.
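
To make the pluggable-handler idea concrete, here is a rough sketch in plain C. It is only an illustration of the shape such a hook could take: every name in it (rte_set_failure_handler, rte_report_failure, failure_action_t, ft_state_t, and so on) is hypothetical and is not existing Open MPI or RTE API. The point is that the abort policy lives in exactly one place, the default handler, and that an FT application can swap it out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* All names below are hypothetical; none of this is Open MPI or RTE API. */

/* What the runtime should do once a handler has looked at a failure. */
typedef enum { FAILURE_ABORT_JOB, FAILURE_CONTINUE } failure_action_t;

/* Signature of a pluggable failure handler. */
typedef failure_action_t (*failure_handler_fn)(int failed_rank, void *ctx);

/* Default policy: a process went away without finalizing, so abort the job. */
static failure_action_t default_failure_handler(int failed_rank, void *ctx)
{
    (void)ctx;
    fprintf(stderr, "process %d exited without finalizing; aborting job\n",
            failed_rank);
    return FAILURE_ABORT_JOB;
}

static failure_handler_fn current_handler = default_failure_handler;
static void *current_ctx = NULL;

/* Install an FT-specific handler; passing NULL restores the default. */
void rte_set_failure_handler(failure_handler_fn fn, void *ctx)
{
    current_handler = fn ? fn : default_failure_handler;
    current_ctx = fn ? ctx : NULL;
}

/* The only place the runtime asks "abort or not?". The rest of the
 * runtime acts on the returned action and assumes nothing itself. */
failure_action_t rte_report_failure(int failed_rank)
{
    return current_handler(failed_rank, current_ctx);
}

/* Example FT handler: note which process failed and keep the job alive. */
typedef struct {
    int failures_seen;
    int last_failed_rank;
} ft_state_t;

failure_action_t ft_recovering_handler(int failed_rank, void *ctx)
{
    ft_state_t *state = ctx;
    assert(state != NULL);
    state->failures_seen++;
    state->last_failed_rank = failed_rank;
    return FAILURE_CONTINUE;
}
```

An FT application would call rte_set_failure_handler(ft_recovering_handler, &state) once at startup. Applications that install nothing get the abort behavior unchanged.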


> --td