Sorry for the late answer, we have quite a few things on our todo list right now. Here are few concerns I'm having about the proposed approach.

1. We would have preferred to have a list of processes for the ompi_errhandler_runtime_callback function. We don't necessary care about the error code, but having a list will allow us to move the notifications per bulk instead of one by one.

2. You made the registration of the callbacks ordered, and added special arguments to append or prepend callbacks to the list. Right now I can't figure out a good reason on how to use it especially that the order might be impose on the order the modules are loaded by the frameworks, thus not something we can easily control.

3. The callback list. The concept is useful, I don't know about the implementation. The current version doesn't support stopping the propagation of the error signal, which might be an issue in some cases. I can picture the fact that one level know about the issue, and know how to fix it, so the error does not need to propagate to other levels. This can be implemented in the old way interrupts were managed in DOS, with basically a simple _get / _set type of interface. If a callback wants to propagate the error it has first to retrieve the ancestor on the moment when it registered the callback and then explicitly calls it upon error.

Again, nothing major in the short term as it will take a significant amount of work to move the only user of such error handling capability (the FT prototype) back over the current version of the ORTE.


On Jul 3, 2013, at 06:45 , Ralph Castain <rhc@open-mpi.org> wrote:

**** NOTICE: This RFC modifies the MPI-RTE interface ****

WHAT: revise the RTE error handling to allow registration of callbacks upon RTE-detected errors

WHY: currently, the RTE aborts the process if an RTE-detected error occurs. This allows the upper layers (e.g., MPI) no chance to implement their own error response strategy, and it precludes allowing user-defined error handling.

TIMEOUT:  let's go for July 19th, pending further discussion

George and I were talking about ORTE's error handling the other day in regards to the right way to deal with errors in the updated OOB. Specifically, it seemed a bad idea for a library such as ORTE to be aborting the job on its own prerogative. If we lose a connection or cannot send a message, then we really should just report it upwards and let the application and/or upper layers decide what to do about it.

The current code base only allows a single error callback to exist, which seemed unduly limiting. So, based on the conversation, I've modified the errmgr interface to provide a mechanism for registering any number of error handlers (this replaces the current "set_fault_callback" API). When an error occurs, these handlers will be called in order until one responds that the error has been "resolved" - i.e., no further action is required. The default MPI layer error handler is specified to go "last" and calls mpi_abort, so the current "abort" behavior is preserved unless other error handlers are registered.

In the register_callback function, I provide an "order" param so you can specify "this callback must come first" or "this callback must come last". Seemed to me that we will probably have different code areas registering callbacks, and one might require it go first (the default "abort" will always require it go last). So you can append and prepend, or go first/last.

The errhandler callback function passes the name of the proc involved (which can be yourself for internal errors) and the error code. This is a change from the current fault callback which returned an opal_pointer_array of process names.

The work is available for review in my bitbucket:

I've attached the svn diff as well.

Appreciate your comments - nothing in concrete.

devel mailing list