Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Resilient ORTE
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-06-09 13:21:08


So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
-------------
orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
-------------

Which is a callback that just calls abort (which is what we want to do
by default):
-------------
void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
    ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
}
-------------

This is what I want to replace. I do -not- want ompi to abort just
because a process failed. So I need a way to replace or remove this
callback, and put in my own callback that 'does the right thing'.

The current patch allows me to overwrite the callback when I call:
-------------
orte_errmgr.set_fault_callback(&my_callback);
-------------
Which is fine with me.

Once I no longer want my_callback to be active (say in
MPI_Finalize) I would like to restore the old callback. To do
so, with the patch's interface, I would have to know what the previous
callback was and do:
-------------
orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
-------------

This comes at a slight maintenance burden since now there will be two
places in the code that must explicitly reference
'ompi_errhandler_runtime_callback' - if it ever changed then both
sites would have to be updated.

If you use the 'sigaction-like' interface then upon registration I
would get the previous handler back (which would point to
'ompi_errhandler_runtime_callback'), and I can store it for later:
-------------
orte_errmgr.set_fault_callback(&my_callback, &prev_callback);
-------------

And when it comes time to deregister my callback all I need to do is
replace it with the previous callback - which I have a reference to,
but do not need the explicit name of (passing NULL as the second
argument tells the registration function that I don't care about the
current callback):
-------------
orte_errmgr.set_fault_callback(prev_callback, NULL);
-------------
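
For illustration, here is a minimal sketch of what such a 'sigaction-like'
registration could look like. Everything in it is hypothetical and not
taken from the patch: proc_name_t stands in for orte_process_name_t, and
set_fault_callback / report_fault are standalone functions rather than
real orte_errmgr module entries.

```c
#include <stddef.h>

/* Hypothetical stand-in for orte_process_name_t. */
typedef struct {
    int jobid;
    int vpid;
} proc_name_t;

/* Signature of a fault callback (assumed, modeled on the thread). */
typedef void (*fault_cb_t)(proc_name_t *proc);

/* The single active callback; NULL means "no callback installed". */
static fault_cb_t current_cb = NULL;

/*
 * Install new_cb as the active callback. If prev is non-NULL, the
 * callback being replaced is stored in *prev so the caller can
 * restore it later without knowing its name.
 */
void set_fault_callback(fault_cb_t new_cb, fault_cb_t *prev)
{
    if (prev != NULL) {
        *prev = current_cb;
    }
    current_cb = new_cb;
}

/* Invoke the active callback, if any, for a failed process. */
void report_fault(proc_name_t *proc)
{
    if (current_cb != NULL) {
        current_cb(proc);
    }
}

/* Counters so the behavior can be observed without really aborting. */
static int default_calls = 0;
static int custom_calls = 0;

/* Stands in for the default callback that would call abort. */
static void default_cb(proc_name_t *proc)
{
    (void)proc;
    default_calls++;
}

/* Stands in for a replacement callback that 'does the right thing'. */
static void custom_cb(proc_name_t *proc)
{
    (void)proc;
    custom_calls++;
}
```

With this shape, registering is set_fault_callback(custom_cb, &prev) and
restoring is set_fault_callback(prev, NULL); only one callback is ever
active, so the default abort path does not run while the replacement is
installed.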

So the API in the patch is fine, and I can work with it. I just
suggested that it might be slightly better to return the previous
callback (as is done in other standard interfaces - e.g., sigaction)
in case we wanted to do something with it later.

What seems to be proposed now is making the errmgr keep a list of all
registered callbacks and call them in some order. This seems odd, and
definitely more complex. Maybe it was just not well explained.
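
As I understand the list-based proposal (this is only my reading of it,
and all names below are made up for the sketch, not taken from any
patch), the dispatch would look roughly like this. Note that every
registered callback runs, so an abort-style default still fires even
after a second callback has been added:

```c
#include <stddef.h>

/* Hypothetical stand-in for orte_process_name_t. */
typedef struct {
    int jobid;
    int vpid;
} list_proc_name_t;

typedef void (*list_fault_cb_t)(list_proc_name_t *proc);

/* Fixed-size registry, kept in registration order (assumed design). */
#define MAX_FAULT_CBS 8
static list_fault_cb_t cb_list[MAX_FAULT_CBS];
static size_t cb_count = 0;

/* Append a callback; returns 0 on success, -1 if NULL or full. */
int add_fault_callback(list_fault_cb_t cb)
{
    if (cb == NULL || cb_count == MAX_FAULT_CBS) {
        return -1;
    }
    cb_list[cb_count++] = cb;
    return 0;
}

/* Call every registered callback in registration order. */
void dispatch_fault(list_proc_name_t *proc)
{
    for (size_t i = 0; i < cb_count; i++) {
        cb_list[i](proc);
    }
}

/* Flags so the effect can be observed without really aborting. */
static int abort_requested = 0;
static int recovery_ran = 0;

/* Stands in for the default callback that would call abort. */
static void abort_like_cb(list_proc_name_t *proc)
{
    (void)proc;
    abort_requested = 1;
}

/* Stands in for a callback that wants to handle the failure itself. */
static void recovery_cb(list_proc_name_t *proc)
{
    (void)proc;
    recovery_ran = 1;
}
```

After add_fault_callback(abort_like_cb) and add_fault_callback(recovery_cb),
one dispatch_fault() call sets both flags, which is exactly the situation
I need to avoid unless there is also a way to remove the default entry.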

Maybe that is just the "computer scientist" in me :)

-- Josh

On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> You mean you want the abort API to point somewhere else, without using a new
> component?
> Perhaps a telecon would help resolve this quicker? I'm available tomorrow or
> anytime next week, if that helps.
>
> On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey <jjhursey_at_[hidden]> wrote:
>>
>> As long as there is the ability to remove and replace a callback I'm
>> fine. I personally think that forcing the errmgr to track ordering of
>> callback registration makes it a more complex solution, but as long as
>> it works.
>>
>> In particular I need to replace the default 'abort' errmgr call in
>> OMPI with something else. If both are called, then this does not help
>> me at all - since the abort behavior will be activated either before
>> or after my callback. So can you explain how I would do that with the
>> current or the proposed interface?
>>
>> -- Josh
>>
>> On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> > I agree - let's not get overly complex unless we can clearly articulate a
>> > requirement to do so.
>> >
>> > On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>> >>
>> >> This will require exactly opposite registration and de-registration order,
>> >> or no de-registration at all (aka no way to unload a component). Or some
>> >> even more complex code to deal with internally.
>> >>
>> >> If the error manager handles the callbacks it can use the registration
>> >> ordering (which will be what the approach can do), and can enforce that
>> >> all callbacks will be called. I would rather prefer this approach.
>> >>
>> >>  george.
>> >>
>> >> On Jun 9, 2011, at 08:36 , Josh Hursey wrote:
>> >>
>> >> > I would prefer returning the previous callback instead of relying on
>> >> > the errmgr to get the ordering right. Additionally, when I want to
>> >> > unregister (or replace) a callback it is easier to do that with a
>> >> > single interface than to introduce a new one to remove a particular
>> >> > callback.
>> >> > Register:
>> >> >  ompi_errmgr.set_fault_callback(my_callback, prev_callback);
>> >> > Deregister:
>> >> >  ompi_errmgr.set_fault_callback(prev_callback, old_callback);
>> >> > or to eliminate all callbacks (if you needed that for some reason):
>> >> >  ompi_errmgr.set_fault_callback(NULL, old_callback);
>> >>
>> >>
>> >> _______________________________________________
>> >> devel mailing list
>> >> devel_at_[hidden]
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >
>> >
>> >
>>
>>
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>>
>
>
>

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey