Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-03-23 17:29:12


This has been committed in r22872.

Let me know if you see any problems with the commit.

-- Josh

On Mar 23, 2010, at 7:57 AM, Joshua Hursey wrote:

> Just a reminder that this RFC will go into the trunk this evening
> unless there are strong objections.
>
> We intend to let this soak for a few days then bring it over to the
> 1.5 series (after the 1.5.0 release).
>
> -- Josh
>
> On Mar 15, 2010, at 9:26 AM, Josh Hursey wrote:
>
>> (Updated RFC, per offline discussion)
>>
>> WHAT: Merge a tmp branch for fault recovery development into the
>> OMPI trunk
>>
>> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault
>> recovery capabilities
>>
>> WHERE: Impacts a number of ORTE files and the ORTE ErrMgr framework
>>
>> TIMEOUT: Barring objections and/or further requests for delay,
>> evening of March 23
>>
>> REFERENCE BRANCH: http://bitbucket.org/jjhursey/orte-errmgr/
>>
>> ======================================================================
>>
>> BACKGROUND:
>>
>> Josh and Ralph have been working on a private branch off of the
>> trunk on extended fault recovery procedures, mostly impacting ORTE.
>> The new code optionally allows ORTE to recover from failed nodes,
>> moving processes to other nodes in order to maintain operation. In
>> addition, the code provides better support for recovering from
>> individual process failures.
>>
>> Not all of the work done on the private branch will be brought over
>> in this commit. Some of the MPI-specific code that allows recovery
>> from process failure on-the-fly will be committed separately at a
>> later date. This commit provides the foundation for ORTE
>> stabilization that can be built upon to provide OMPI layer
>> stability in the future.
>>
>> This commit significantly modifies the ORTE ErrMgr framework to
>> support those advanced recovery operations. The ErrMgr public
>> interface has been preserved since it is used in various places
>> throughout the codebase, and should continue to be used as normal.
>> The ErrMgr framework has been internally redesigned to better
>> support multiple strategies for responding to failures (represents
>> a merge of the old ErrMgr and the RecoS framework, into the ErrMgr
>> 3.0 component interface). The default (base) mode will continue to
>> work exactly the same as today, aborting the job when a failure
>> occurs. However, if the user elects to enable recovery then one or
>> more ErrMgr components will be activated to determine the recovery
>> policy for the job.
>>
>> We have created a public repo (reference branch, above) with the
>> code to be merged into the trunk (r22815). Please feel free to
>> check it out and test it.
>>
>> NOTE: The new recovery capability is only active if the user elects
>> to use it by setting the MCA parameter errmgr_base_enable_recovery
>> to '1'.
>>
>> NOTE: More ErrMgr recovery components will be coming online in the
>> near future. Currently, this branch includes only the 'orcm' module
>> for ORTE process recovery (not MPI processes). If you want to
>> experiment with this feature, below are the MCA parameters that you
>> will need to get started.
>> #################################
>> plm=rsh
>> rmaps=resilient
>> routed=cm
>> errmgr_base_enable_recovery=1
>> #################################
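
The parameter set above can be supplied to Open MPI in more than one way. The following is a minimal sketch, not part of the RFC, that renders the same four settings in three forms: `key=value` lines for an MCA parameter file (for example `$HOME/.openmpi/mca-params.conf`), `--mca <name> <value>` pairs for the `mpirun` command line, and `OMPI_MCA_<name>` environment variables. It assumes the standard Open MPI mechanisms for setting MCA parameters apply to this branch. The process count `4` and the program name `./my_app` are hypothetical placeholders.

```python
# Parameter set copied from the RFC text above.
RECOVERY_PARAMS = {
    "plm": "rsh",
    "rmaps": "resilient",
    "routed": "cm",
    "errmgr_base_enable_recovery": "1",
}


def as_param_file(params):
    """Render key=value lines, the format used in an MCA parameter file."""
    return "\n".join(f"{name}={value}" for name, value in params.items()) + "\n"


def as_mpirun_args(params):
    """Render the same settings as `--mca <name> <value>` argument pairs."""
    args = []
    for name, value in params.items():
        args += ["--mca", name, value]
    return args


def as_environment(params):
    """Render the same settings as OMPI_MCA_<name> environment variables."""
    return {f"OMPI_MCA_{name}": value for name, value in params.items()}


if __name__ == "__main__":
    # Contents suitable for an MCA parameter file.
    print(as_param_file(RECOVERY_PARAMS), end="")
    # Equivalent command line; "-np 4" and "./my_app" are hypothetical.
    command = ["mpirun", *as_mpirun_args(RECOVERY_PARAMS), "-np", "4", "./my_app"]
    print(" ".join(command))
```

Keeping the settings in one place and deriving each form from it avoids the three variants drifting apart while experimenting with the branch.
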
>>
>> Comments, suggestions, and corrections are welcome!
>>
>>
>>
>> On Mar 10, 2010, at 2:22 PM, Josh Hursey wrote:
>>
>>> Wesley,
>>>
>>> Thanks for catching that oversight. Below are the MCA parameters
>>> that you should need at the moment:
>>> #####################################
>>> # Use the C/R Process Migration Recovery Supervisor
>>> recos_base_enable=1
>>> # Only use the 'rsh' launcher, other launchers will be supported later
>>> plm=rsh
>>> # The resilient mapper knows how to use RecoS and deal with recovering procs
>>> rmaps=resilient
>>> # 'cm' component is the only one that can handle failures at the moment
>>> routed=cm
>>> #####################################
>>>
>>> Let me know if you have any troubles.
>>>
>>> -- Josh
>>>
>>> On Mar 10, 2010, at 10:36 AM, Wesley Bland wrote:
>>>
>>>> Josh,
>>>>
>>>> You mentioned some MCA parameters that you would include in the
>>>> email, but I don't see those parameters anywhere. Could you
>>>> please put those in here to make testing easier for people?
>>>>
>>>> Wesley
>>>>
>>>> On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey
>>>> <jjhursey_at_open-mpi.org> wrote:
>>>> Yesterday evening George, Thomas and I discussed some of their
>>>> concerns about this RFC at the MPI Forum meeting. After the
>>>> discussion, we seemed to be in agreement that the RecoS framework
>>>> is a good idea and the concepts and fixes in this RFC should move
>>>> forward with a couple of notes:
>>>>
>>>> - They wanted to test the branch a bit more over the next couple
>>>> of days. Some MCA parameters that you will need are at the bottom
>>>> of this message.
>>>>
>>>> - Reiterate that this RFC only addresses ORTE stability, not OMPI
>>>> stability. The OMPI stability extension is a second step for the
>>>> line of work, and should/will fit in nicely with the RecoS
>>>> framework being proposed in this RFC. The OMPI layer stability
>>>> will require a significant amount of work, but the RecoS
>>>> framework will provide the ORTE layer stability that is required
>>>> as a foundation for OMPI layer stability in the future.
>>>>
>>>> - The purpose of the ErrMgr becomes slightly unclear with the
>>>> addition of the RecoS framework, since both are focused on
>>>> responding to faults in the system (and RecoS, when enabled,
>>>> overrides most/all of the ErrMgr functionality). Should the RecoS
>>>> framework be merged with the ErrMgr framework to create a new
>>>> ErrMgr interface?
>>>>
>>>> We are trying to decide if we should merge these frameworks, and
>>>> at this point we are interested in hearing how other developers
>>>> feel about merging the ErrMgr and RecoS frameworks, which would
>>>> change the ErrMgr API. Are there any developers out there that
>>>> are developing ErrMgr components, or are using any particular
>>>> features of the existing ErrMgr framework that they would like to
>>>> see preserved in the next revision? The existing default abort
>>>> behavior of the ErrMgr framework will be preserved, so the user
>>>> will have to 'opt-in' to any fault recovery capabilities.
>>>>
>>>> So we are continuing the discussion a bit more off-list, and will
>>>> return to the list with an updated RFC (and possibly a new
>>>> branch) soon (hopefully end of the week/early next week). I would
>>>> like to briefly discuss this RFC at the Open MPI teleconf next
>>>> Tuesday.
>>>>
>>>> -- Josh
>>>>
>>>> On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote:
>>>>
>>>>> Sounds good to me.
>>>>>
>>>>> For those casually following this RFC let me summarize its
>>>>> current state.
>>>>>
>>>>> Josh and George (and anyone else attending the forum who wishes
>>>>> to participate) will meet sometime at the next MPI Forum
>>>>> meeting (March 8-10). I will post any relevant notes from this
>>>>> meeting back to the list afterwards. So the RFC is on hold
>>>>> pending the outcome of that meeting. For those developers
>>>>> interested in this RFC that will not be able to attend, feel
>>>>> free to continue using this thread for discussion.
>>>>>
>>>>> Thanks,
>>>>> Josh
>>>>>
>>>>> On Feb 26, 2010, at 6:09 AM, George Bosilca wrote:
>>>>>
>>>>>>
>>>>>> On Feb 26, 2010, at 01:50 , Josh Hursey wrote:
>>>>>>
>>>>>>> Any of those options are fine with me. I was thinking that if
>>>>>>> you wanted to talk sooner, we might be able to help explain
>>>>>>> our intentions with this framework a bit better. I figure that
>>>>>>> the framework interface will change a bit as we all advance
>>>>>>> and incorporate our various techniques into it. I think that
>>>>>>> the current interface is a good first step, but there are
>>>>>>> certainly many more steps to come.
>>>>>>>
>>>>>>> I am fine delaying this code a bit, just not too long. Meeting
>>>>>>> at the forum for a while might be a good option (we could
>>>>>>> probably even arrange to call in others if you wanted).
>>>>>>
>>>>>> Sounds good, let's do this.
>>>>>>
>>>>>> Thanks,
>>>>>> george.
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Josh
>>>>>>>
>>>>>>> On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>>> If Josh is going to be at the forum, perhaps you folks could
>>>>>>>> chat there? Might as well take advantage of being colocated,
>>>>>>>> if possible.
>>>>>>>>
>>>>>>>> Otherwise, I'm available pretty much any time. I can't
>>>>>>>> contribute much about the MPI recovery issues, but can
>>>>>>>> contribute to the RTE issues if that helps.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca
>>>>>>>> <bosilca_at_[hidden]> wrote:
>>>>>>>> Josh,
>>>>>>>>
>>>>>>>> Next week is a little bit too early, as we will need some time
>>>>>>>> to figure out how to integrate with this new framework, and to
>>>>>>>> what extent our code and requirements fit into it. Then the week
>>>>>>>> after is the MPI Forum. How about on Thursday 11 March?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> george.
>>>>>>>>
>>>>>>>> On Feb 25, 2010, at 12:46 , Josh Hursey wrote:
>>>>>>>>
>>>>>>>>> Per my previous suggestion, would it be useful to chat on
>>>>>>>>> the phone early next week about our various strategies?
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> devel_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel