Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
From: Wesley Bland (wbland_at_[hidden])
Date: 2010-03-10 13:36:54


Josh,

You mentioned some MCA parameters that you would include in the email, but I
don't see them anywhere. Could you please post them here to make testing
easier for people?

Wesley

On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey <jjhursey_at_[hidden]> wrote:

> Yesterday evening George, Thomas and I discussed some of their concerns
> about this RFC at the MPI Forum meeting. After the discussion, we seemed to
> be in agreement that the RecoS framework is a good idea and the concepts and
> fixes in this RFC should move forward with a couple of notes:
>
> - They wanted to test the branch a bit more over the next couple of days.
> Some MCA parameters that you will need are at the bottom of this message.
>
> - Reiterate that this RFC only addresses ORTE stability, not OMPI
> stability. The OMPI stability extension is a second step for the line of
> work, and should/will fit in nicely with the RecoS framework being proposed
> in this RFC. The OMPI layer stability will require a significant amount of
> work, but the RecoS framework will provide the ORTE layer stability that is
> required as a foundation for OMPI layer stability in the future.
>
> - The purpose of the ErrMgr becomes slightly unclear with the addition of
> the RecoS framework, since both are focused on responding to faults in the
> system (and RecoS, when enabled, overrides most/all of the ErrMgr
> functionality). Should the RecoS framework be merged with the ErrMgr
> framework to create a new ErrMgr interface?
>
> We are trying to decide if we should merge these frameworks, but at this
> point we are interested in hearing how other developers feel about merging
> the ErrMgr and RecoS frameworks, which would change the ErrMgr API. Are
> there any developers out there who are developing ErrMgr components, or are
> using any particular features of the existing ErrMgr framework that they
> would like to see preserved in the next revision? The existing default
> abort behavior of the ErrMgr framework will be preserved, so the
> user will have to 'opt-in' to any fault recovery capabilities.
>
> So we are continuing the discussion a bit more off-list, and will return to
> the list with an updated RFC (and possibly a new branch) soon (hopefully end
> of the week/early next week). I would like to briefly discuss this RFC at
> the Open MPI teleconf next Tuesday.
>
> -- Josh
>
> On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote:
>
> > Sounds good to me.
> >
> > For those casually following this RFC let me summarize its current state.
> >
> > Josh and George (and anyone else attending the forum who wishes to
> participate) will meet sometime at the next MPI Forum meeting (March 8-10). I will
> post any relevant notes from this meeting back to the list afterwards. So
> the RFC is on hold pending the outcome of that meeting. For those developers
> interested in this RFC that will not be able to attend, feel free to
> continue using this thread for discussion.
> >
> > Thanks,
> > Josh
> >
> > On Feb 26, 2010, at 6:09 AM, George Bosilca wrote:
> >
> >>
> >> On Feb 26, 2010, at 01:50 , Josh Hursey wrote:
> >>
> >>> Any of those options are fine with me. I was thinking that if you
> wanted to talk sooner, we might be able to help explain our intentions with
> this framework a bit better. I figure that the framework interface will
> change a bit as we all advance and incorporate our various techniques into
> it. I think that the current interface is a good first step, but there are
> certainly many more steps to come.
> >>>
> >>> I am fine delaying this code a bit, just not too long. Meeting at the
> forum for a while might be a good option (we could probably even arrange to
> call in others if you wanted).
> >>
> >> Sounds good, let's do this.
> >>
> >> Thanks,
> >> george.
> >>
> >>>
> >>> Cheers,
> >>> Josh
> >>>
> >>> On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote:
> >>>
> >>>> If Josh is going to be at the forum, perhaps you folks could chat
> there? Might as well take advantage of being colocated, if possible.
> >>>>
> >>>> Otherwise, I'm available pretty much any time. I can't contribute much
> about the MPI recovery issues, but can contribute to the RTE issues if that
> helps.
> >>>>
> >>>>
> >>>> On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca <bosilca_at_[hidden]>
> wrote:
> >>>> Josh,
> >>>>
> >>>> Next week is a little bit too early, as we will need some time to figure
> out how to integrate with this new framework, and to what extent our code
> and requirements fit into it. Then the week after is the MPI Forum. How about
> on Thursday 11 March?
> >>>>
> >>>> Thanks,
> >>>> george.
> >>>>
> >>>> On Feb 25, 2010, at 12:46 , Josh Hursey wrote:
> >>>>
> >>>>> Per my previous suggestion, would it be useful to chat on the phone
> early next week about our various strategies?
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> devel_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel