Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Resilience 2011
From: Ken Lloyd (kenneth.lloyd_at_[hidden])
Date: 2011-06-27 08:57:06


One point I've been trying to put forward in my domain is, currently,
high performance computing != high reliability computing. Not by a long
shot. Seems that they are orthogonally coupled.

There are many pieces to this problem-puzzle. Some of these pieces are
inter-related. Some of my work has dealt with adaptive approaches -
especially re: cascade, and what Ralph refers to as "rewiring", or
routing issues.

If and when I have anything I believe meaningful to contribute, I will.

On Mon, 2011-06-27 at 08:32 -0400, Josh Hursey wrote:

> It has been on my to-do list for a while to start a FAQ listing of the
> various resilience/FT related activities in and around Open MPI. This
> would provide a starting point that users and new developers could
> go to for an overview of each feature and how to activate and use
> it.
>
>
>
> I'll try to bump that up the priority list and post a message once it
> is ready. Probably a month or so off since I need to collect some
> information from various developers.
>
>
> -- Josh
>
>
> On Sun, Jun 26, 2011 at 6:01 PM, Ralph Castain <rhc_at_[hidden]>
> wrote:
>
> I think we're some ways away from declaring a "resilient
> ORTE". Josh and I have been committing pieces of it over the
> last two years, and Wes just committed another piece the other
> day that might have been titled "fault tolerant OOB" as it
> primarily addressed maintaining comm routing during node
> failures.
>
>
>
> Setting aside the obvious MPI issues, there are several
> branches/organizations working on different aspects of the ORTE
> problem, including:
>
>
> * fault prediction and proactive migration
>
>
> * mapping algorithms to minimize failure cascades
>
>
> * simultaneous failure handling
>
>
> * alternative wiring methods that eliminate the OOB routing
> issues
>
>
> etc. We expect most of those developments to arrive over the
> next 6-12 months. Once that has occurred, we'll probably be
> close to what we would call a "resilient" system.
>
>
> Until then, we are improving, but still far from "resilient".
>
>
>
>
>
> On Jun 24, 2011, at 10:24 AM, Ken Lloyd wrote:
>
>
>
> >
> > Josh and Wesley,
> >
> > Will you be presenting Resilient ORTE at Resilience 2011 in
> > Bordeaux?
> >
> > http://xcr.cenit.latech.edu/resilience2011/
> >
> > =====================
> > Kenneth A. Lloyd
> > CEO - Director of Systems Science
> > Watt Systems Technologies Inc.
> > www.wattsys.com
> > kenneth.lloyd_at_[hidden]
> >
> > This e-mail is covered by the Electronic Communications
> > Privacy Act, 18 U.S.C. 2510-2521 and is intended only for
> > the addressee named above. It may contain privileged or
> > confidential information. If you are not the addressee you
> > must not copy, distribute, disclose or use any of the
> > information in it. If you have received it in error please
> > delete it and immediately notify the sender.
> >
> >
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
>
>
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>