One point I've been trying to put forward in my domain is that, currently, high performance computing != high reliability computing. Not by a long shot. It seems that they are orthogonally coupled.

There are many pieces to this problem-puzzle, and some of them are inter-related. Some of my work has dealt with adaptive approaches, especially regarding failure cascades and what Ralph refers to as "rewiring", or routing issues.

If and when I have anything I believe meaningful to contribute, I will.

On Mon, 2011-06-27 at 08:32 -0400, Josh Hursey wrote:
It has been on my to-do list for a while to start a FAQ listing of the various resilience/FT related activities in and around Open MPI. This would provide a starting point that users and new developers could go to for an overview of each feature and how to activate and use it.

I'll try to bump that up the priority list and post a message once it is ready. Probably a month or so off since I need to collect some information from various developers.

-- Josh

On Sun, Jun 26, 2011 at 6:01 PM, Ralph Castain wrote:
I think we're some ways away from declaring a "resilient ORTE". Josh and I have been committing pieces of it over the last two years, and Wes just committed another piece the other day that might have been titled "fault tolerant OOB" as it primarily addressed maintaining comm routing during node failures.

Setting aside the obvious MPI issues, there are several branches/organizations working on different aspects of the ORTE problem, including:

* fault prediction and proactive migration

* mapping algorithms to minimize failure cascades

* simultaneous failure handling

* alternative wiring methods that eliminate the OOB routing issues

etc. We expect most of those developments to arrive over the next 6-12 months. Once that has occurred, we'll probably be close to what we would call a "resilient" system.

Until then, we are improving, but still far from "resilient".

On Jun 24, 2011, at 10:24 AM, Ken Lloyd wrote:

Josh and Wesley,

Will you be presenting Resilient ORTE at Resilience 2011 in Bordeaux?

Kenneth A. Lloyd
CEO - Director of Systems Science
Watt Systems Technologies Inc.


devel mailing list


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
