I think we're still some way from declaring a "resilient ORTE". Josh and I have been committing pieces of it over the last two years, and Wes just committed another piece the other day that might have been titled "fault tolerant OOB", as it primarily addressed maintaining communication routing during node failures.
Setting aside the obvious MPI issues, there are several branches/organizations working on different aspects of the ORTE problem, including:
* fault prediction and proactive migration
* mapping algorithms to minimize failure cascades
* simultaneous failure handling
* alternative wiring methods that eliminate the OOB routing issues
and others. We expect most of those developments to arrive over the next 6-12 months. Once they have, we'll probably be close to what we would call a "resilient" system.
Until then, we are improving, but still far from "resilient".
On Jun 24, 2011, at 10:24 AM, Ken Lloyd wrote:
> Josh and Wesley,
> Will you be presenting Resilient ORTE at Resilience 2011 in Bordeaux?
> Kenneth A. Lloyd
> CEO - Director of Systems Science
> Watt Systems Technologies Inc.