Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Resilience 2011
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-06-27 09:13:36


On Jun 27, 2011, at 6:57 AM, Ken Lloyd wrote:

> One point I've been trying to put forward in my domain is, currently, high performance computing != high reliability computing. Not by a long shot. Seems that they are orthogonally coupled.

I think that has been true in the past - an emerging community is trying to bring the two back together, but the tradeoffs do pose challenges. In some ways, the RTE part of the equation is more manageable than the MPI side, IMO.

>
> There are many pieces to this problem-puzzle. Some of these pieces are inter-related. Some of my work has dealt with adaptive approaches - especially re: cascade, and what Ralph refers to as "rewiring", or routing issues.

Most of my development is taking place in the embedded world re ORCM (an OMPI-related project based on ORTE). I try to port most of it back to the OMPI trunk, but have fallen woefully behind over the last six months or so. ORCM has recently started getting contributions from a couple of universities, one focused on prediction/migration and another on wireup, that should translate directly to OMPI.

There is some code already in the trunk re mapping to avoid failure cascades. In my "spare" time, I continue to work on it. Always open to exchanging ideas :-)

>
>
> If and when I have anything I believe meaningful to contribute, I will.
>
> On Mon, 2011-06-27 at 08:32 -0400, Josh Hursey wrote:
>> It has been on my to-do list for a while to start a FAQ listing the various resilience/FT-related activities in and around Open MPI. This would give users and new developers a starting point for an overview of each feature and how to activate and use it.
>>
>>
>> I'll try to bump that up the priority list and post a message once it is ready. Probably a month or so off since I need to collect some information from various developers.
>>
>>
>> -- Josh
>>
>> On Sun, Jun 26, 2011 at 6:01 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> I think we're some ways away from declaring a "resilient ORTE". Josh and I have been committing pieces of it over the last two years, and Wes just committed another piece the other day that might have been titled "fault tolerant OOB" as it primarily addressed maintaining comm routing during node failures.
>>
>>
>> Setting aside the obvious MPI issues, there are several branches/organizations working on different aspects of the ORTE problem, including:
>>
>> * fault prediction and proactive migration
>> * mapping algorithms to minimize failure cascades
>> * simultaneous failure handling
>> * alternative wiring methods that eliminate the OOB routing issues
>>
>> etc. We expect most of those developments to arrive over the next 6-12 months. Once that has occurred, we'll probably be close to what we would call a "resilient" system.
>>
>>
>> Until then, we are improving, but still far from "resilient".
>>
>> On Jun 24, 2011, at 10:24 AM, Ken Lloyd wrote:
>>
>>
>>>
>>> Josh and Wesley,
>>>
>>> Will you be presenting Resilient ORTE at Resilience 2011 in Bordeaux?
>>>
>>> http://xcr.cenit.latech.edu/resilience2011/
>>>
>>> =====================
>>> Kenneth A. Lloyd
>>> CEO - Director of Systems Science
>>> Watt Systems Technologies Inc.
>>> www.wattsys.com
>>> kenneth.lloyd_at_[hidden]
>>>
>>> This e-mail is covered by the Electronic Communications Privacy Act, 18 U.S.C. 2510-2521 and is intended only for the addressee named above. It may contain privileged or confidential information. If you are not the addressee you must not copy, distribute, disclose or use any of the information in it. If you have received it in error please delete it and immediately notify the sender.
>>>
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey