At UTK we are developing two generic frameworks for scalable fault tolerance: one based on uncoordinated checkpoint/restart, the other on application-level recovery.
1) Uncoordinated C/R based on message logging. Such approaches are fully automatic, rely on an external checkpoint/restart mechanism (currently BLCR), and do not require any synchronization between processes. A failed process restarts independently and catches up with the others; during its recovery, the other processes continue their execution undisturbed. To our knowledge, the framework developed by UTK is currently used by two other teams to implement different uncoordinated mechanisms.
Redesigning the Message Logging Model for High Performance, Aurelien Bouteiller, G. Bosilca, and J. Dongarra, accepted in Concurrency and Computation: Practice and Experience, January 2010 (http://www.netlib.org/netlib/utk/people/JackDongarra/PAPERS/isc-cppe-final.pdf)
Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols, George Bosilca, Aurelien Bouteiller, Thomas Herault, Pierre Lemarinier, and Jack Dongarra, Euro MPI 2010 (http://icl.cs.utk.edu/news_pub/submissions/hpc-ml.pdf)
Reasons to be Pessimist or Optimist for Failure Recovery in High Performance Clusters, Aurelien Bouteiller, Thomas Ropars, George Bosilca, Christine Morin and Jack Dongarra, Cluster 2009 (http://www.netlib.org/netlib/utk/people/JackDongarra/PAPERS/msglog.final.pdf)
2) Application level. We developed a framework that allows distinct application-level responses to faults. In other words, the error is reported up to the application, which becomes responsible for handling it. The "still alive" processes in the MPI application, as well as the whole runtime system, remain fully functional and can continue their work without interruption. On top of this generic framework we implemented a method very similar to FT-MPI, with some additions (such as support for the MPI 2.0 standard).
Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems, Graham E. Fagg, Edgar Gabriel, George Bosilca, Thara Angskun, Zizhong Chen, Jelena Pjesivac-Grbovic, Kevin London and Jack J. Dongarra, Proceedings of the ISC2004 meeting Heidelberg, June 23, 2004. (http://www.netlib.org/utk/people/JackDongarra/PAPERS/isc2004-FT-MPI.pdf)
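The standard MPI 2.0 building block for this application-level model is the error handler: replacing the default abort-on-error behavior with MPI_ERRORS_RETURN hands errors back to the application. A small illustration using only standard MPI calls (it needs an MPI implementation to build and run; the UTK framework itself goes well beyond this):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* By default, MPI aborts the whole job on any error
     * (MPI_ERRORS_ARE_FATAL). Installing MPI_ERRORS_RETURN makes
     * errors come back as return codes, so the application decides
     * how to respond while the survivors keep running. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* A send to a peer that may have failed (or, here, an invalid
     * rank when run on a single process) now returns an error code
     * instead of killing the job. */
    int rc = MPI_Send(NULL, 0, MPI_INT, 1, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "send failed: %s -- application-level "
                        "recovery would go here\n", msg);
        /* e.g. re-route the work, rebuild the communicator, ... */
    }

    MPI_Finalize();
    return 0;
}
```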
Hope this helps,
On Apr 22, 2011, at 15:03 , Joshua Hursey wrote:
> On Apr 22, 2011, at 1:20 PM, N.M. Maclaren wrote:
>> On Apr 22 2011, Ralph Castain wrote:
>>> Several of us are. Josh and George (plus teammates), and some other outside folks, are working the MPI side of it.
>>> I'm working only the ORTE side of the problem.
>>> Quite a bit of capability is already in the trunk, but there is always more to do :-)
>> Is there a specification of what objectives are covered by 'fault-tolerant'?
> We do not really have a website to point folks to at the moment. Some of the existing and planned fault tolerance functionality in Open MPI has been announced and documented, but not uniformly or in a central place. We have a developers meeting in a couple of weeks, and this is a topic I am planning to cover:
> Once something is available, we'll post to the users/developers lists so that people know where to look for developments.
> I am responsible for two fault tolerance features in Open MPI: Checkpoint/Restart and MPI Forum's Fault Tolerance Working Group proposals. The Checkpoint/Restart support is documented here:
> Most of my attention is focused on the MPI Forum's Fault Tolerance Working Group proposals, which aim to enable fault-tolerant applications to be developed on top of MPI (i.e., non-transparent fault tolerance). The Open MPI prototype is not yet publicly available, but it will be soon. Information about the semantics and interfaces of that project can be found at the links below:
> That is what I have been up to regarding fault tolerance. Others can probably elaborate on what they are working on if they wish.
> -- Josh
>> Nick Maclaren.
Research Assistant Professor
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville