Title: A Checkpoint and Restart Service Specification for Open MPI
Author(s): Joshua Hursey, Jeffrey M. Squyres, Andrew Lumsdaine
Abstract:
HPC systems are growing in both complexity and size, increasing the
opportunity for system failures. Checkpoint and restart techniques are
one of many fault tolerance techniques developed for such adverse
runtime conditions. Because of the variety of available approaches for
checkpoint and restart, HPC system libraries, such as MPI, seeking to
incorporate these techniques would benefit greatly from a portable,
extensible checkpoint and restart framework. This paper presents a
specification for such a framework in Open MPI that allows for the
integration of a variety of checkpoint/restart systems and
protocols. The modular design of the framework allows researchers to
contribute to specialized areas without requiring knowledge of the
entirety of the code base.
Presented: Indiana University Computer Science tech report TR635
Paper:
Bibtex reference:
@techreport{Hursey-Open-MPI-CRS,
Address = {Bloomington, Indiana, USA},
Author = {Joshua Hursey and Jeffrey M. Squyres and Andrew Lumsdaine},
Institution = {Indiana University},
Month = {July},
Number = {TR635},
Title = {A Checkpoint and Restart Service Specification for Open MPI},
Url = {http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR635},
Year = {2006}
}