Title: Approaches for Parallel Applications Fault Tolerance
Author(s): Richard Graham
As the complexity of high-performance computer systems
increases or the level of end-to-end engineering integration
decreases, the likelihood of software or hardware failure increases.
It is therefore important to handle these failures effectively in
order to keep application mean time to failure at levels
acceptable to users.
This talk will describe the work done in the Open MPI collaboration
to recover from several failure scenarios. This work builds on research
already done in the context of the LA-MPI, FT-MPI, LAM/MPI, and
PACX-MPI projects and deals with transient and catastrophic network
errors, as well as several approaches to handling process failure. It
will address how failures are detected, the mechanisms used to work
around these failures and allow the applications to continue running,
and what level of support, if any, is needed from the application to
successfully deploy these solutions. In addition, the performance
impact of these solutions on several applications will be discussed.
Presented: Euro PVM/MPI 2006, September 2006, Bonn, Germany.