Open MPI

Title: Approaches for Parallel Applications Fault Tolerance

Author(s):

Richard Graham

Abstract:

As the complexity of high-performance computer systems increases or the level of end-to-end engineering integration decreases, the likelihood of software or hardware failure increases. It is therefore important to handle these failures effectively in order to keep application mean-time-to-failure at levels acceptable to users.

This talk will describe the work done in the Open MPI collaboration to recover from several failure scenarios. This work builds on research already done in the context of the LA-MPI, FT-MPI, LAM/MPI, and PACX-MPI projects and deals with transient and catastrophic network errors, as well as several approaches to handling process failure. It will address how failures are detected, the mechanisms used to work around these failures and allow applications to continue running, and what level of support, if any, is needed from the application to successfully deploy these solutions. In addition, the performance impact of these solutions on several applications will be discussed.

Presented: Euro PVM/MPI 2006, September 2006, Bonn, Germany.

Paper:

euro-pvmmpi-2006-app-ft.pdf (PDF)