Fault Tolerant Open MPI Prototype


Website has moved

The website for the Open MPI prototype of the MPI Forum fault tolerance proposal has moved to:
http://fault-tolerance.org/

The remainder of this page is meant for archival purposes only.


About

This website provides early access to a fault tolerant version of the Open MPI implementation of the MPI standard. The fault tolerance technique provided by this implementation is being formulated by the MPI Forum's Fault Tolerance Working Group. Note that this is a different technique from the one taken in the FT-MPI project from the University of Tennessee - Knoxville (UTK).

The User-Level Failure Mitigation (ULFM) Run-Through Stabilization (RTS) proposal was developed by the MPI Forum's Fault Tolerance Working Group. The ULFM proposal focuses on maintaining the MPI environment through the failure of one or more processes in the MPI universe. An MPI application that wishes to take advantage of this capability must, at a minimum, change the default error handler on MPI_COMM_WORLD (and possibly other communicators) from MPI_ERRORS_ARE_FATAL to another error handler (e.g., MPI_ERRORS_RETURN). Further details regarding the use of this new capability are described in the proposal, linked below with the prototype.

Early access to the Open MPI fault tolerance development branch is provided to application developers so that they may start experimenting with the new interface. Development has focused on providing a correct interface, not necessarily a high-performance, scalable implementation (that will come later). Our intention is to merge this branch back into the mainline Open MPI trunk once it is ready.

Useful Links:

Active Developers

Please send any questions, comments, suggestions, or bug reports to any of the developers listed below. A Google search should turn up our email addresses.


©2010 - 2012 Josh Hursey