On Wed, Jan 30, 2013 at 3:02 AM, Ralph Castain <rhc@open-mpi.org> wrote:
If your node hardware is the problem, or you decide you do want/need to pursue an FT solution, then you might look at the OMPI-based solutions from parties such as http://fault-tolerance.org or the MPICH2 folks.
Finally, depending on the application, you may be interested in adding checkpoint-based fault tolerance at the application level with the help of a library such as SCR (http://sourceforge.net/projects/scalablecr/). You'll need to spend some time modifying the application source code, but it may work out better than system-level alternatives in the long run.
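
To give a rough idea of what "application level" means here, below is a minimal sketch in plain Python. It does not use SCR or MPI at all; the function names (save_checkpoint, load_checkpoint, run), the JSON file format, and the fail_at parameter used to simulate a crash are all made up for illustration. The point is only the pattern: the application decides what state matters, writes it out every so often, and picks up from the last saved copy when restarted.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so a crash mid-write
    # never leaves a half-written checkpoint under the final name.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Toy computation (sum of step indices) that resumes from a checkpoint.

    fail_at is a hypothetical hook that raises at a given step to
    simulate a node failure.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If the first run dies partway through, calling run() again with the same path redoes only the steps since the last checkpoint. In a real MPI job each rank would write its own piece of the state and the ranks would have to agree on which checkpoint is complete; as I understand it, that coordination (plus choosing where the files go) is the part a library like SCR is meant to take off your hands.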