On Wed, Jan 30, 2013 at 3:02 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> If your node hardware is the problem, or you decide you do want/need to
> pursue an FT solution, then you might look at the OMPI-based solutions from
> parties such as http://fault-tolerance.org or the MPICH2 folks.
Just as Ralph said, you may look into alternatives. From what I have seen,
MPICH2 provides fault tolerance using BLCR.
The same goes for Intel's MPI (
http://software.intel.com/en-us/forums/topic/296300). Though not free, you
may try it during
a 30-day evaluation period (
It would be interesting to see how the two MPI implementations fare with respect to BLCR-based FT.
Another alternative which may be worth considering is DMTCP (
http://dmtcp.sourceforge.net/) from Northeastern University
for which there has been an interesting podcast recently (
Finally, depending on the application, you may be interested in adding
checkpoint-based fault tolerance at the application level with the help of
libraries such as SCR (http://sourceforge.net/projects/scalablecr/). Though
you'll need to spend some time modifying the application source code,
it may be better than system-level alternatives in the long run.
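
To make the application-level idea concrete, here is a minimal sketch of checkpoint/restart in plain Python. It does not use SCR's API or MPI; the function names (save_checkpoint, load_checkpoint, run), the JSON file format, and the fail_at parameter are all assumptions made up for illustration. The point is only the pattern: periodically write the state you need, write it atomically, and resume from the last good copy on restart.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename it.

    os.replace is atomic on the same filesystem, so a crash mid-write
    leaves the previous checkpoint untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, every=10, fail_at=None):
    """Toy computation (sum of 0..n_steps-1) that checkpoints every `every` steps.

    fail_at is a hypothetical hook that simulates a node failure.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the first run dies partway through, calling run() again with the same path picks up from the last checkpoint instead of step 0. In a real MPI code each rank would write its own file, and a library such as SCR would handle where those files go and how they are protected.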