Lately I've been reading lots of papers about fault tolerance for MPI
applications. It all seemed very nice and clear. But as soon as I moved
from reading to testing, I was surprised that I could not find any
implementations. The best I could find was the option of manually
checkpointing and restarting the application. No checkpoint protocol, no
checkpoint manager, no recovery protocol.
Could you please point me to a user-transparent fault tolerance
implementation for MPI applications?
Thanks a lot,