That's a little bit strong - OMPI still supports checkpoint/restart as a fault tolerance mechanism. There really isn't anything the sys admin has to do, though - what is required is that users periodically order their programs to checkpoint so they can be restarted after a failure.
Checkpointing is typically done either by the app itself (say, when it reaches some point it feels is a good one to save), or using a script that just orders a checkpoint every so many seconds.
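For the script route, here is a minimal sketch in Python of what such a script could look like. It is not an official OMPI tool. It assumes an Open MPI build with checkpoint/restart support, a job launched with something like "mpirun -am ft-enable-cr", and the ompi-checkpoint command on the PATH. The function names, the interval, and the stop-on-failure behavior are my own illustrative choices.

```python
import subprocess
import time


def checkpoint_command(mpirun_pid):
    """Build the command line that asks a running job to checkpoint."""
    # Assumption: ompi-checkpoint is on PATH and takes the PID of mpirun.
    return ["ompi-checkpoint", str(mpirun_pid)]


def checkpoint_every(mpirun_pid, interval_s, rounds,
                     run=subprocess.run, sleep=time.sleep):
    """Request `rounds` checkpoints, waiting `interval_s` seconds before each.

    Returns the exit code of each request, in order. `run` and `sleep` are
    injectable so the loop can be exercised without a real MPI job.
    """
    exit_codes = []
    for _ in range(rounds):
        sleep(interval_s)
        result = run(checkpoint_command(mpirun_pid))
        exit_codes.append(result.returncode)
        if result.returncode != 0:
            # Stop on the first failed request (for example, the job has exited).
            break
    return exit_codes
```

Calling checkpoint_every(12345, 600, 6) would request a checkpoint every ten minutes for an hour, where 12345 stands in for the PID of mpirun. After a failure, the job would be restarted from the most recent snapshot with ompi-restart.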
What we have said is that we don't believe the FT "run thru failure" position pushed by UTK is particularly required at this time. That is partly a question of impact vs benefit, but mostly because competing approaches offer equivalent fault recovery capability with less impact. But that's a separate discussion.
On Jun 19, 2012, at 11:16 AM, George Bosilca wrote:
> It has been clearly stated that the official position pushed forward by a majority of the Open MPI developer community is that fault tolerance is not needed so we (read this as the official version of Open MPI) do not support it.
> However, a group of researchers have been working toward a version of Open MPI that supports the last fault tolerance proposal submitted for consideration to the MPI Forum. You can access it at https://bitbucket.org/jjhursey/ompi-ulfm-rts.
> On Jun 19, 2012, at 09:58, 陈松 wrote:
>> Hi all,
>> Can anyone explain the fault tolerance features in OpenMPI to me? I've read the FAQs and some of the papers on this topic listed on open-mpi.org, but I still can't figure out what the fault tolerance mechanism in OpenMPI does when one node of my supercomputer system fails during a computation, or what we system administrators should do after the failure (or before it).
>> Can anyone help me? My boss wants me to deploy OpenMPI on our system because he wants the fault tolerance feature.
>> Thanks very much.
>> CHEN Song
>> R&D Department
>> National Supercomputer Center in Tianjin
>> Binhai New Area, Tianjin, China
>> users mailing list