Thanks for the feedback. More below:
Are there any MPI implementations that meet the following requirements?
1, It doesn't terminate the whole job when a node dies.
2, It allows a spare node to replace the dead node and take over the work
of the dead node.
As far as I know, FT-MPI meets both requirements, but it hasn't been
updated since 2004. Open MPI is said to combine several projects, including
FT-MPI, but so far it only provides checkpoint/restart for fault tolerance.
2010/6/29 Jeff Squyres <jsquyres_at_[hidden]>
> On Jun 29, 2010, at 3:44 AM, Íõî£ wrote:
> > 1, Suppose an MPI program involves several nodes. If one node dies,
> > will the program terminate?
> Open MPI will terminate the whole job, yes.
> > 2, Is there any possibility to extend or shrink the size of an MPI
> > communicator? If so, can we use a spare node to replace the dead node?
> Currently, no.
> Fault tolerance and resiliency is an active topic of research and
> discussion in the MPI-3 forum. But for the moment, most MPI implementations
> -- including Open MPI -- have fairly draconian responses to the loss of a
> process and/or node (i.e., kill the rest of the job).
> Jeff Squyres