Thanks for the feedback. More below:
Are there any MPI implementations that meet the following requirements:
1, it doesn't terminate the whole job when a node dies?
2, it allows a spare node to replace the dead node and take over its work?
As far as I know, FT-MPI meets both requirements, but it hasn't been updated since 2004. Open MPI is said to combine several projects, including FT-MPI, but so far it only provides checkpoint/restart as a means of fault tolerance.
On Jun 29, 2010, at 3:44 AM, Íõî£ wrote:
> 1, suppose an MPI program involves several nodes; if one node dies, will the program terminate?
Open MPI will terminate the whole job, yes.
> 2, Is there any way to extend or shrink the size of an MPI communicator? If so, can we use a spare node to replace the dead node?
Fault tolerance and resiliency are active topics of research and discussion in the MPI-3 forum. But for the moment, most MPI implementations -- including Open MPI -- have fairly draconian responses to the loss of a process and/or node (i.e., they kill the rest of the job).