Thanks for the feedback. More below:

Are there any MPI implementations that meet the following requirements?

1. It doesn't terminate the whole job when a node dies.

2. It allows a spare node to replace the dead node and take over its work.

As far as I know, FT-MPI meets both requirements, but it hasn't been updated since 2004. Open MPI is said to combine several projects, including FT-MPI, but so far it only provides checkpoint/restart as a means of fault tolerance.
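
To make the two requirements concrete, here is a minimal sketch in plain Python. It only simulates the behavior I am asking about and makes no MPI calls; every name in it (run_job, run_task, NodeDead, the node labels) is hypothetical and is not part of FT-MPI or Open MPI.

```python
# Pure-Python simulation of the two requirements; no MPI is involved.
# All names here are hypothetical and chosen only for illustration.

class NodeDead(Exception):
    """Raised when a simulated node fails while running a task."""


def run_task(node, task, dead_nodes):
    # A node listed in dead_nodes fails on any task it is given.
    if node in dead_nodes:
        raise NodeDead(node)
    return task * task  # stand-in for real work


def run_job(tasks, active, spares, dead_nodes):
    """Run every task, replacing a failed node with a spare instead of aborting."""
    active = list(active)
    spares = list(spares)
    results = {}
    for i, task in enumerate(tasks):
        slot = i % len(active)  # fixed slot ("rank") that owns this task
        while True:
            try:
                results[task] = run_task(active[slot], task, dead_nodes)
                break
            except NodeDead:
                # Requirement 1: a dead node does not end the job by itself.
                if not spares:
                    raise RuntimeError("no spare nodes left; job must stop")
                # Requirement 2: a spare takes over the dead node's slot and work.
                active[slot] = spares.pop(0)
    return results, active


results, final_nodes = run_job(
    tasks=[1, 2, 3, 4],
    active=["n0", "n1"],
    spares=["s0"],
    dead_nodes={"n1"},
)
print(results)
print(final_nodes)
```

In this sketch the job finishes even though n1 is dead, because s0 takes over n1's slot and reruns the task that failed. A real implementation would also have to detect the failure and repair the communicator, which the simulation skips entirely.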

Best Regards
Rui
 
2010/6/29 Jeff Squyres <jsquyres@cisco.com>
On Jun 29, 2010, at 3:44 AM, 王睿 wrote:

> 1, suppose an MPI program involves several nodes; if one node dies, will the program terminate?

Open MPI will terminate the whole job, yes.

> 2, Is there any way to extend or shrink the size of an MPI communicator? If so, could we use a spare node to replace the dead node?

Currently, no.

Fault tolerance and resiliency are active topics of research and discussion in the MPI-3 Forum. But for the moment, most MPI implementations -- including Open MPI -- have fairly draconian responses to the loss of a process and/or node (i.e., kill the rest of the job).

--
Jeff Squyres
jsquyres@cisco.com