Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] about MPI
From: Rui (wangraying_at_[hidden])
Date: 2010-06-29 21:35:32


Thanks for the feedback. More below:

Are there any MPI implementations that meet the following requirements?

1. It doesn't terminate the whole job when a node dies.

2. It allows a spare node to replace the dead node and take over its work.

As far as I know, FT-MPI meets both requirements, but it hasn't been updated
since 2004. Open MPI is said to combine several projects, including FT-MPI,
but so far it only provides checkpoint/restart as a means of fault tolerance.
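
To make clear what I mean by checkpoint/restart, here is a minimal
application-level sketch in Python. It is only a generic illustration of the
idea (save progress periodically, resume from the last saved state after a
failure). It is not how Open MPI implements or exposes its checkpoint/restart
support, and the function name, the checkpoint file layout, and the `fail_at`
parameter are all made up for this example.

```python
import json
import os


def run_with_checkpoints(total_steps, ckpt_path, fail_at=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    `fail_at` is a hypothetical knob that simulates a failure just before
    the given step; it exists only for this illustration.
    """
    state = {"step": 0, "acc": 0}

    # Restart path: if a checkpoint exists, resume from it.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")

        state["acc"] += state["step"]
        state["step"] += 1

        # Write to a temporary file and rename it, so a crash during the
        # write does not leave a half-written checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)

    return state["acc"]
```

Calling the function once with `fail_at` set and then again without it on the
same checkpoint path resumes from the saved step instead of starting over. The
whole job still stops at the failure and has to be relaunched, which is the
limitation I am asking about.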

Best Regards
Rui

2010/6/29 Jeff Squyres <jsquyres_at_[hidden]>

> On Jun 29, 2010, at 3:44 AM, Rui wrote:
>
> > 1. Suppose an MPI program runs on several nodes. If one node dies, will
> > the program terminate?
>
> Open MPI will terminate the whole job, yes.
>
> > 2. Is it possible to grow or shrink an MPI communicator? If so, could we
> > use a spare node to replace the dead node?
>
> Currently, no.
>
> Fault tolerance and resiliency are active topics of research and
> discussion in the MPI-3 forum. But for the moment, most MPI implementations
> -- including Open MPI -- have fairly draconian responses to the loss of a
> process and/or node (i.e., kill the rest of the job).
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]