Open MPI Development Mailing List Archives

Subject: [OMPI devel] How can I achieve node fail over
From: Sai Sudheesh (saisudheesh_at_[hidden])
Date: 2010-01-06 09:04:56


Hi,

       About two months ago I started experimenting with Open MPI.
       I found this piece of software very interesting.

       How can I make this software fault tolerant?

       At the moment I am running it on two machines with
       quad-core processors and Fedora 10.
       I am using Open MPI 1.3.2.

       If a remote machine fails while a parallel task is running on both
       machines, is it possible to reassign the work assigned to it to some
       other available node and continue the computation instead of
       aborting the entire computation?

       Can anybody tell me where to look for more information
       regarding this?
       I have tried FT-MPI but grew tired of it.
       I have also heard of CIFTS-FTB; can I use it to solve this?
       Is it necessary to make a source code change?

       Does anybody already have a solution?

       If an application is killed by the OS on the remote node,
       mpirun aborts and reports an error.
       What kind of signal does the remote orted send to mpirun?
       How can I handle it?

       I know that I have asked a lot of questions.
       I would be thankful if anybody could respond with
       at least some suggestions.

with love
sudheesh.