About two months ago I started experimenting with Open MPI,
and I find it a very interesting piece of software.
How can I make this software fault tolerant?
At the moment I am running it on two machines
with quad-core processors and Fedora 10.
I am using Open MPI 1.3.2.
If a remote machine fails while a parallel task is running on both machines,
is it possible to reassign the work that was assigned to it to some
other available node and
continue the computation instead of aborting the entire job?
Can anybody tell me where I should look for more information?
I have tried FT-MPI but got tired of it.
I have also heard of CIFTS-FTB. Can I use it to solve this?
Is it necessary to make a source code change?
Does anybody already have a solution?
If an application is killed by the OS on the remote node,
mpirun aborts and reports an error.
What kind of signal does the remote orted send to mpirun?
How can I handle it?
I know that I have asked a lot of questions.
I would be thankful if anybody could respond with
at least some suggestions.