Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] How can I achieve node fail over
From: Sai Sudheesh (saisudheesh_at_[hidden])
Date: 2010-01-11 22:37:50


Hi Josh,

           First of all, thanks for your response.
           There were some typos in my mail
           that made parts of it vague.

           Let me describe the scenario from my
           previous mail in more detail.
           What I tried is as follows.

           I assigned a parallel task taking a few minutes
           (matrix multiplication of order 2048) to
           two machines connected through Ethernet.
           While the multiplication was running,
           I unplugged the Ethernet cable.
           As a result, mpirun waited indefinitely.
           I needed a mechanism to detect the failed
           link.

           So I tried running mpirun with the MCA parameter
           -heartbeat-rate 1.
           Now mpirun detected the link failure
           and aborted after printing the IP address of the
           unreachable node on the terminal.

           At this point I want to catch this fault and,
           instead of displaying the error message on screen
           and aborting the whole job,
           reassign the task to some
           reachable node.
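
           The behavior asked for above can be sketched outside of MPI.
           The following is only an illustrative simulation, not an
           Open MPI feature or API: the Coordinator class, its methods,
           the node names, and the task labels are all hypothetical.
           It treats a node as failed once no heartbeat has arrived
           within a timeout and hands that node's tasks to the
           least-loaded reachable node.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Hypothetical helper: tracks heartbeats and moves work off silent nodes."""
    timeout: float                                    # seconds of silence tolerated
    last_seen: dict = field(default_factory=dict)     # node -> time of last heartbeat
    assignments: dict = field(default_factory=dict)   # node -> list of task labels

    def heartbeat(self, node, now):
        """Record that `node` was reachable at time `now`."""
        self.last_seen[node] = now
        self.assignments.setdefault(node, [])

    def assign(self, node, task):
        """Give `task` to `node`."""
        self.assignments.setdefault(node, []).append(task)

    def failed_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

    def reassign_failed(self, now):
        """Move tasks from failed nodes to the least-loaded reachable node."""
        orphaned = []
        for node in self.failed_nodes(now):
            orphaned.extend(self.assignments.pop(node, []))
            del self.last_seen[node]
        if orphaned and not self.last_seen:
            raise RuntimeError("no reachable node left to take over the work")
        moved = {}
        for task in orphaned:
            target = min(self.last_seen, key=lambda n: len(self.assignments[n]))
            self.assignments[target].append(task)
            moved[task] = target
        return moved


coord = Coordinator(timeout=3.0)
for name in ("node-a", "node-b", "node-c"):
    coord.heartbeat(name, now=0.0)
coord.assign("node-a", "rows 0-1023")
coord.assign("node-b", "rows 1024-2047")

# Simulate node-b losing its link: only node-a and node-c keep reporting.
coord.heartbeat("node-a", now=5.0)
coord.heartbeat("node-c", now=5.0)
moved = coord.reassign_failed(now=5.0)
print(moved)
```

           In a real job the reassigned work would also need its input
           data and any partial results, which this sketch ignores.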

           I hope this time I expressed it clearly.
           Thanks.

With Love
sudheesh

On 1/12/10, Josh Hursey <jjhursey_at_[hidden]> wrote:
>
> On Jan 6, 2010, at 9:04 AM, Sai Sudheesh wrote:
>
>> Hi,
>>
>> Just about two months ago I started experimenting with Open MPI.
>> I found this piece of software very interesting.
>>
>> How can I make this software fault tolerant?
>
> Depends on what you mean by fault tolerant. :)
>
>> As of now I am running this software on two machines
>> with quad-core processors and Fedora 10.
>> I am using Open MPI 1.3.2.
>>
>> If a remote machine fails while a parallel task is running on both
>> machines,
>> is it possible to reassign the task assigned to it to some
>> other available node and
>> continue the computation instead of aborting the entire
>> computation?
>
> This scenario is currently not supported by Open MPI. If an MPI
> process fails, Open MPI will clean up the job.
>
> A few of us have been working on this scenario off-trunk for a while
> now. It is progressing nicely, but not available for public
> consumption just yet.
>
>
>> Can anybody tell me where to look for more information
>> regarding this?
>> I have tried FT-MPI but grew tired of it.
>
> FT-MPI should be able to work in this scenario.
>
>> I have also heard of CIFTS-FTB. Can I use it to solve this?
>
> The CIFTS FTB is focused on a slightly different problem, that of
> coordination amongst software components before/during/after a
> failure. Currently, Open MPI is able to interact with the CIFTS FTB to
> send fault information. Soon, Open MPI will be able to respond to such
> fault information and take appropriate actions. The first generation
> of this work is scheduled to be brought into the Open MPI trunk soon,
> and will support catching of some basic events. Handling the scenario
> you mentioned at the top of the message will come shortly thereafter.
>
>> Is it necessary to make a source code change?
>
> In some cases yes, in others no. It really depends on what the final
> solution set looks like and how involved your application wants to be
> in the recovery process. At the very least, the application will
> likely have to specify the MPI_ERRORS_RETURN error handler for each
> communicator to override the default MPI_ERRORS_ARE_FATAL.
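
The error-handler change mentioned above can be sketched as follows. This is a minimal sketch, assuming the third-party mpi4py binding is installed; the helper name send_or_report is hypothetical. It only shows how an error is returned to the caller instead of aborting immediately. Whether the job can actually continue after a peer failure depends on the MPI implementation.

```python
def send_or_report(comm, payload, dest):
    """Try a send; return None on success or the MPI error string on failure.

    Hypothetical helper. `comm` is expected to be an mpi4py communicator.
    """
    from mpi4py import MPI  # third-party binding, assumed to be installed

    # Replace the default MPI_ERRORS_ARE_FATAL handler on this communicator
    # so that errors are reported to the caller rather than aborting the job.
    comm.Set_errhandler(MPI.ERRORS_RETURN)
    try:
        comm.send(payload, dest=dest)
    except MPI.Exception as exc:
        # mpi4py surfaces a non-success return code as MPI.Exception.
        return exc.Get_error_string()
    return None
```

A caller would pass a communicator such as MPI.COMM_WORLD and decide what to do when a non-None error string comes back, for example skipping the unreachable rank.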
>
>
>> Does anybody already have a solution?
>
> There are a couple of transparent fault tolerance solutions in the
> current trunk.
> - Checkpoint/Restart of the entire MPI job (requires full job
> restart on failure)
> http://www.osl.iu.edu/research/ft/ompi-cr/
> - Message Logging:
> https://svn.open-mpi.org/trac/ompi/wiki/EventLog_CR
>
> For non-MPI jobs you could also check out the Open Resilient Cluster
> Manager (ORCM) project:
> http://www.open-mpi.org/projects/orcm/
>
>>
>> If an application is killed by the OS on the remote node,
>> mpirun aborts and reports an error.
>> What kind of signal does the remote orted send to mpirun?
>> How can I handle it?
>
> I'm not sure what you're asking here. The orted detects the local
> process failure and notifies the mpirun process using the OOB (out-of-
> band) communication channel. The mpirun process then initiates the
> shutdown procedure.
>
> -- Josh
>
>>
>> I know that I have asked a lot of questions.
>> I would be thankful if anybody could respond with
>> at least some suggestions.
>>
>> with love
>> sudheesh.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
regards
sai sudheesh