Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] fault tolerance in open mpi
From: vipin kumar (vipinkumar41_at_[hidden])
Date: 2009-12-21 03:37:37


Hello folks,

As I explained earlier, I am looking for fault tolerance in MPI
programs. I read in the MPI 2.1 standard document that two DISCONNECTED
processes do not affect each other, i.e. either can die or be killed
without affecting the other.

So, to achieve fault tolerance, I tried using MPI::Comm::Disconnect() to
disconnect the CHILD process, which was spawned by calling
MPI::Intracomm::Spawn(), from the PARENT process. I call
MPI::Comm::Disconnect() from the CHILD process immediately after calling
MPI::Init(). It seems that the CHILD process never returns from this call.

I tried MPI::Comm::Free() too, but that does not work either: the process
does not progress past the call. If I comment out these statements,
everything works fine. Note that I have tried this on Solaris as well as on
Linux (Fedora Core).

My question is whether Open MPI supports disconnecting two processes (such
as a child from its parent), and if so, how?

Thanks & Regards,

On Wed, Sep 23, 2009 at 6:41 PM, Josh Hursey <jjhursey_at_[hidden]> wrote:

> Unfortunately I cannot provide a precise time frame for availability at
> this point, but we are targeting the v1.5 release series. There is a handful
> of core developers working on this issue at the moment. Pieces of this work
> have already made it into the Open MPI development trunk. If you want to
> play around with what is available try turning on the resilient mapper:
> -mca rmaps resilient
>
> We will be sure to email the list once this work becomes more stable and
> available.
>
> -- Josh
>
>
> On Sep 18, 2009, at 2:56 AM, vipin kumar wrote:
>
>> Hi Josh,
>>
>> It is good to hear from you that work is in progress towards resiliency of
>> Open MPI. I have been waiting for this capability in Open MPI. I have
>> almost finished my development work and am waiting for it so that I
>> can test my programs. It would be good if you could tell me how long it
>> will take to make Open MPI a resilient implementation. By resiliency I
>> mean that abnormal termination or intentional killing of a process should
>> not cause any other (parent or sibling) process to be terminated, given
>> that the processes are connected.
>>
>> thanks.
>>
>> Regards,
>>
>> On Mon, Aug 3, 2009 at 8:37 PM, Josh Hursey <jjhursey_at_[hidden]>
>> wrote:
>> Task-farm or manager/worker recovery models typically depend on
>> intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI
>> implementation. William Gropp and Ewing Lusk have a paper entitled "Fault
>> Tolerance in MPI Programs" that outlines how an application might take
>> advantage of these features in order to recover from process failure.
>>
>> However, these techniques strongly depend upon resilient MPI
>> implementations, and behaviors that, some may argue, are non-standard.
>> Unfortunately there are not many MPI implementations that are sufficiently
>> resilient in the face of process failure to support failure in task-farm
>> scenarios. Though Open MPI supports the current MPI 2.1 standard, it is not
>> as resilient to process failure as it could be.
>>
>> There are a number of people working on improving the resiliency of Open
>> MPI in the face of network and process failure (including myself). We have
>> started to move some of the resiliency work into the Open MPI trunk.
>> Resiliency in Open MPI has been improving over the past few months, but I
>> would not assess it as ready quite yet. Most of the work has focused on the
>> runtime level (ORTE), and there are still some MPI level (OMPI) issues that
>> need to be worked out.
>>
>> With all of that being said, I would try some of the techniques presented
>> in the Gropp/Lusk paper in your application. Then test it with Open MPI and
>> let us know how it goes.
>>
>> Best,
>> Josh
>>
>>
>> On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:
>>
>> Is that kind of approach possible within an MPI framework? Perhaps a
>> grid approach would be better. More experienced people, speak up,
>> please?
>> (The reason I say that is that I too am interested in the solution of
>> that kind of problem, where an individual blade of a blade server
>> fails and correcting for that failure on the fly is better than taking
>> checkpoints and restarting the whole process excluding the failed
>> blade.)
>>
>> Durga
>>
>> On Mon, Aug 3, 2009 at 9:21 AM, jody<jody.xha_at_[hidden]> wrote:
>> Hi
>>
>> I guess "task-farming" could give you a certain amount of the kind of
>> fault-tolerance you want
>> (i.e. a master process distributes tasks to idle slave processes;
>> however, this will only work if the slave processes don't need to
>> communicate with each other).
>>
>> Jody
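
The task-farming pattern described above can be modeled without any MPI calls. The following is only a minimal C++ sketch of the master's bookkeeping, not an Open MPI example: Outcome, Worker, and run_task_farm are hypothetical names, and a worker death is modeled as a call that reports failure. How a real master would detect a dead worker depends on the resilience of the MPI implementation, which this sketch does not address.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical result type: ok == false models a worker that died while
// holding the task; value is meaningful only when ok is true.
struct Outcome {
    bool ok;
    int value;
};

// Hypothetical worker interface: takes a task and reports an Outcome.
using Worker = std::function<Outcome(int)>;

// Master loop: hands each pending task to the next live worker in
// round-robin order. A worker that fails is retired and its task is put
// back on the queue for another worker. Results are indexed by task
// position; an entry stays {false, 0} if every worker died first.
std::vector<Outcome> run_task_farm(const std::vector<int>& tasks,
                                   const std::vector<Worker>& workers) {
    std::vector<Outcome> results(tasks.size(), Outcome{false, 0});
    std::vector<bool> alive(workers.size(), true);
    std::size_t live = workers.size();

    std::deque<std::size_t> pending;
    for (std::size_t i = 0; i < tasks.size(); ++i) pending.push_back(i);

    std::size_t next = 0;
    while (!pending.empty() && live > 0) {
        // Skip retired workers; live > 0 guarantees this loop terminates.
        while (!alive[next]) next = (next + 1) % workers.size();
        const std::size_t w = next;
        next = (next + 1) % workers.size();

        const std::size_t t = pending.front();
        pending.pop_front();

        const Outcome r = workers[w](tasks[t]);
        if (r.ok) {
            results[t] = r;
        } else {
            alive[w] = false;      // retire the failed worker
            --live;
            pending.push_back(t);  // re-queue its task for a survivor
        }
    }
    return results;
}
```

With one failing and one healthy worker, every task still completes because the failed worker's task is handed to the survivor. This only holds under the condition stated above: the workers must not need to communicate with each other.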
>>
>>
>> On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar<vipinkumar41_at_[hidden]>
>> wrote:
>> Hi all,
>>
>> Thanks Durga for your reply.
>>
>> Jeff, you once wrote code for the Mandelbrot set to demonstrate fault
>> tolerance in LAM/MPI, i.e. killing any slave process doesn't affect the
>> others. That is exactly the behaviour I am looking for in Open MPI. I
>> attempted it, but had no luck. Can you please tell me how to write such
>> programs in Open MPI?
>>
>> Thanks in advance.
>>
>> Regards,
>> On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury <dpchoudh_at_[hidden]>
>> wrote:
>>
>> Although I have perhaps the least experience on the topic in this
>> list, I will take a shot; more experienced people, please correct me:
>>
>> The MPI standard specifies communication mechanisms, not fault tolerance
>> at any level. You may achieve network fault tolerance at the IP level by
>> implementing 'equal cost multipath' routes (which means having two equally
>> capable NICs connected to the same destination and modifying the
>> kernel routing table to use both cards; the kernel will dynamically
>> load balance). At the MAC level, you can achieve the same effect by
>> trunking multiple network cards.
>>
>> You can achieve process-level fault tolerance with a checkpointing
>> scheme such as BLCR, which has been tested to work with Open MPI (and
>> with other processes as well).
>>
>> Durga
>>
>> On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar<vipinkumar41_at_[hidden]>
>> wrote:
>>
>> Hi all,
>>
>> I want to know whether Open MPI supports network and process fault
>> tolerance or not. If there is any example demonstrating these features,
>> that would be best.
>>
>> Regards,
>> --
>> Vipin K.
>> Research Engineer,
>> C-DOTB, India
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> Vipin K.
>> Research Engineer,
>> C-DOTB, India
>>
>> --
>> Vipin K.
>> Research Engineer,
>> C-DOTB, India

-- 
Vipin K.
Research Engineer,
C-DOTB, India