Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] fault tolerance in open mpi
From: vipin kumar (vipinkumar41_at_[hidden])
Date: 2009-12-24 02:21:36


Dear all,

May I help with this work? I can't promise major contributions or much
availability, because I may get busier at work, and I am not sure whether
my company will allow it. In any case, I could contribute in my spare time.

Thanks & Regards,

On 12/23/09, Ralph Castain <rhc_at_[hidden]> wrote:
> That's just OMPI's default behavior - as Josh said, we are working towards
> allowing other behaviors, but for now, this is what we have.
>
>
> On Dec 23, 2009, at 5:40 AM, vipin kumar wrote:
>
>> Thank you Ralph,
>>
>> I did as you said. The programs run fine, but killing one process still
>> terminates all the other processes. Am I missing something? Does anything
>> else need to be called along with MPI::Comm::Disconnect()?
>>
>> Thanks & Regards,
>>
>> On Mon, Dec 21, 2009 at 8:00 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> Disconnect is a -collective- operation. Both parent and child have to call
>> it. Your child process is "hanging" while it waits for the parent.
>>
>> On Dec 21, 2009, at 1:37 AM, vipin kumar wrote:
>>
>>> Hello folks,
>>>
>>> As I explained earlier, I am looking for fault tolerance in MPI
>>> programs. I read in the MPI 2.1 standard document that two
>>> DISCONNECTED processes do not affect each other, i.e. they can die or
>>> be killed without affecting other processes.
>>>
>>> So I tried to achieve fault tolerance by using MPI::Comm::Disconnect()
>>> to disconnect the CHILD process, which was spawned by calling
>>> MPI::Intracomm::Spawn(), from its PARENT process. I call
>>> MPI::Comm::Disconnect() from the CHILD process immediately after calling
>>> MPI::Init(). It seems that the CHILD process never returns from this call.
>>>
>>>
>>> I tried MPI::Comm::Free() too, but that does not work either: the process
>>> does not progress past the call. If I comment out these statements,
>>> everything works fine. Note that I have tried this on Solaris as well as
>>> on Linux (Fedora Core).
>>>
>>> My question is whether Open MPI supports disconnecting two processes
>>> (such as a child from its parent), and if so, how?
>>>
>>>
>>> Thanks & Regards,
>>>
>>> On Wed, Sep 23, 2009 at 6:41 PM, Josh Hursey <jjhursey_at_[hidden]>
>>> wrote:
>>> Unfortunately I cannot provide a precise time frame for availability at
>>> this point, but we are targeting the v1.5 release series. There is a
>>> handful of core developers working on this issue at the moment. Pieces of
>>> this work have already made it into the Open MPI development trunk. If
>>> you want to play around with what is available try turning on the
>>> resilient mapper:
>>> -mca rmaps resilient
>>>
>>> We will be sure to email the list once this work becomes more stable and
>>> available.
>>>
>>> -- Josh
>>>
>>>
>>> On Sep 18, 2009, at 2:56 AM, vipin kumar wrote:
>>>
>>> Hi Josh,
>>>
>>> It is good to hear from you that work on the resiliency of Open MPI is
>>> in progress. I have been waiting for this capability in Open MPI. I have
>>> almost finished my development work and am waiting for it so that I can
>>> test my programs. It would be good if you could tell me how long it will
>>> take to make Open MPI a resilient implementation. By resiliency I mean
>>> that abnormal termination or intentional killing of a process should not
>>> cause any other (parent or sibling) process to be terminated, even when
>>> the processes are connected.
>>>
>>> thanks.
>>>
>>> Regards,
>>>
>>> On Mon, Aug 3, 2009 at 8:37 PM, Josh Hursey <jjhursey_at_[hidden]>
>>> wrote:
>>> Task-farm or manager/worker recovery models typically depend on
>>> intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI
>>> implementation. William Gropp and Ewing Lusk have a paper entitled "Fault
>>> Tolerance in MPI Programs" that outlines how an application might take
>>> advantage of these features in order to recover from process failure.
>>>
>>> However, these techniques strongly depend upon resilient MPI
>>> implementations, and behaviors that, some may argue, are non-standard.
>>> Unfortunately there are not many MPI implementations that are
>>> sufficiently resilient in the face of process failure to support failure
>>> in task-farm scenarios. Though Open MPI supports the current MPI 2.1
>>> standard, it is not as resilient to process failure as it could be.
>>>
>>> There are a number of people working on improving the resiliency of Open
>>> MPI in the face of network and process failure (including myself). We
>>> have started to move some of the resiliency work into the Open MPI trunk.
>>> Resiliency in Open MPI has been improving over the past few months, but I
>>> would not assess it as ready quite yet. Most of the work has focused on
>>> the runtime level (ORTE), and there are still some MPI level (OMPI)
>>> issues that need to be worked out.
>>>
>>> With all of that being said, I would try some of the techniques presented
>>> in the Gropp/Lusk paper in your application. Then test it with Open MPI
>>> and let us know how it goes.
>>>
>>> Best,
>>> Josh
>>>
>>>
>>> On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:
>>>
>>> Is that kind of approach possible within an MPI framework? Perhaps a
>>> grid approach would be better. More experienced people, speak up,
>>> please?
>>> (The reason I say that is that I too am interested in a solution to
>>> that kind of problem, where an individual blade of a blade server
>>> fails and correcting for that failure on the fly is better than taking
>>> checkpoints and restarting the whole process excluding the failed
>>> blade.)
>>>
>>> Durga
>>>
>>> On Mon, Aug 3, 2009 at 9:21 AM, jody<jody.xha_at_[hidden]> wrote:
>>> Hi
>>>
>>> I guess "task-farming" could give you a certain amount of the kind of
>>> fault tolerance you want
>>> (i.e. a master process distributes tasks to idle slave processes;
>>> however, this will only work if the slave processes don't need to
>>> communicate with each other).
>>>
>>> Jody
>>>
>>>
>>> On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar<vipinkumar41_at_[hidden]>
>>> wrote:
>>> Hi all,
>>>
>>> Thanks Durga for your reply.
>>>
>>> Jeff, you once wrote a Mandelbrot set program to demonstrate fault
>>> tolerance in LAM/MPI, i.e. killing any slave process doesn't affect the
>>> others. That is exactly the behaviour I am looking for in Open MPI. I
>>> tried, but had no luck. Could you please explain how to write such
>>> programs in Open MPI?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury <dpchoudh_at_[hidden]>
>>> wrote:
>>>
>>> Although I have perhaps the least experience on the topic in this
>>> list, I will take a shot; more experienced people, please correct me:
>>>
>>> The MPI standard specifies communication mechanisms, not fault tolerance
>>> at any level. You may achieve network fault tolerance at the IP level by
>>> implementing 'equal cost multipath' routes (which means two equally
>>> capable NICs connecting to the same destination and modifying the
>>> kernel routing table to use both cards; the kernel will dynamically
>>> load balance). At the MAC level, you can achieve the same effect by
>>> trunking multiple network cards.
>>>
>>> You can achieve process-level fault tolerance with a checkpointing
>>> scheme such as BLCR, which has been tested to work with Open MPI (and
>>> with other processes as well).
>>>
>>> Durga
>>>
>>> On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar<vipinkumar41_at_[hidden]>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> I want to know whether Open MPI supports network and process fault
>>> tolerance. If there is an example demonstrating these features, that
>>> would be best.
>>>
>>> Regards,
>>> --
>>> Vipin K.
>>> Research Engineer,
>>> C-DOTB, India
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Vipin K.
Research Engineer,
C-DOTB, India