Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] fault tolerance in open mpi
From: vipin kumar (vipinkumar41_at_[hidden])
Date: 2009-09-18 02:56:39

Hi Josh,

It is good to hear that work is in progress on the resiliency of
Open MPI. I have been waiting for this capability. I have almost
finished my development work and am waiting for it so that I can test
my programs. It would help if you could tell me how long it will take
to make Open MPI a resilient implementation. By resiliency I mean that
abnormal termination or intentional killing of a process should not cause
any (parent or sibling) process to be terminated, given that processes are



On Mon, Aug 3, 2009 at 8:37 PM, Josh Hursey <jjhursey_at_[hidden]> wrote:

> Task-farm or manager/worker recovery models typically depend on
> intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI
> implementation. William Gropp and Ewing Lusk have a paper entitled "Fault
> Tolerance in MPI Programs" that outlines how an application might take
> advantage of these features in order to recover from process failure.
> However, these techniques strongly depend upon resilient MPI
> implementations, and behaviors that, some may argue, are non-standard.
> Unfortunately there are not many MPI implementations that are sufficiently
> resilient in the face of process failure to support failure in task-farm
> scenarios. Though Open MPI supports the current MPI 2.1 standard, it is not
> as resilient to process failure as it could be.
> There are a number of people working on improving the resiliency of Open
> MPI in the face of network and process failure (including myself). We have
> started to move some of the resiliency work into the Open MPI trunk.
> Resiliency in Open MPI has been improving over the past few months, but I
> would not assess it as ready quite yet. Most of the work has focused on the
> runtime level (ORTE), and there are still some MPI level (OMPI) issues that
> need to be worked out.
> With all of that being said, I would try some of the techniques presented
> in the Gropp/Lusk paper in your application. Then test it with Open MPI and
> let us know how it goes.
> Best,
> Josh
> On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:
>> Is that kind of approach possible within an MPI framework? Perhaps a
>> grid approach would be better. More experienced people, speak up,
>> please?
>> (The reason I say that is that I too am interested in the solution of
>> that kind of problem, where an individual blade of a blade server
>> fails and correcting for that failure on the fly is better than taking
>> checkpoints and restarting the whole process excluding the failed
>> blade.)
>> Durga
>> On Mon, Aug 3, 2009 at 9:21 AM, jody<jody.xha_at_[hidden]> wrote:
>>> Hi
>>> I guess "task-farming" could give you a certain amount of the kind of
>>> fault-tolerance you want.
>>> (i.e. a master process distributes tasks to idle slave processors;
>>> however, this will only work if the slave processes don't need to
>>> communicate with each other)
>>> Jody
>>> On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar<vipinkumar41_at_[hidden]>
>>> wrote:
>>>> Hi all,
>>>> Thanks Durga for your reply.
>>>> Jeff, you once wrote code for the Mandelbrot set to demonstrate fault
>>>> tolerance in LAM-MPI, i.e. killing any slave process doesn't affect
>>>> the others. That is exactly the behaviour I am looking for in Open MPI.
>>>> I attempted it, but had no luck. Can you please tell me how to write
>>>> such programs in Open MPI?
>>>> Thanks in advance.
>>>> Regards,
>>>> On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury <dpchoudh_at_[hidden]>
>>>> wrote:
>>>>> Although I have perhaps the least experience on the topic in this
>>>>> list, I will take a shot; more experienced people, please correct me:
>>>>> MPI standards specify communication mechanisms, not fault tolerance at
>>>>> any level. You may achieve network fault tolerance at the IP level by
>>>>> implementing 'equal cost multipath' routes (which means two equally
>>>>> capable NICs connecting to the same destination and modifying the
>>>>> kernel routing table to use both cards; the kernel will dynamically
>>>>> load balance). At the MAC level, you can achieve the same effect by
>>>>> trunking multiple network cards.
>>>>> You can achieve process level fault tolerance by a checkpointing
>>>>> scheme such as BLCR, which has been tested to work with Open MPI (and
>>>>> other processes as well).
>>>>> Durga
>>>>> On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar<vipinkumar41_at_[hidden]>
>>>>> wrote:
>>>>>> Hi all,
>>>>>> I want to know whether Open MPI supports network and process fault
>>>>>> tolerance. If there is an example demonstrating these features, that
>>>>>> would be best.
>>>>>> Regards,
>>>>>> --
>>>>>> Vipin K.
>>>>>> Research Engineer,
>>>>>> C-DOTB, India
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>> --
>>>> Vipin K.
>>>> Research Engineer,
>>>> C-DOTB, India

Vipin K.
Research Engineer,
C-DOTB, India
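
The task-farm (manager/worker) recovery model discussed in this thread can be sketched roughly as follows. This is only a minimal simulation of the manager-side bookkeeping, not MPI code and not the method from the Gropp/Lusk paper: worker failure is simulated by raising an exception, whereas in a real program the manager would have to detect it through whatever error reporting its MPI implementation provides on the worker's intercommunicator, which, as noted above, depends on how resilient the implementation is. The names run_task_farm, WorkerFailed, and fails_on are hypothetical and exist only for this illustration.

```python
from collections import deque


class WorkerFailed(Exception):
    """Simulated stand-in for a communication error with a dead worker."""


def run_task_farm(tasks, workers, compute, fails_on=None):
    """Complete every task even if some workers die (simulation only).

    tasks    -- list of task inputs
    workers  -- list of hypothetical worker names
    compute  -- function a worker applies to one task
    fails_on -- dict: worker name -> number of tasks it finishes before dying
    Returns (results in task order, list of workers that died).
    """
    fails_on = fails_on or {}
    pending = deque(enumerate(tasks))      # tasks not yet completed
    alive = deque(workers)                 # workers believed to be usable
    finished = {w: 0 for w in workers}     # tasks completed per worker
    results = {}
    dead = []

    def send_to_worker(worker, value):
        # Assumption: a dead worker shows up as an error on the exchange.
        if worker in fails_on and finished[worker] >= fails_on[worker]:
            raise WorkerFailed(worker)
        finished[worker] += 1
        return compute(value)

    while pending:
        if not alive:
            raise RuntimeError("all workers failed; tasks remain")
        index, value = pending.popleft()
        worker = alive.popleft()
        try:
            results[index] = send_to_worker(worker, value)
        except WorkerFailed:
            dead.append(worker)                 # stop using this worker
            pending.appendleft((index, value))  # requeue the lost task
            continue
        alive.append(worker)                    # round-robin over survivors

    return [results[i] for i in range(len(tasks))], dead


if __name__ == "__main__":
    # Worker "w1" completes one task and then dies; its next task is requeued.
    print(run_task_farm([1, 2, 3, 4, 5], ["w0", "w1", "w2"],
                        lambda x: x * x, fails_on={"w1": 1}))
```

The point of the sketch is the one jody raised: this only works because each task is independent, so losing a worker costs nothing more than the one task it was holding.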