Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Fault tolerance
From: Aurélien Bouteiller (bouteill_at_[hidden])
Date: 2008-03-06 13:38:43


Aside from what Josh said, we are working right now at UTK on orted/MPI
recovery (without killing/respawning everything). So far we have had no use
for the errmgr, but I'm quite sure it would be the smartest place to
put all the mechanisms we are currently trying.

Aurelien
On 6 March 2008, at 11:17, Ralph Castain wrote:

> Ah - ok, thanks for clarifying! I'm happy to leave it around, but wasn't
> sure if/where it fit into anyone's future plans.
>
> Thanks
> Ralph
>
>
>
> On 3/6/08 9:13 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>
>> The checkpoint/restart work that I have integrated does not respond to
>> failure at the moment. If a failure happens I want ORTE to terminate
>> the entire job. I will then restart the entire job from a checkpoint
>> file. This follows the 'all fall down' approach that users typically
>> expect when using a global C/R technique.
>>
>> Eventually I want to integrate something better where I can respond to
>> a failure with a recovery from inside ORTE. I'm not there yet, but
>> hopefully in the near future.
>>
>> I'll let the UTK group talk about what they are doing with ORTE, but I
>> suspect they will be taking advantage of the errmgr to help respond to
>> failure and restart a single process.
>>
>>
>> It is important to consider in this context that we do *not* always
>> want ORTE to abort whenever it detects a process failure. This is the
>> default mode for MPI applications (MPI_ERRORS_ARE_FATAL), and should
>> be supported. But there is another mode (MPI_ERRORS_RETURN) in which we
>> would like ORTE to keep running in order to conform with the standard:
>> http://www.mpi-forum.org/docs/mpi-11-html/node148.html
>>
>> It is known that certain standards-conformant MPI "fault tolerant"
>> programs do not work in Open MPI for various reasons, some in the
>> runtime and some external. Here we are mostly talking about
>> disconnected fates of intra-communicator groups. I have a test in the
>> ompi-tests repository that illustrates this problem, but I do not have
>> time to fix it at the moment.
>>
>>
>> So in short, keep the errmgr around for now. I suspect we will be using
>> it, and possibly tweaking it, in the nearish future.
>>
>> Thanks for the observation.
>>
>> Cheers,
>> Josh
>>
>> On Mar 6, 2008, at 10:44 AM, Ralph Castain wrote:
>>
>>> Hello
>>>
>>> I've been doing some work on fault response within the system, and finally
>>> realized something I should probably have seen a while back. Perhaps I am
>>> misunderstanding somewhere, so forgive the ignorance if so.
>>>
>>> When we designed ORTE some time in the deep, dark past, we had envisioned
>>> that people might want multiple ways of responding to process faults and/or
>>> abnormal terminations. You might want to just abort the job, attempt to
>>> restart just that proc, attempt to restart the job, etc. To support these
>>> multiple options, and to provide a means for people to simply try new ones,
>>> we created the errmgr framework.
>>>
>>> Our thought was that a process and/or daemon would call the errmgr when we
>>> detected something abnormal happening, and that the selected errmgr
>>> component could then do whatever fault response was desired.
>>>
>>> However, I now see that the fault tolerance mechanisms inside of OMPI do not
>>> seem to be using that methodology. Instead, we have hard-coded a particular
>>> response into the system.
>>>
>>> If we configure without FT, we just abort the entire job since that is the
>>> only errmgr component that exists.
>>>
>>> If we configure with FT, then we execute the hard-coded C/R methodology.
>>> This is built directly into the code, so there is no option as to what
>>> happens.
>>>
>>> Is there a reason why the errmgr framework was not used? Did the FT team
>>> decide that this was not a useful tool to support multiple FT strategies?
>>> Can we modify it to better serve those needs, or is it simply not feasible?
>>>
>>> If it isn't going to be used for that purpose, then I might as well remove
>>> it. As things stand, there really is no purpose served by the errmgr
>>> framework - might as well replace it with just a function call.
>>>
>>> Appreciate any insights
>>> Ralph
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
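
For readers following the thread, the errmgr idea in Ralph's original message (a framework whose selected component decides what to do when a process fails) can be sketched roughly as below. This is only an illustration of the pattern, not ORTE's actual errmgr API: every name here (action_t, errmgr_component_t, errmgr_select, action_name, and the "abort"/"respawn"/"cr" component names) is hypothetical and was made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fault responses, taken from the options listed in the thread:
 * abort the job, restart just the failed proc, or restart the whole job. */
typedef enum {
    ACTION_ABORT_JOB,
    ACTION_RESTART_PROC,
    ACTION_RESTART_JOB
} action_t;

/* A component is a name plus a callback invoked when a process fails.
 * (Illustrative layout only; not the real ORTE component structure.) */
typedef struct {
    const char *name;
    action_t (*proc_failed)(int rank);
} errmgr_component_t;

static action_t abort_job(int rank)               { (void)rank; return ACTION_ABORT_JOB; }
static action_t respawn_proc(int rank)            { (void)rank; return ACTION_RESTART_PROC; }
static action_t restart_from_checkpoint(int rank) { (void)rank; return ACTION_RESTART_JOB; }

/* Table of available components; the first entry doubles as the default. */
static const errmgr_component_t components[] = {
    { "abort",   abort_job },
    { "respawn", respawn_proc },
    { "cr",      restart_from_checkpoint },
};

/* Pick a component by name; unknown names fall back to "abort". */
const errmgr_component_t *errmgr_select(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof components / sizeof components[0]; i++) {
        if (strcmp(components[i].name, name) == 0) {
            return &components[i];
        }
    }
    return &components[0];
}

/* Human-readable label for an action, e.g. for logging. */
const char *action_name(action_t a)
{
    switch (a) {
    case ACTION_ABORT_JOB:    return "abort job";
    case ACTION_RESTART_PROC: return "restart proc";
    case ACTION_RESTART_JOB:  return "restart job";
    }
    return "unknown";
}
```

A caller that detects a failure of, say, rank 3 would call errmgr_select("respawn")->proc_failed(3) and act on the returned action. The point of the indirection is that the detection code never hard-codes the response; swapping the selected component changes the behaviour.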