Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Fault tolerance
From: Aurélien Bouteiller (bouteill_at_[hidden])
Date: 2008-03-07 11:59:43


We now use the errmgr.

Aurelien

On 6 March 2008, at 13:38, Aurélien Bouteiller wrote:

> Aside from what Josh said, we are working right now at UTK on orted/MPI
> recovery (without killing/respawning everything). So far we have had no
> use for the errmgr, but I'm quite sure it would be the smartest place to
> put all the mechanisms we are trying now.
>
> Aurelien
> On 6 March 2008, at 11:17, Ralph Castain wrote:
>
>> Ah - ok, thanks for clarifying! I'm happy to leave it around, but
>> wasn't sure if/where it fit into anyone's future plans.
>>
>> Thanks
>> Ralph
>>
>>
>>
>> On 3/6/08 9:13 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>
>>> The checkpoint/restart work that I have integrated does not respond
>>> to failure at the moment. If a failure happens I want ORTE to
>>> terminate the entire job. I will then restart the entire job from a
>>> checkpoint file. This follows the 'all fall down' approach that users
>>> typically expect when using a global C/R technique.
>>>
>>> Eventually I want to integrate something better where I can respond
>>> to a failure with a recovery from inside ORTE. I'm not there yet, but
>>> hopefully in the near future.
>>>
>>> I'll let the UTK group talk about what they are doing with ORTE, but
>>> I suspect they will be taking advantage of the errmgr to help respond
>>> to failure and restart a single process.
>>>
>>>
>>> It is important to consider in this context that we do *not* always
>>> want ORTE to abort whenever it detects a process failure. Aborting is
>>> the default mode for MPI applications (MPI_ERRORS_ARE_FATAL), and
>>> should be supported. But there is another mode (MPI_ERRORS_RETURN)
>>> in which we would like ORTE to keep running, to conform with:
>>> http://www.mpi-forum.org/docs/mpi-11-html/node148.html
>>>
>>> It is known that certain standards-conformant MPI "fault tolerant"
>>> programs do not work in Open MPI for various reasons, some in the
>>> runtime and some external. Here we are mostly talking about
>>> disconnected fates of intra-communicator groups. I have a test in
>>> the ompi-tests repository that illustrates this problem, but I do
>>> not have time to fix it at the moment.
>>>
>>>
>>> So, in short, keep the errmgr around for now. I suspect we will be
>>> using it, and possibly tweaking it, in the nearish future.
>>>
>>> Thanks for the observation.
>>>
>>> Cheers,
>>> Josh
>>>
>>> On Mar 6, 2008, at 10:44 AM, Ralph Castain wrote:
>>>
>>>> Hello
>>>>
>>>> I've been doing some work on fault response within the system, and
>>>> finally realized something I should probably have seen a while
>>>> back. Perhaps I am misunderstanding somewhere, so forgive the
>>>> ignorance if so.
>>>>
>>>> When we designed ORTE some time in the deep, dark past, we had
>>>> envisioned that people might want multiple ways of responding to
>>>> process faults and/or abnormal terminations. You might want to just
>>>> abort the job, attempt to restart just that proc, attempt to
>>>> restart the job, etc. To support these multiple options, and to
>>>> provide a means for people to simply try new ones, we created the
>>>> errmgr framework.
>>>>
>>>> Our thought was that a process and/or daemon would call the errmgr
>>>> when we detected something abnormal happening, and that the
>>>> selected errmgr component could then do whatever fault response was
>>>> desired.
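[Editorial sketch, not part of the original message.] The design described above (a single call-in point, with the selected component deciding the response) could look roughly like the following in C. This is only an illustration under stated assumptions: every name here (`errmgr_component_t`, `errmgr_select`, `errmgr_report_fault`, the `RESPONSE_*` values, and the component names "abort" and "restart_proc") is hypothetical and is not taken from the actual ORTE errmgr framework or its API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fault responses; illustrative names, not ORTE's. */
typedef enum {
    RESPONSE_ABORT_JOB,     /* terminate the whole job ("all fall down") */
    RESPONSE_RESTART_PROC,  /* restart only the failed process */
    RESPONSE_RESTART_JOB    /* restart the whole job, e.g. from a checkpoint */
} fault_response_t;

/* A "component" is a name plus a callback invoked when a fault is seen. */
typedef struct {
    const char *name;
    fault_response_t (*proc_aborted)(int rank, int exit_code);
} errmgr_component_t;

/* Policy 1: any process failure aborts the entire job. */
static fault_response_t abort_job_on_fault(int rank, int exit_code)
{
    (void)rank;
    (void)exit_code;
    return RESPONSE_ABORT_JOB;
}

/* Policy 2: try to restart just the process that failed. */
static fault_response_t restart_proc_on_fault(int rank, int exit_code)
{
    (void)rank;
    (void)exit_code;
    return RESPONSE_RESTART_PROC;
}

/* Table of available components (hypothetical). */
static const errmgr_component_t components[] = {
    { "abort",        abort_job_on_fault },
    { "restart_proc", restart_proc_on_fault },
};

/* Select a component by name; unknown names fall back to "abort". */
const errmgr_component_t *errmgr_select(const char *name)
{
    size_t n = sizeof components / sizeof components[0];
    size_t i;

    assert(name != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(components[i].name, name) == 0) {
            return &components[i];
        }
    }
    return &components[0];
}

/* Single entry point a process or daemon calls on detecting a fault. */
fault_response_t errmgr_report_fault(const errmgr_component_t *component,
                                     int rank, int exit_code)
{
    return component->proc_aborted(rank, exit_code);
}
```

A caller would pick a policy once, for example `errmgr_select("restart_proc")`, and then route every detected fault through `errmgr_report_fault()`. Trying a new fault-response strategy then means adding one entry to the table rather than changing the code that detects the fault.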
>>>>
>>>> However, I now see that the fault tolerance mechanisms inside of
>>>> OMPI do not seem to be using that methodology. Instead, we have
>>>> hard-coded a particular response into the system.
>>>>
>>>> If we configure without FT, we just abort the entire job since
>>>> that is the only errmgr component that exists.
>>>>
>>>> If we configure with FT, then we execute the hard-coded C/R
>>>> methodology. This is built directly into the code, so there is no
>>>> option as to what happens.
>>>>
>>>> Is there a reason why the errmgr framework was not used? Did the
>>>> FT team decide that this was not a useful tool to support multiple
>>>> FT strategies? Can we modify it to better serve those needs, or is
>>>> it simply not feasible?
>>>>
>>>> If it isn't going to be used for that purpose, then I might as
>>>> well remove it. As things stand, there really is no purpose served
>>>> by the errmgr framework - might as well replace it with just a
>>>> function call.
>>>>
>>>> Appreciate any insights
>>>> Ralph
>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel