Thanks. I've read your (Joshua Hursey's) Ph.D. thesis on fault tolerance using checkpointing with much interest. I would also be interested in the range of possible user requirements for defining the behaviors in response to various faults.
On Fri, 2011-04-22 at 15:03 -0400, Joshua Hursey wrote:
On Apr 22, 2011, at 1:20 PM, N.M. Maclaren wrote:
> On Apr 22 2011, Ralph Castain wrote:
>> Several of us are. Josh and George (plus teammates), and some other outside folks, are working the MPI side of it.
>> I'm working only the ORTE side of the problem.
>> Quite a bit of capability is already in the trunk, but there is always more to do :-)
> Is there a specification of what objectives are covered by 'fault-tolerant'?
We do not really have a website to point folks to at the moment. Some of the existing functionality in, and planned functionality for, Open MPI has been announced and documented, but not uniformly or in a central place. We have a developers meeting in a couple of weeks, and this is a topic I am planning on covering:
Once something is available, we'll post to the users/developers lists so that people know where to look for developments.
I am responsible for two fault tolerance features in Open MPI: Checkpoint/Restart and the MPI Forum's Fault Tolerance Working Group proposals. The Checkpoint/Restart support is documented here:
Most of my attention is on the MPI Forum's Fault Tolerance Working Group proposals, which aim to enable fault-tolerant applications to be developed on top of MPI (so non-transparent fault tolerance). The Open MPI prototype is not yet publicly available, but will be soon. Information about the semantics and interfaces of that project can be found at the links below:
That is what I have been up to regarding fault tolerance. Others can probably elaborate on what they are working on if they wish.
> Nick Maclaren.
> devel mailing list