Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: CRS Module for MTCP Checkpointing Package
From: Alex Brick (bricka_at_[hidden])
Date: 2011-10-07 21:44:53


I'm a little unclear on this comment.

DMTCP currently supports checkpointing and restoring sockets over TCP, and we are actively working on InfiniBand support. However, we feel that it adds value to also work as an Open MPI module, where Open MPI handles all of the network communication and our module simply handles checkpointing the individual processes. This enables people to use our user-level checkpointing tools with other networks by using Open MPI.

What exactly is your question?

-- Alex

George Bosilca <bosilca_at_[hidden]> wrote:

>Way too much hand-waving here.
>
>When you say certain networks you mean TCP and potentially SM. However, I doubt even TCP can be fully supported. Not without the preconnect option … or a means to update the modex information.
>
> george.
>
>On Oct 7, 2011, at 14:56 , Josh Hursey wrote:
>
>> From what I have seen during development, this RFC integrates the MTCP
>> single process checkpointer into the C/R infrastructure of Open MPI.
>> The MTCP component of the DMTCP project can be used in isolation,
>> which is what they are integrating. So they can use DMTCP to
>> checkpoint/restart an unmodified Open MPI, but only over certain
>> networks. By integrating the MTCP checkpointer as a CRS component they
>> use Open MPI to coordinate across processes, and gain support for a
>> larger number of networks (e.g., IB, MX).
>>
>> Alex, does that sound about right?
>>
>> -- Josh
>>
>>
>> On Thu, Oct 6, 2011 at 4:33 PM, George Bosilca <bosilca_at_[hidden]> wrote:
>>> Alex,
>>>
>>> It looks like there is a mismatch between what you propose to achieve and the text in your RFC. You propose to add a new single-process checkpoint-restart mechanism (MTCP) to the ones already provided in Open MPI. However, most of the text in your RFC is about DMTCP, which is another layer on top of MTCP capable of checkpointing and restarting distributed applications.
>>>
>>> I would like to understand what this RFC is really about: MTCP or DMTCP?
>>>
>>> george.
>>>
>>> On Oct 6, 2011, at 02:58 , Alex Brick wrote:
>>>
>>>> WHAT: Bring in the mtcp CRS component
>>>>
>>>> WHY: Add support for the MTCP checkpoint/restart service
>>>>
>>>> WHERE: opal/mca/crs/mtcp
>>>>
>>>> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>>>>
>>>> -------------------------------------------
>>>> What is MTCP?
>>>>
>>>> DMTCP (Distributed MultiThreaded CheckPointing, http://dmtcp.sourceforge.net) is a mature open source (LGPL) checkpointing package that has been under development for seven years. It operates entirely in user space, with no kernel modules and no modifications to the target application. In the simplest case, it is used as follows:
>>>>
>>>> dmtcp_checkpoint ./a.out
>>>> dmtcp_command --checkpoint
>>>> dmtcp_restart ckpt_a.out_*.dmtcp
>>>>
>>>> DMTCP is contagious. Any calls to fork(), pthread_create(), or "ssh"
>>>> are recognized by DMTCP, which keeps the resulting threads and local and
>>>> remote processes under checkpoint control. At checkpoint time, it also
>>>> generates a script, dmtcp_restart_script.sh, that can restart a distributed computation. As a sign of its maturity, it can also checkpoint Open MPI "from on top":
>>>>
>>>> dmtcp_checkpoint mpirun hello_mpi
>>>>
>>>> The MTCP component of DMTCP is the single-process component. It is used
>>>> both internally by DMTCP and directly by users who are only interested in
>>>> checkpointing a single process. This second, standalone use was the basis for developing a module for the Open MPI checkpoint-restart service that is similar to the BLCR module, except that no kernel modules are required.
>>>>
>>>> DMTCP is currently a Debian package (Debian testing), and packages are also planned for Fedora and openSUSE. These packages also provide the MTCP component for Open MPI.
>>>>
>>>> -------------------------------------------
>>>> More details:
>>>>
>>>> Open MPI MTCP integration implementation available at:
>>>>
>>>> https://bitbucket.org/jsquyres/ompi-dmtcp2
>>>>
>>>> The DMTCP parent project website is below:
>>>>
>>>> http://dmtcp.sourceforge.net/
>>>>
>>>> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports user-level, transparent checkpoint/restart of a variety of sequential and parallel programs. In Open MPI terms, this contribution is an alternative to the BLCR CRS module, meaning that users can use DMTCP to checkpoint their applications instead of BLCR.
>>>>
>>>> The MTCP component is currently restricted to supporting communication over sockets and shared memory. In an effort to support a wider range of networks (e.g., InfiniBand, Myrinet), the DMTCP authors have created a CRS component to hook into Open MPI's checkpoint/restart infrastructure. The MTCP user-level checkpoint/restart service is the single-process checkpoint kernel of the DMTCP project. The MTCP kernel is what is used in the mtcp CRS component.
>>>>
>>>> Jeff Squyres and Josh Hursey have been working with the DMTCP authors (based at Northeastern University in Boston, MA, USA) for quite a while and feel that this component is ready to be brought into the Open MPI mainline for inclusion in the 1.7.x series (and possibly the 1.5.x series?). The authors have submitted OMPI 3rd party contribution agreements.
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey