
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: CRS Module for MTCP Checkpointing Package
From: George Bosilca (bosilca_at_[hidden])
Date: 2011-10-06 16:33:46


Alex,

It looks like there is a mismatch between what you propose to achieve and the text of your RFC. You propose to add a new single-process checkpoint/restart mechanism (MTCP) to the ones already provided in Open MPI. However, most of the text in your RFC is about DMTCP, which is another layer on top of MTCP, capable of checkpointing and restarting distributed applications.

I would like to understand what this RFC is really about: MTCP or DMTCP?

  george.

On Oct 6, 2011, at 02:58 , Alex Brick wrote:

> WHAT: Bring in the mtcp CRS component
>
> WHY: Add support for the MTCP checkpoint/restart service
>
> WHERE: opal/mca/crs/mtcp
>
> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>
> -------------------------------------------
> What is MTCP?
>
> DMTCP (Distributed MultiThreaded CheckPointing, http://dmtcp.sourceforge.net) is a mature open source (LGPL) checkpointing package that has been under development for seven years. It operates entirely in user space, with no kernel modules and no modifications to the target application. In the simplest case, it is used as follows:
>
> dmtcp_checkpoint ./a.out
> dmtcp_command --checkpoint
> dmtcp_restart ckpt_a.out_*.dmtcp
>
> DMTCP is contagious: any calls to fork(), pthread_create(), or "ssh" are recognized by DMTCP, which keeps the resulting threads and the local and remote processes under checkpoint control. At checkpoint time, it also generates a script, dmtcp_restart_script.sh, that can restart a distributed computation. As a sign of its maturity, it can also checkpoint Open MPI "from on top": dmtcp_checkpoint mpirun hello_mpi
>
> The MTCP component of DMTCP is the single-process component. It is used both internally by DMTCP and directly by users interested only in checkpointing a single process. This second mode of use was the basis for developing an Open MPI module for the Open MPI checkpoint-restart service, similar to the BLCR module, except that no kernel modules are required.
>
> DMTCP is currently a Debian package (Debian testing), and packages are also planned for Fedora and openSUSE. These packages also provide the MTCP component for Open MPI.
>
> -------------------------------------------
> More details:
>
> Open MPI MTCP integration implementation available at:
>
> https://bitbucket.org/jsquyres/ompi-dmtcp2
>
> The DMTCP parent project website is below:
>
> http://dmtcp.sourceforge.net/
>
> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports user-level, transparent checkpoint/restart of a variety of sequential and parallel programs. In Open MPI terms, this contribution is an alternative to the BLCR CRS module, meaning that users can use DMTCP to checkpoint their applications instead of BLCR.
>
> The MTCP component is currently restricted to supporting communication over sockets and shared memory. In an effort to support a wider range of networks (e.g., InfiniBand, Myrinet), the DMTCP authors have created a CRS component to hook into Open MPI's checkpoint/restart infrastructure. The MTCP user-level checkpoint/restart service is the single-process checkpoint kernel of the DMTCP project, and this MTCP kernel is what the mtcp CRS component uses.
>
> Jeff Squyres and Josh Hursey have been working with the DMTCP authors (based at Northeastern University in Boston, MA, USA) for quite a while and feel that this component is ready to be brought into the Open MPI main line for inclusion in the 1.7.x series (and possibly the 1.5.x series?). The authors have submitted OMPI 3rd party contribution agreements.