Open MPI User's Mailing List Archives

Subject: [OMPI users] DMTCP: Checkpoint-Restart solution for OpenMPI
From: Kapil Arya (kapil_at_[hidden])
Date: 2010-01-31 22:39:37


Hi All,

As of January 29, 2010, a new release (1.1.3) of DMTCP (Distributed
MultiThreaded CheckPointing) is available. Its web page is at
http://dmtcp.sourceforge.net/ . We (the developers of DMTCP) have
tried to test this version of DMTCP carefully on OpenMPI 1.4.1,
and we believe it to be working well. We would welcome feedback from
any OpenMPI users who would care to test it on their own applications.

The DMTCP package provides an alternative solution for
checkpoint-restart of OpenMPI computations. Using it is as simple as:
 # Launch the MPI job under DMTCP
 dmtcp_checkpoint mpirun ./hello_mpi
 # Manually checkpoint from any other terminal
 dmtcp_command --checkpoint
 # Run the generated restart script, which restarts from the checkpoint images
 ./dmtcp_restart_script.sh
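
For anyone who would rather script the steps than type them in two
terminals, the following is a minimal sketch, not something shipped with
DMTCP. It assumes the DMTCP tools and mpirun are on PATH and that
./hello_mpi runs long enough to be checkpointed. The function names
(require_cmds, checkpoint_demo) and the 30-second default wait are
illustrative assumptions only.

```shell
#!/bin/sh
# Sketch only: wraps the commands from this message in two helper functions.
# Function names and the default wait time are hypothetical, not part of DMTCP.

# Return 0 only if every named command is found on PATH;
# print one "missing: NAME" line to stderr for each one that is not.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Launch the MPI job under DMTCP in the background, wait a while,
# then request a checkpoint (the "other terminal" step, done in-script).
checkpoint_demo() {
    wait_secs=${1:-30}   # assumed delay before checkpointing
    require_cmds dmtcp_checkpoint dmtcp_command mpirun || return 1
    dmtcp_checkpoint mpirun ./hello_mpi &
    sleep "$wait_secs"
    dmtcp_command --checkpoint
}
```

After sourcing this file, "checkpoint_demo 60" would start the job and
request a checkpoint one minute later. Restarting is unchanged: run
./dmtcp_restart_script.sh from the directory where it was generated.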

DMTCP works by creating a separate, stateless checkpoint coordinator,
independent of OpenMPI's orterun. All OpenMPI processes are then
checkpointed, including orterun. At restart time, a new DMTCP
checkpoint coordinator can be used. DMTCP is transparent and runs
entirely in user space. It requires no modification to the MPI
application binary, to OpenMPI, or to the operating system kernel.

DMTCP also supports a dmtcpaware interface (application-initiated
checkpoints), and numerous other features. At this time, DMTCP
supports only the use of Ethernet (TCP/IP) and shared memory for
transport. We are looking at supporting the InfiniBand transport layer
in the future.

Finally, a bit of history. DMTCP began with a goal of checkpointing
distributed desktop applications. We recognize the fine
checkpoint-restart solution that already exists in OpenMPI:
the checkpoint-restart service on top of BLCR. We offer DMTCP as an
alternative for some unusual situations, such as when the end user
does not have privilege to add the BLCR kernel module. We are eager
to gain feedback from the OpenMPI community.

Thanks,
DMTCP Developers