As of January 29, 2010, we have produced a new release (1.1.3) of
DMTCP (Distributed MultiThreaded CheckPointing). Its web page is at
http://dmtcp.sourceforge.net/ . We (the developers of DMTCP) have
tried to test this version of DMTCP carefully with OpenMPI 1.4.1,
and we believe it to be working well. We would welcome feedback from
any OpenMPI users who would care to test it on their own applications.

The DMTCP package provides an alternative solution for
checkpoint-restart of OpenMPI computations. Using it is as simple as:

  dmtcp_checkpoint mpirun ./hello_mpi
  # Manually checkpoint from any other terminal
  # Run the generated restart script, which restarts from the ckpt images.
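
The three steps above can be sketched as a small dry-run script. Only the first command appears in this announcement. The `dmtcp_command --checkpoint` invocation and the `dmtcp_restart_script.sh` file name are assumptions about the DMTCP 1.x tools and should be checked against the installed release. The `run` helper and the `DRY_RUN` variable are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch of the three-step workflow: launch, checkpoint, restart.
# With DRY_RUN=1 (the default here) each command is only printed, so the
# script can be read and run on a machine without DMTCP or OpenMPI.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Launch the MPI job under DMTCP (command taken from the announcement).
#    In a real run this blocks until the job ends, so it would normally
#    be left running in its own terminal.
run dmtcp_checkpoint mpirun ./hello_mpi

# 2. From any other terminal, request a checkpoint.
#    Assumption: the dmtcp_command tool and its --checkpoint option.
run dmtcp_command --checkpoint

# 3. Restart from the generated checkpoint images.
#    Assumption: the restart script is named dmtcp_restart_script.sh and
#    was written to the current directory.
run ./dmtcp_restart_script.sh
```

Setting `DRY_RUN=0` would execute the commands instead of printing them. In that case steps 2 and 3 belong in a second terminal, since step 1 does not return while the job is running.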

DMTCP works by creating a separate, stateless checkpoint coordinator,
independent of OpenMPI's orterun. All OpenMPI processes are then
checkpointed, including orterun. At restart time, a new DMTCP
checkpoint coordinator can be used. DMTCP is transparent and runs
entirely in user space. It requires no modification to the MPI
application binary, to OpenMPI, or to the operating system kernel.
DMTCP also supports a dmtcpaware interface (application-initiated
checkpoints), and numerous other features. At this time, DMTCP
supports only the use of Ethernet (TCP/IP) and shared memory for
transport. We are looking at supporting the InfiniBand transport layer
in the future.

Finally, a bit of history. DMTCP began with a goal of checkpointing
distributed desktop applications. We recognize the fine
checkpoint-restart solution that already exists in OpenMPI: the
checkpoint-restart service on top of BLCR. We offer DMTCP as an
alternative for some unusual situations, such as when the end user
does not have the privilege to add the BLCR kernel module. We are eager
to gain feedback from the OpenMPI community.