Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] DMTCP: Checkpoint-Restart solution for OpenMPI
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-02-05 17:56:11


On Jan 31, 2010, at 10:39 PM, Kapil Arya wrote:

> DMTCP also supports a dmtcpaware interface (application-initiated
> checkpoints), and numerous other features. At this time, DMTCP
> supports only the use of Ethernet (TCP/IP) and shared memory for
> transport. We are looking at supporting the Infiniband transport layer
> in the future.

It sounds like you have taken a virtualized approach to intercepting network calls, etc. Is that right?

If so, it would be interesting to see what the performance impact is on some of the OS-bypass / high-performance networks. Previous efforts have taken big performance hits and run into interesting challenges (e.g., you can't know the state of the hardware NIC, even if you intercept all the function calls).

> Finally, a bit of history. DMTCP began with a goal of checkpointing
> distributed desktop applications. We recognize the fine
> checkpoint-restart solution that already exists in OpenMPI:
> checkpoint-restart service on top of BLCR. We offer DMTCP as an
> alternative for some unusual situations, such as when the end user
> does not have privilege to add the BLCR kernel module. We are eager
> to gain feedback from the OpenMPI community.

Have you looked at our plugin capabilities? BLCR is just a plugin to us -- we can support others. Is it worthwhile / possible to hook your technology in via Open MPI plugins? Josh did some great work to make it pretty extensible.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/