Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Checkpoint/restart question
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-08-26 08:44:41

I have not played with the Condor checkpoint/restart library in quite some time. Supporting it should be fairly straightforward, though the devil is always in the details with such things.

In Open MPI, all of the code to support checkpoint/restart services like BLCR or Condor is part of a framework in OPAL called CRS (for Checkpoint/Restart Service). To support a new checkpointer you will need to develop a new component under opal/mca/crs/. If you are (or someone you know is) interested in doing the development, you should be able to use the BLCR component as a guide through the details.

This integration would allow you to use all of Open MPI's current C/R infrastructure, just with the Condor C/R library capturing the per-process checkpoints. Storage and coordination are handled in other frameworks in the Open MPI environment, so you should not need to worry about them at this level.

If you have any questions let me know and I can try to help you navigate the code base.

-- Josh

On Aug 25, 2010, at 7:36 PM, Tomas Oppelstrup wrote:

> Hi,
> I have a question about checkpoint-restart operation with Open MPI. I
> hope this is an appropriate forum for my question.
> I do not have access to recompile the kernel or load kernel modules,
> so I would like to use the Condor checkpoint-restart library. Can
> that be made to work with Open MPI's checkpoint-restart
> infrastructure?
> The Condor library, upon receipt of a signal or a call to its checkpoint
> function from within the program, generates a file containing the
> complete (as complete as possible) state of the process, including
> the state of libraries, e.g. Open MPI. On restart, the process
> image/state is loaded into memory and execution is resumed at the
> checkpoint location.
> On restart, I assume that some information in the MPI state may need
> to be reinitialized, since e.g. the host names of the MPI
> processes and the PIDs of any support processes will have
> changed.
> Is this tricky to fix (that code must somehow be there for the BLCR
> compatibility)?
> Perhaps it can be achieved by (in violation of the MPI standard)
> calling MPI_Finalize before the checkpoint, and MPI_Init after
> restart? This seems like a conceptually appealing solution, but it may
> not be allowed, or may not do the correct thing, in Open MPI?
> Thanks for any ideas/help/pointers to more information!
> Tomas

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory