Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process
From: Adrian Reber (adrian_at_[hidden])
Date: 2014-02-18 12:30:02


On Tue, Feb 18, 2014 at 10:21:23AM -0600, Josh Hursey wrote:
> So when a process is restarted with CRIU, does it resume execution after
> the criu_dump() or somewhere else?

The process is resumed at the same point it was checkpointed with
criu_dump().

> In a continue/leave-running mode after checkpoint the MPI library does not
> need to do quite as much work since we can depend on some things not
> changing (such as the machine name, orted pid, ...).

During criu_dump() nothing changes.

> In a restart mode then the entire library has to be updated - much more
> expensive than the continue mode.

Ah. If I understand you correctly there are C/R methods which require
that the checkpointed process is terminated and needs to be restarted to
continue running. CRIU is completely transparent for the process. It
needs no special environment (LD_PRELOAD) nor any special handling.
criu_dump() pauses the process, checkpoints it and (if desired) lets it
continue in the same state it was before.
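
To make the possible outcomes of a dump concrete, here is a minimal sketch
in C of the three post-dump behaviours mentioned in this thread (dumped and
killed, left stopped, left running). The enum and function names are
hypothetical and exist only for illustration; they are not part of libcriu
or Open MPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only; not libcriu or Open MPI API. */
typedef enum {
    POST_DUMP_KILL,          /* default: process is dumped, then killed   */
    POST_DUMP_LEAVE_STOPPED, /* corresponds to criu's --leave-stopped     */
    POST_DUMP_LEAVE_RUNNING  /* corresponds to criu's --leave-running     */
} post_dump_mode_t;

/* True if the process still exists once the dump has finished. */
static bool survives_dump(post_dump_mode_t mode)
{
    switch (mode) {
    case POST_DUMP_LEAVE_STOPPED:
    case POST_DUMP_LEAVE_RUNNING:
        return true;
    case POST_DUMP_KILL:
        return false;
    }
    assert(!"unknown post-dump mode");
    return false;
}

/* True if the process keeps executing right after the dump, i.e. the
 * "continue" case discussed in this thread. */
static bool continues_after_dump(post_dump_mode_t mode)
{
    switch (mode) {
    case POST_DUMP_LEAVE_RUNNING:
        return true;
    case POST_DUMP_LEAVE_STOPPED:
    case POST_DUMP_KILL:
        return false;
    }
    assert(!"unknown post-dump mode");
    return false;
}
```

Only the leave-running mode lets the checkpointed process carry on without
any restart step, which is why it is the default chosen for the patch.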

> The CRS components that we have supported emerge from their checkpointing
> function (criu_dump in your case) knowing if they are in the continue or
> restart mode. So that CRS function sets the flag accordingly so the rest of
> the library can do the right thing afterwards.

So, I would say CRIU CRS is in continue mode after criu_dump().
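
A rough sketch of the flagging described above: the checkpoint function
calls the dump routine and then reports whether the process is continuing
or has just been restarted. Everything here is an assumption made for
illustration. The state names, the result codes and the function-pointer
interface are hypothetical, and the sketch does not claim that criu_dump()
reports a restore in this way; the stubs stand in for a real dump call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state flags; the names used in Open MPI may differ. */
typedef enum {
    CRS_STATE_CONTINUE, /* process keeps running after the checkpoint */
    CRS_STATE_RESTART,  /* process woke up from a snapshot image      */
    CRS_STATE_ERROR     /* the dump itself failed                     */
} crs_state_t;

/* Hypothetical result codes of a dump call (an assumption, not the
 * documented contract of any real checkpointer). */
enum { DUMP_FAILED = -1, DUMP_CONTINUED = 0, DUMP_RESTORED = 1 };

typedef int (*dump_fn_t)(void);

/* Run the dump and translate its result into the state the rest of the
 * library would act on. A restarted process resumes inside this
 * function, so this is where the restart case has to be detected. */
static crs_state_t sketch_crs_checkpoint(dump_fn_t dump)
{
    int rc;

    assert(dump != NULL);
    rc = dump();
    if (rc < 0) {
        return CRS_STATE_ERROR;
    }
    return (rc == DUMP_RESTORED) ? CRS_STATE_RESTART : CRS_STATE_CONTINUE;
}

/* Stubs standing in for a real dump call. */
static int stub_dump_continue(void) { return DUMP_CONTINUED; }
static int stub_dump_restored(void) { return DUMP_RESTORED; }
static int stub_dump_failed(void)   { return DUMP_FAILED; }
```

With a checkpoint-only implementation in leave-running mode, only the
continue branch would ever be taken; the restart branch matters once
restoring from an image is implemented.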

> The restart function is called by the opal_restart tool to restart the
> process from an image. Some checkpointers have a library call to restart a
> process; others use external tools to do so. So that interface just lets
> the checkpointer decide, given a snapshot image, how it should restart that
> process. The restarted process is assumed to wake up in the
> opal_crs_*_checkpoint function, not opal_crs_*_restart. So the restart
> function name can be a bit misleading.
>
> Does that help?

That helps a lot. Thanks. I am not 100% sure I understand the restart
case, but I will try to implement it and probably then I will understand
how it works.

Would you say that, for the checkpoint-only functionality in continue
mode, the patch can be checked in?

                Adrian

> On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber <adrian_at_[hidden]> wrote:
>
> > I think I do not understand your question. So far I have only implemented
> > the checkpoint part and not the restart part.
> >
> > Using criu_dump() the process can be left in three different
> > states. Without any special handling the process is dumped and then
> > killed. I can also tell criu to leave the process stopped (--leave-stopped)
> > or running (--leave-running). I decided to default to --leave-running so
> > that after the checkpoint has been performed the process continues
> > running where it stopped.
> >
> > What would be the difference between 'being restarted versus continuing
> > after checkpointing'? Right now only 'continuing after checkpoint' is
> > implemented. I do not understand how a process 'being restarted' fits
> > in the checkpoint function.
> >
> > In opal_crs_criu_checkpoint() I am using criu_dump() to
> > checkpoint the process and the plan is to use criu_restore() in
> > opal_crs_criu_restart() (which I have not yet implemented).
> >
> > On Mon, Feb 17, 2014 at 03:45:49PM -0600, Josh Hursey wrote:
> > > It looks fine except that the restart state is not flagged. When a process
> > > is restarted does it resume execution inside the criu_dump() function? If
> > > so, is there a way to tell from its return code (or some other mechanism)
> > > that it is being restarted versus continuing after checkpointing?
> > >
> > >
> > > On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> > >
> > > > Great - looks fine to me!!
> > > >
> > > >
> > > > On Feb 17, 2014, at 11:39 AM, Adrian Reber <adrian_at_[hidden]> wrote:
> > > >
> > > > > I have prepared a patch I would like to commit which adds the code to
> > > > > actually checkpoint a process. Thanks for the pointers about the string
> > > > > variables; I tried to implement it correctly.
> > > > >
> > > > > CRIU currently has problems with the new OOB usock but I will contact
> > > > > the CRIU developers about this error. Using tcp, checkpointing works.
> > > > >
> > > > > CRIU also has problems with --np > 1, but I am sure this can also be
> > > > > resolved.
> > > > >
> > > > > The patch is at:
> > > > >
> > > > >
> > > >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> > > > >
> > > > > Adrian
> > > > > _______________________________________________
> > > > > devel mailing list
> > > > > devel_at_[hidden]
> > > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel