Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2014-02-18 15:47:21


Yep. For the checkpoint/continue that patch looks good.

On Tue, Feb 18, 2014 at 11:30 AM, Adrian Reber <adrian_at_[hidden]> wrote:

> On Tue, Feb 18, 2014 at 10:21:23AM -0600, Josh Hursey wrote:
> > So when a process is restarted with CRIU, does it resume execution after
> > the criu_dump() or somewhere else?
>
> The process is resumed at the same point it was checkpointed with
> criu_dump().
>
> > In a continue/leave-running mode after checkpoint the MPI library does not
> > need to do quite as much work since we can depend on some things not
> > changing (such as the machine name, orted pid, ...).
>
> During criu_dump() nothing changes.
>
> > In a restart mode then the entire library has to be updated - much more
> > expensive than the continue mode.
>
> Ah. If I understand you correctly there are C/R methods which require
> that the checkpointed process is terminated and needs to be restarted to
> continue running. CRIU is completely transparent for the process. It
> needs no special environment (LD_PRELOAD) nor any special handling.
> criu_dump() pauses the process, checkpoints it and (if desired) lets it
> continue in the same state it was before.
>
> > The CRS components that we have supported emerge from their checkpointing
> > function (criu_dump in your case) knowing if they are in the continue or
> > restart mode. So that CRS function sets the flag accordingly so the rest of
> > the library can do the right thing afterwards.
>
> So, I would say CRIU CRS is in continue mode after criu_dump().
>
> > The restart function is called by the opal_restart tool to restart the
> > process from an image. Some checkpointers have a library call to restart a
> > process; others use external tools to do so. So that interface just lets
> > the checkpointer decide, given a snapshot image, how it should restart that
> > process. The restarted process is assumed to wake up in the
> > opal_crs_*_checkpoint function, not opal_crs_*_restart. So the restart
> > function name can be a bit misleading.
> >
> > Does that help?
>
> That helps a lot. Thanks. I am not 100% sure I understand the restart
> case, but I will try to implement it and probably then I will understand
> how it works.
>
> Would you say that, for the checkpoint-only functionality in continue
> mode, the patch can be checked in?
>
> Adrian
>
> > On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber <adrian_at_[hidden]> wrote:
> >
> > > I think I do not understand your question. So far I have only implemented
> > > the checkpoint part and not the restart part.
> > >
> > > Using criu_dump() the process can be left in three different
> > > states. Without any special handling the process is dumped and then
> > > killed. I can also tell criu to leave the process stopped (--leave-stopped)
> > > or running (--leave-running). I decided to default to --leave-running so
> > > that after the checkpoint has been performed the process continues
> > > running where it stopped.
> > >
> > > What would be the difference between 'being restarted versus continuing
> > > after checkpointing'? Right now only 'continuing after checkpoint' is
> > > implemented. I do not understand how a process 'being restarted' fits
> > > into the checkpoint function.
> > >
> > > In opal_crs_criu_checkpoint() I am using criu_dump() to
> > > checkpoint the process and the plan is to use criu_restore() in
> > > opal_crs_criu_restart() (which I have not yet implemented).
> > >
> > > On Mon, Feb 17, 2014 at 03:45:49PM -0600, Josh Hursey wrote:
> > > > It looks fine except that the restart state is not flagged. When a process
> > > > is restarted does it resume execution inside the criu_dump() function? If
> > > > so, is there a way to tell from its return code (or some other mechanism)
> > > > that it is being restarted versus continuing after checkpointing?
> > > >
> > > >
> > > > On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> > > >
> > > > > Great - looks fine to me!!
> > > > >
> > > > >
> > > > > On Feb 17, 2014, at 11:39 AM, Adrian Reber <adrian_at_[hidden]> wrote:
> > > > >
> > > > > > I have prepared a patch I would like to commit which adds the code to
> > > > > > actually checkpoint a process. Thanks for the pointers about the string
> > > > > > variables; I tried to implement it correctly.
> > > > > >
> > > > > > CRIU currently has problems with the new OOB usock but I will contact
> > > > > > the CRIU developers about this error. Using tcp, checkpointing works.
> > > > > >
> > > > > > CRIU also has problems with --np > 1, but I am sure this can also be
> > > > > > resolved.
> > > > > >
> > > > > > The patch is at:
> > > > > >
> > > > > > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> > > > > >
> > > > > > Adrian
> > > > > > _______________________________________________
> > > > > > devel mailing list
> > > > > > devel_at_[hidden]
> > > > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > >
>

-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey
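
The continue/restart distinction discussed in this thread can be sketched in C roughly as follows. This is a minimal sketch, not the Open MPI or CRIU code: `crs_state_t`, `crs_checkpoint()`, `needs_full_refresh()` and the `fake_dump_*` functions are hypothetical names, and the return convention of the stand-in dump call (0 = same process continuing, positive = woke up after a restore, negative = error) is an assumption made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state flag; the real OPAL CRS constants may differ. */
typedef enum {
    CRS_STATE_CONTINUE, /* same process keeps running after the dump */
    CRS_STATE_RESTART,  /* process was recreated from a snapshot image */
    CRS_STATE_ERROR     /* the dump itself failed */
} crs_state_t;

/* Stand-in for the checkpointer's dump call (assumed convention:
 * 0 = continuing, >0 = resumed after a restore, <0 = failure). */
typedef int (*dump_fn_t)(void);

/* Both the continuing and the restarted process "wake up" here, right
 * after the dump call returns, so this is where the flag gets set. */
crs_state_t crs_checkpoint(dump_fn_t dump)
{
    assert(dump != NULL);

    int rc = dump();
    if (rc < 0) {
        return CRS_STATE_ERROR;
    }
    return (rc > 0) ? CRS_STATE_RESTART : CRS_STATE_CONTINUE;
}

/* In continue mode, things like the machine name and orted pid are
 * unchanged, so only the restart case needs the expensive full update. */
bool needs_full_refresh(crs_state_t state)
{
    return state == CRS_STATE_RESTART;
}

/* Fake dump calls used to exercise the three outcomes. */
int fake_dump_continue(void) { return 0; }
int fake_dump_restored(void) { return 1; }
int fake_dump_failed(void)   { return -1; }
```

With a checkpointer that is only ever used in leave-running mode, as in the patch discussed above, the checkpoint function would always report the continue state; the restart branch matters once a restore path exists.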