Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] new CRS component added (criu)
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-02-07 17:08:48


Sweet -- +1 for CRIU support!

FWIW, I see you modeled your configure.m4 off the blcr configure.m4, but I'd actually go with making it a bit simpler. For example, I typically structure my configure.m4's like this (typed in mail client -- forgive mistakes...):

-----
   AS_IF([...some test....], [crs_criu_happy=1], [crs_criu_happy=0])
   # Only bother doing the next test if the previous one passed
   AS_IF([test $crs_criu_happy -eq 1 && ...next test....],
         [crs_criu_happy=1], [crs_criu_happy=0])
   # Only bother doing the next test if the previous one passed
   AS_IF([test $crs_criu_happy -eq 1 && ...next test....],
         [crs_criu_happy=1], [crs_criu_happy=0])

   ...etc...

   # Put a single execution of $2 and $3 at the end, depending on how the
   # above tests go. If a human asked for criu (e.g., --with-criu) and
   # we can't find criu support, that's a fatal error.
   AS_IF([test $crs_criu_happy -eq 1],
         [$2],
         [AS_IF([test "x$with_criu" != "x" && test "x$with_criu" != "xno"],
                [AC_MSG_WARN([You asked for CRIU support, but I can't find it.])
                 AC_MSG_ERROR([Cannot continue])],
                [$1])
          ])
-----
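
To make the "...some test..." placeholders concrete, here is a rough sketch of how the first couple of checks might look for criu. The header path (criu/criu.h) and the library name (criu) are assumptions on my part rather than something taken from the patch; criu_init_opts is simply the function that shows up in the log output quoted below. Handling of --with-criu=DIR is left out entirely.

```m4
# Sketch only: header path and library name are assumed and not
# taken from the actual patch.
crs_criu_happy=1

# If a human said --without-criu then skip every test
AS_IF([test "x$with_criu" = "xno"], [crs_criu_happy=0])

# Only look for the header if nothing has failed so far
AS_IF([test $crs_criu_happy -eq 1],
      [AC_CHECK_HEADER([criu/criu.h],
                       [crs_criu_happy=1], [crs_criu_happy=0])])

# Only look for the library if the header check passed
AS_IF([test $crs_criu_happy -eq 1],
      [AC_CHECK_LIB([criu], [criu_init_opts],
                    [crs_criu_happy=1], [crs_criu_happy=0])])
```

Note that passing an explicit action-if-found to AC_CHECK_LIB suppresses its default behavior of prepending -lcriu to LIBS, so the component would still need to set its own link flags. The closing AS_IF that dispatches on crs_criu_happy stays as in the skeleton above.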

I note you have a stray $3 at the end of your configure.m4, too (it might be supposed to be $2?).

Finally, I note you're looking for libcriu. Last time I checked with the CRIU guys -- which was quite a while ago -- that didn't exist (but I put in my $0.02 that OMPI would like to see such a userspace library). I take it that libcriu now exists?

On Feb 7, 2014, at 4:46 PM, Adrian Reber <adrian_at_[hidden]> wrote:

> I have created a new CRS component using criu (criu.org) to support
> checkpoint/restart in Open MPI. My current patch only provides the
> framework and necessary configure scripts to detect and link against
> criu. With this patch orte-checkpoint can request a checkpoint and the
> new CRIU CRS component is used:
>
> [dcbz:13766] orte_cr: init: orte_cr_init()
> [dcbz:13766] crs:criu: opal_crs_criu_prelaunch
> [dcbz:13766] crs:criu: opal_crs_criu_prelaunch
> [dcbz:13771] opal_cr: init: Verbose Level: 30
> [dcbz:13771] opal_cr: init: FT Enabled: true
> [dcbz:13771] opal_cr: init: Is a tool program: false
> [dcbz:13771] opal_cr: init: Debug SIGPIPE: 30 (False)
> [dcbz:13771] opal_cr: init: Checkpoint Signal: 10
> [dcbz:13771] opal_cr: init: FT Use thread: true
> [dcbz:13771] opal_cr: init: FT thread sleep: check = 0, wait = 100
> [dcbz:13771] opal_cr: init: C/R Debugging Enabled [False]
> [dcbz:13771] opal_cr: init: Checkpoint Signal (Debug): 20
> [dcbz:13771] opal_cr: init: Temp Directory: /tmp
> ...
> [dcbz:13772] orte_cr: coord: orte_cr_coord(Checkpoint)
> [dcbz:13772] orte_cr: coord_pre_ckpt: orte_cr_coord_pre_ckpt()
> [dcbz:13772] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
> [dcbz:13772] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
> [dcbz:13772] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
> [dcbz:13772] crs:criu: checkpoint(13772, ---)
> [dcbz:13772] crs:criu: criu_init_opts() returned 0
> [dcbz:13771] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
> [dcbz:13771] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
> [dcbz:13771] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
> [dcbz:13771] crs:criu: checkpoint(13771, ---)
> [dcbz:13771] crs:criu: criu_init_opts() returned 0
> ...
> [dcbz:13766] 13766: Checkpoint established for process [55729,0].
> [dcbz:13771] ompi_cr: coord: ompi_cr_coord(Running)
> [dcbz:13771] orte_cr: coord: orte_cr_coord(Running)
> [dcbz:13766] 13766: Successfully restarted process [55729,0].
> [dcbz:13772] ompi_cr: coord: ompi_cr_coord(Running)
> [dcbz:13772] orte_cr: coord: orte_cr_coord(Running)
>
> It seems the C/R code basically works again and now needs to be filled
> with the actual code to take checkpoints using criu.
>
> The patch I want to check in is available at:
>
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=7e0c7c940705cc572242097ff53f9e0ee6db11ea
>
> The patch only creates files in opal/mca/crs/criu and does not touch any
> other code.
>
> Adrian

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/