
Open MPI Development Mailing List Archives


Subject: [OMPI devel] new CRS component added (criu)
From: Adrian Reber (adrian_at_[hidden])
Date: 2014-02-07 16:46:07


I have created a new CRS component that uses criu (criu.org) to support
checkpoint/restart in Open MPI. The current patch provides only the
framework and the configure scripts needed to detect and link against
criu. With this patch, orte-checkpoint can request a checkpoint, and the
new CRIU CRS component is used:

[dcbz:13766] orte_cr: init: orte_cr_init()
[dcbz:13766] crs:criu: opal_crs_criu_prelaunch
[dcbz:13766] crs:criu: opal_crs_criu_prelaunch
[dcbz:13771] opal_cr: init: Verbose Level: 30
[dcbz:13771] opal_cr: init: FT Enabled: true
[dcbz:13771] opal_cr: init: Is a tool program: false
[dcbz:13771] opal_cr: init: Debug SIGPIPE: 30 (False)
[dcbz:13771] opal_cr: init: Checkpoint Signal: 10
[dcbz:13771] opal_cr: init: FT Use thread: true
[dcbz:13771] opal_cr: init: FT thread sleep: check = 0, wait = 100
[dcbz:13771] opal_cr: init: C/R Debugging Enabled [False]
[dcbz:13771] opal_cr: init: Checkpoint Signal (Debug): 20
[dcbz:13771] opal_cr: init: Temp Directory: /tmp
...
[dcbz:13772] orte_cr: coord: orte_cr_coord(Checkpoint)
[dcbz:13772] orte_cr: coord_pre_ckpt: orte_cr_coord_pre_ckpt()
[dcbz:13772] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
[dcbz:13772] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
[dcbz:13772] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
[dcbz:13772] crs:criu: checkpoint(13772, ---)
[dcbz:13772] crs:criu: criu_init_opts() returned 0
[dcbz:13771] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
[dcbz:13771] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
[dcbz:13771] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
[dcbz:13771] crs:criu: checkpoint(13771, ---)
[dcbz:13771] crs:criu: criu_init_opts() returned 0
...
[dcbz:13766] 13766: Checkpoint established for process [55729,0].
[dcbz:13771] ompi_cr: coord: ompi_cr_coord(Running)
[dcbz:13771] orte_cr: coord: orte_cr_coord(Running)
[dcbz:13766] 13766: Successfully restarted process [55729,0].
[dcbz:13772] ompi_cr: coord: ompi_cr_coord(Running)
[dcbz:13772] orte_cr: coord: orte_cr_coord(Running)

It seems the C/R code basically works again. The new component now needs
to be filled in with the actual code to take checkpoints using criu.
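
As a rough illustration of what "framework only" means here, the sketch
below shows a component that exposes prelaunch and checkpoint entry points
through a function table, logs each call, and leaves the real work as a
placeholder. All names (sketch_crs_module_t, sketch_criu_prelaunch,
sketch_criu_checkpoint, sketch_take_checkpoint) are hypothetical and are
not the actual OPAL CRS interface or the contents of the patch; no libcriu
calls are made.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified module table. This is NOT the real OPAL CRS
 * module structure; it only illustrates dispatch through a component. */
typedef struct {
    const char *name;
    int (*prelaunch)(int rank);
    int (*checkpoint)(int pid, char *state, size_t state_len);
} sketch_crs_module_t;

/* Called before the application is launched; the stub only logs. */
static int sketch_criu_prelaunch(int rank)
{
    (void)rank;
    fprintf(stderr, "crs:criu: prelaunch\n");
    return 0;
}

/* Called when a checkpoint is requested; the stub logs the request and
 * records a marker instead of dumping the process. */
static int sketch_criu_checkpoint(int pid, char *state, size_t state_len)
{
    assert(state != NULL && state_len > 0);
    fprintf(stderr, "crs:criu: checkpoint(%d, ---)\n", pid);
    /* Placeholder: a real component would ask criu to dump `pid` here. */
    snprintf(state, state_len, "stub");
    return 0;
}

/* The component instance that a (hypothetical) framework would select. */
static const sketch_crs_module_t sketch_criu_module = {
    "criu",
    sketch_criu_prelaunch,
    sketch_criu_checkpoint
};

/* The caller only sees the table, not the component behind it. */
static int sketch_take_checkpoint(const sketch_crs_module_t *m, int pid,
                                  char *state, size_t state_len)
{
    if (m == NULL || m->checkpoint == NULL) {
        return -1;
    }
    return m->checkpoint(pid, state, state_len);
}
```

With such a table in place, the checkpoint request travels through the
whole C/R path and reaches the component, which is roughly what the log
above shows; only the body of the checkpoint function would still need
to be replaced with real criu calls.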

The patch I want to check in is available at:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=7e0c7c940705cc572242097ff53f9e0ee6db11ea

The patch only creates files in opal/mca/crs/criu and does not touch any
other code.

                Adrian