Open MPI Development Mailing List Archives

Subject: [OMPI devel] RFC: Checkpoint/Restart Advancements and Bug Fixes
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-07-31 12:51:03


WHAT:
Checkpoint/Restart-based automatic recovery and process migration, advanced checkpoint storage, C/R-enabled debugging, MPI Extension API for C/R, and some bug fixes.

WHY:
This commit includes a variety of checkpoint/restart advancements that have been pending on a temporary branch for a long time. Users have been waiting on many of these bug fixes and features. More details below.

WHERE:
  http://bitbucket.org/jjhursey/ompi-cr-recos
Last sync'ed to trunk in r23536 (July 31, 2010)

WHEN:
Move into the trunk in the next two weeks. Then into the 1.5 series with the ORTE refresh (Ticket #2471).

TIMEOUT:
Aug 10, 2010 @ teleconf (commit at COB)

DOCUMENTATION:
The following public site will be fully updated upon commit:
  http://osl.iu.edu/research/ft
Temporary documentation site (will be taken down upon commit):
  http://osl.iu.edu/~jjhursey/research/ft-www-preview
Man page documentation will be updated soon.

----------------------------------------------------------------------------
The changes may seem large, but they are isolated to the C/R components and frameworks except where they are wired into the infrastructure.

This commit brings in a variety of pending features and bug fixes that have been accumulating over the past 8-12 months. Highlights are below (full change log at bottom):
 * Added C/R-enabled Debugging Support
 * Added a Stable Storage framework for advanced checkpoint storage techniques
 * Added checkpoint caching and compression support
 * Added two C/R-based recovery policies
   * C/R-based Process Migration (API and ompi-migrate tool activated)
   * C/R-based Automatic Recovery
 * Added a variety of C/R MPI Extensions functions (e.g., Checkpoint, Restart, Migrate)
 * Added C/R progress meters to File Movement (FileM), Stable Storage (SStore), and Snapshot Coordination (SnapC) frameworks

While this RFC is pending I plan to clean up the man page documentation for these features and update copyrights in the code base.

Change Log:
-----------
Major Changes:
--------------
 * Added C/R-enabled Debugging support.
   Enabled with the --enable-crdebug flag. See the following website for more information:
   http://osl.iu.edu/research/ft/crdebug/
 * Added Stable Storage (SStore) framework for checkpoint storage
   * 'central' component does a direct to central storage save
   * 'stage' component stages checkpoints to central storage while the application continues execution.
     * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
     * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
 * Added Compression (compress) framework to support checkpoint compression
 * Added two new ErrMgr recovery policies
   * {{{crmig}}} C/R Process Migration
   * {{{autor}}} C/R Automatic Recovery
 * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
 * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
   * {{{OMPI_CR_Checkpoint}}} (Fixes #2342)
   * {{{OMPI_CR_Restart}}}
   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
   * {{{OMPI_CR_INC_register_callback}}} (Fixes #2192)
   * {{{OMPI_CR_Quiesce_start}}}
   * {{{OMPI_CR_Quiesce_checkpoint}}}
   * {{{OMPI_CR_Quiesce_end}}}
   * {{{OMPI_CR_self_register_checkpoint_callback}}}
   * {{{OMPI_CR_self_register_restart_callback}}}
   * {{{OMPI_CR_self_register_continue_callback}}}
 * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
 * Added progress meters to:
   * FileM rsh (filem_rsh_process_meter)
   * SnapC full (snapc_full_progress_meter)
   * SStore stage (sstore_stage_progress_meter)
 * Added 2 new command line options to ompi-restart
   * --showme : Display the full command line that would have been exec'ed.
   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes #2413)
 * Deprecated some MCA params:
   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
   * snapc_base_store_in_place deprecated, replaced with different components of SStore
   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
   * snapc_base_establish_global_snapshot_dir deprecated, never well supported
   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
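The deprecations above amount to a rename table. As a minimal sketch only (not part of Open MPI), the following Python helper shows how a site script might translate an old set of MCA parameters. The function name `migrate_mca_params` and the dict-based input are hypothetical, and the sketch assumes each value carries over unchanged to the new parameter, which the change log does not confirm.

```python
# Deprecated MCA parameter -> replacement, copied from the change log above.
# None means the change log names no single replacement parameter.
DEPRECATED_MCA_PARAMS = {
    "crs_base_snapshot_dir": "sstore_stage_local_snapshot_dir",
    "snapc_base_global_snapshot_dir": "sstore_base_global_snapshot_dir",
    "snapc_base_global_shared": "sstore_stage_global_is_shared",
    "snapc_base_store_in_place": None,  # replaced by choice of SStore component
    "snapc_base_global_snapshot_ref": "sstore_base_global_snapshot_ref",
    "snapc_base_establish_global_snapshot_dir": None,  # never well supported
    "snapc_full_skip_filem": "sstore_stage_skip_filem",
}


def migrate_mca_params(params):
    """Hypothetical helper: return (updated_params, warnings).

    Assumption: a deprecated parameter's value can be reused as-is under
    its new name. Explicitly set new-style parameters always win.
    """
    # First pass: keep everything that is not deprecated.
    updated = {k: v for k, v in params.items() if k not in DEPRECATED_MCA_PARAMS}
    warnings = []
    # Second pass: translate deprecated names without overriding explicit values.
    for name, value in params.items():
        if name not in DEPRECATED_MCA_PARAMS:
            continue
        replacement = DEPRECATED_MCA_PARAMS[name]
        if replacement is None:
            warnings.append(f"{name} is deprecated with no direct replacement; dropped")
        else:
            updated.setdefault(replacement, value)
            warnings.append(f"{name} is deprecated; use {replacement}")
    return updated, warnings
```

For example, passing `{"crs_base_snapshot_dir": "/tmp/ckpt"}` yields a dict keyed by `sstore_stage_local_snapshot_dir` plus one warning. Parameters mapped to `None` are dropped with a warning, since the change log gives no one-to-one substitute for them.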

Minor Changes:
--------------
 * Fixes #1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
 * Fixes #2097 : {{{ompi-info}}} should now report all available CRS components
 * Fixes #2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
 * Fixes #2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
 * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
 * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
 * Cleanup the CRS framework and components to work with the SStore framework.
 * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
 * Add 'quiesce' hook to CRCP for a future enhancement.
 * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
 * Add optional application level INC callbacks (registered through the CR MPI Ext interface).
 * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
 * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
 * {{{opal-restart}}} also supports local decompression before restarting.
 * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
 * {{{orte-restart}}} now uses the SStore framework to work with the metadata
 * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
 * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
 * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
 * Make sure to decrement 'num_local_procs' in the orted when a local process goes away.
 * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
 * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
 * Improve the checks for 'already checkpointing' error path.
 * Add a recovery output timer to show how long it takes to restart a job.
 * Do a better job of cleaning up the old session directory on restart.
 * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (which does not work for MPI apps at the moment).
 * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.
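
The {{{opal-restart}}} item above describes a lookup order: check a local cache first when asked, then fall back on stable storage. The Python sketch below illustrates that ordering only and is not the actual implementation. The function `locate_snapshot`, its arguments, and the directory layout (one directory per snapshot handle) are all assumptions made for the example.

```python
from pathlib import Path


def locate_snapshot(handle, cache_dir, stable_dir, use_cache=True):
    """Hypothetical lookup: prefer a cached copy, else use stable storage.

    Assumption: each snapshot lives in a directory named after its handle
    under either the cache directory or the stable storage directory.
    """
    candidates = []
    if use_cache:
        # Only consult the cache when the caller asks for it.
        candidates.append(Path(cache_dir) / handle)
    candidates.append(Path(stable_dir) / handle)

    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"no snapshot directory found for handle {handle!r}")
```

Keeping the cache lookup optional mirrors the "when asked" wording: with `use_cache=False` the sketch goes straight to stable storage.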