
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: Checkpoint/Restart Advancements and Bug Fixes
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-08-10 16:52:47

Committed in r23587


On Jul 31, 2010, at 12:51 PM, Joshua Hursey wrote:

> Checkpoint/Restart-based automatic recovery and process migration, advanced checkpoint storage, C/R-enabled debugging, MPI Extension API for C/R, and some bug fixes.
> WHY:
> This commit includes a variety of checkpoint/restart advancements that have been pending on a temporary branch for a long while. Users have been waiting on many of these bug fixes and advancements. More details below.
> Last sync'ed to trunk in r23536 (July 31, 2010)
> Move into the trunk in the next two weeks. Then into the 1.5 series with the ORTE refresh (Ticket #2471).
> Aug 10, 2010 @ teleconf (commit at COB)
> Following public site will be fully updated upon commit:
> Temporary documentation site (will be taken down upon commit):
> Man page documentation will be updated soon.
> ----------------------------------------------------------------------------
> The changes may seem large, but they are isolated to the C/R components and frameworks except where they are wired into the infrastructure.
> This commit brings in a variety of pending features and bug fixes that have been accumulating over the past 8-12 months. Highlights are below (full change log at bottom):
> * Added C/R-enabled Debugging Support
> * Added a Stable Storage framework for advanced checkpoint storage techniques
> * Added checkpoint caching and compression support
> * Added two C/R-based recovery policies
>   * C/R-based Process Migration (API and ompi-migrate tool activated)
>   * C/R-based Automatic Recovery
> * Added a variety of C/R MPI Extensions functions (e.g., Checkpoint, Restart, Migrate)
> * Added C/R progress meters to File Movement (FileM), Stable Storage (SStore), and Snapshot Coordination (SnapC) frameworks
> While this RFC is pending I plan to clean up the man page documentation for these features and update copyrights in the code base.
> Change Log:
> -----------
> Major Changes:
> --------------
> * Added C/R-enabled Debugging support.
> Enabled with the --enable-crdebug flag. See the following website for more information:
> * Added Stable Storage (SStore) framework for checkpoint storage
>   * 'central' component does a direct to central storage save
>   * 'stage' component stages checkpoints to central storage while the application continues execution.
>   * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
>   * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
> * Added Compression (compress) framework to support
> * Add two new ErrMgr recovery policies
>   * {{{crmig}}} C/R Process Migration
>   * {{{autor}}} C/R Automatic Recovery
> * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
> * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
>   * {{{OMPI_CR_Checkpoint}}} (Fixes #2342)
>   * {{{OMPI_CR_Restart}}}
>   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
>   * {{{OMPI_CR_INC_register_callback}}} (Fixes #2192)
>   * {{{OMPI_CR_Quiesce_start}}}
>   * {{{OMPI_CR_Quiesce_checkpoint}}}
>   * {{{OMPI_CR_Quiesce_end}}}
>   * {{{OMPI_CR_self_register_checkpoint_callback}}}
>   * {{{OMPI_CR_self_register_restart_callback}}}
>   * {{{OMPI_CR_self_register_continue_callback}}}
> * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
> * Add a progress meter to:
>   * FileM rsh (filem_rsh_process_meter)
>   * SnapC full (snapc_full_progress_meter)
>   * SStore stage (sstore_stage_progress_meter)
> * Added 2 new command line options to ompi-restart
>   * --showme : Display the full command line that would have been exec'ed.
>   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes #2413)
> * Deprecated some MCA params:
>   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
>   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
>   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
>   * snapc_base_store_in_place deprecated, replaced with different components of SStore
>   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
>   * snapc_base_establish_global_snapshot_dir deprecated, never well supported
>   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
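
For anyone updating old parameter files, the deprecation list above can be read as a simple lookup table. The following is only a sketch of that idea, not code from the commit: the struct and the function `lookup_deprecated_param` are hypothetical names invented for illustration, and only the parameter names themselves come from the change log. A NULL replacement marks the two entries for which the change log names no single successor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: one row per deprecated MCA parameter named
 * in the change log. new_name is NULL when the change log gives no
 * single replacement parameter. */
struct mca_deprecation {
    const char *old_name;
    const char *new_name;
};

static const struct mca_deprecation deprecations[] = {
    { "crs_base_snapshot_dir",                    "sstore_stage_local_snapshot_dir" },
    { "snapc_base_global_snapshot_dir",           "sstore_base_global_snapshot_dir" },
    { "snapc_base_global_shared",                 "sstore_stage_global_is_shared" },
    { "snapc_base_store_in_place",                NULL }, /* now a choice of SStore component */
    { "snapc_base_global_snapshot_ref",           "sstore_base_global_snapshot_ref" },
    { "snapc_base_establish_global_snapshot_dir", NULL }, /* never well supported */
    { "snapc_full_skip_filem",                    "sstore_stage_skip_filem" },
};

/* Returns 1 if `name` is one of the deprecated parameters, 0 otherwise.
 * On return, *replacement holds the successor name, or NULL if there is
 * none (or if `name` is not deprecated at all). */
int lookup_deprecated_param(const char *name, const char **replacement)
{
    assert(name != NULL && replacement != NULL);
    size_t n = sizeof(deprecations) / sizeof(deprecations[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, deprecations[i].old_name) == 0) {
            *replacement = deprecations[i].new_name;
            return 1;
        }
    }
    *replacement = NULL;
    return 0;
}
```

A script that rewrites parameter files could call this for each key and warn whenever it returns 1 with a NULL replacement, since those two cases need a manual decision.
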
> Minor Changes:
> --------------
> * Fixes #1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
> * Fixes #2097 : {{{ompi-info}}} should now report all available CRS components
> * Fixes #2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
> * Fixes #2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
> * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
> * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
> * Cleanup the CRS framework and components to work with the SStore framework.
> * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
> * Add 'quiesce' hook to CRCP for a future enhancement.
> * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
> * Add optional application level INC callbacks (registered through the CR MPI Ext interface).
> * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
> * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
> * {{{opal-restart}}} also supports local decompression before restarting
> * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
> * {{{orte-restart}}} now uses the SStore framework to work with the metadata
> * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
> * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
> * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
> * Make sure to decrement 'num_local_procs' in the orted when a local process goes away.
> * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
> * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
> * Improve the checks for 'already checkpointing' error path.
> * Add a recovery output timer to show how long it takes to restart a job.
> * Do a better job of cleaning up the old session directory on restart.
> * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (which does not work for MPI apps at the moment).
> * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.
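
As an illustration of the #2208 item above (honoring temporary-directory variables rather than forcing /tmp), here is a minimal sketch of the general pattern. It is not the code from the commit. The function names `first_usable_dir` and `tmp_base_dir` are hypothetical, and the set and order of environment variables consulted (TMPDIR, then TEMP, then TMP) is an assumption, since the change log only says "various TMPDIR variables".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: return the first candidate that is set and
 * non-empty, falling back to "/tmp" when none qualifies. */
const char *first_usable_dir(const char *const candidates[], size_t n)
{
    assert(candidates != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (candidates[i] != NULL && candidates[i][0] != '\0') {
            return candidates[i];
        }
    }
    return "/tmp";
}

/* Hypothetical wrapper. The variable names and their order are an
 * assumption made for this sketch, not taken from the change log. */
const char *tmp_base_dir(void)
{
    const char *candidates[] = {
        getenv("TMPDIR"),
        getenv("TEMP"),
        getenv("TMP"),
    };
    return first_usable_dir(candidates,
                            sizeof(candidates) / sizeof(candidates[0]));
}
```

Keeping the selection logic separate from the getenv() calls makes the fallback order easy to check without touching the process environment.
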