WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery capabilities
WHERE: Impacts a number of ORTE files and a small number of OMPI files
TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 27-28
REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
Josh and Ralph have been working on a private branch off of the trunk on extended fault recovery procedures, mostly impacting ORTE. The new code optionally allows ORTE to recover from failed nodes, moving processes to other nodes in order to maintain operation. In addition, the code provides better support for recovering from individual process failures.
Not all of the work done on the private branch will be brought over in this commit. Some of the MPI-specific code that allows recovery from process failure on-the-fly will be committed separately at a later date.
This commit will include the infrastructure to support those advanced recovery operations. Among other things, this commit will introduce a new "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies for responding to failures. The default module, called "ignore", will stabilize the runtime environment for other RecoS components. In the absence of other RecoS components, it will trigger the default behavior (aborting the job).
This branch includes some configure modifications that allow a comma-separated list of options to be passed to the '--with-ft' option. This allows us to enable any combination of 'cr' and 'recos' at build time, specifically so that the RecoS functionality can be enabled independently of the C/R functionality. Most of the changes outside of the ORTE layer are due to symbol cleanup resulting from this modification.
For example, C/R specific code paths were previously incorrectly marked with:
#if OPAL_ENABLE_FT == 1
Where appropriate, they are now marked with:
#if OPAL_ENABLE_FT_CR == 1
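To illustrate the distinction between the two guards, here is a minimal, self-contained C sketch. The fallback macro values and the two helper functions are assumptions made purely for illustration: in a real build the macros are set by configure, and these helpers are not part of the OMPI tree.

```c
/* Fallback values so the sketch compiles on its own (assumption: in a real
 * build, configure defines these based on the '--with-ft' option). */
#ifndef OPAL_ENABLE_FT
#define OPAL_ENABLE_FT 1      /* some fault tolerance support requested */
#endif
#ifndef OPAL_ENABLE_FT_CR
#define OPAL_ENABLE_FT_CR 0   /* C/R specifically NOT requested */
#endif

/* Hypothetical helper: 1 if any fault tolerance support is compiled in. */
int ft_compiled_in(void)
{
#if OPAL_ENABLE_FT == 1
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 only if the C/R-specific code paths are compiled in.
 * Guarding this with OPAL_ENABLE_FT instead would wrongly pull in C/R code
 * whenever any other fault tolerance option is enabled. */
int cr_compiled_in(void)
{
#if OPAL_ENABLE_FT_CR == 1
    return 1;
#else
    return 0;
#endif
}
```

Compiled as-is, ft_compiled_in() returns 1 and cr_compiled_in() returns 0, which mimics a build where fault tolerance is enabled without C/R. Compiling with -DOPAL_ENABLE_FT_CR=1 flips the second result.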
Additionally, the configure.m4 files of C/R-specific components have been modified to change:
AS_IF([test "$ompi_want_ft" = "0"],
to:
AS_IF([test "$ompi_want_ft_cr" = "0"],
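The meaning of the comma-separated '--with-ft' list described above can be sketched in a few lines of C. This is not the configure logic itself, which lives in m4/shell. The struct, the function name, and the decision to reject unknown tokens are all assumptions made for illustration.

```c
#include <string.h>

/* Hypothetical result type: which fault tolerance features were requested. */
struct ft_options {
    int cr;     /* 1 if "cr" appeared in the list */
    int recos;  /* 1 if "recos" appeared in the list */
};

/* Parse a comma-separated list such as "cr,recos".
 * Returns 0 on success, or -1 if the list is too long or contains a token
 * other than "cr" or "recos" (an assumption of this sketch). */
int parse_ft_list(const char *list, struct ft_options *out)
{
    char buf[64];
    char *tok;

    out->cr = 0;
    out->recos = 0;

    if (strlen(list) >= sizeof(buf)) {
        return -1;
    }
    strcpy(buf, list);  /* strtok modifies its argument, so work on a copy */

    for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        if (strcmp(tok, "cr") == 0) {
            out->cr = 1;
        } else if (strcmp(tok, "recos") == 0) {
            out->recos = 1;
        } else {
            return -1;
        }
    }
    return 0;
}
```

With this sketch, "recos" alone sets only the recos flag, while "cr,recos" sets both. That independence is the point of the configure change.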
We have created a public repo (reference branch, above) with the code to be merged into the trunk. Please feel free to check it out and test it.
NOTE: the new recovery capability is only active if...
(a) you configure --with-ft=recos, and
(b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
Comments, suggestions, and corrections are welcome!