We have some comments about the API of the new framework, mostly requests for clarification so that we can better understand how it is supposed to be used. We would also like to request a deadline extension, delaying the merge of the Recos branch into the trunk by one week.
We have our own FT branch, which takes a totally different approach from the one described in your RFC. Unfortunately, it diverged from the trunk about a year ago, and merging back has proven to be quite a difficult task. Some of the functionality in the Recos framework is clearly beneficial for what we did, and has the potential to facilitate porting most of the features from our branch back into the trunk. We would like the deadline extension in order to analyze in depth the impact of the Recos framework on our work, and to see how we can fit everything together back in the trunk of Open MPI.
Here are some comments about the code:
1. The documentation in recos.h is not very clear. Most of the functions take only IN arguments and are not supposed to return any values. We don't see how the functions are supposed to be used, or what their impact on the ORTE framework data is supposed to be.
2. Why do we have all the char***? Why are they only declared as IN arguments?
3. The orte_recos_base_process_fault_fn_t function uses the node_list as an IN/OUT argument. Why? If the list is modified, then we have a scalability problem, as the list will have to be rebuilt before each call.
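To make the concern in point 3 concrete, here is a minimal sketch in C. All names in it (sketch_node_t, sketch_fault_inout, sketch_fault_in, sketch_call_with_copy) are hypothetical and are not the declarations found in recos.h. We are only assuming that an IN/OUT list may be edited in place by the callee, which is our reading of the interface and may not match the intent.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a node entry; NOT the real ORTE type. */
typedef struct {
    int id;
    int failed;   /* 1 if the node has been reported as failed */
} sketch_node_t;

/* IN/OUT style: the callee may edit the caller's list in place.
 * Here it compacts out the failed nodes and returns the new length. */
static size_t sketch_fault_inout(sketch_node_t *nodes, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (!nodes[i].failed) {
            nodes[kept++] = nodes[i];
        }
    }
    return kept;
}

/* IN-only style: the list is const, so the caller's copy cannot change.
 * The callee only reports how many nodes have failed. */
static size_t sketch_fault_in(const sketch_node_t *nodes, size_t n)
{
    size_t failed = 0;
    for (size_t i = 0; i < n; i++) {
        failed += (nodes[i].failed != 0);
    }
    return failed;
}

/* With the IN/OUT variant, a caller that still needs the complete list
 * must hand over a scratch copy on every call: O(n) extra work per fault. */
static size_t sketch_call_with_copy(const sketch_node_t *master, size_t n,
                                    sketch_node_t *scratch)
{
    memcpy(scratch, master, n * sizeof(*master));
    return sketch_fault_inout(scratch, n);
}
```

With the const variant the caller can pass the same list on every call at no extra cost. With the in-place variant it pays for a copy (or a rebuild) each time, which is the cost we are worried about on large node counts.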
On Feb 19, 2010, at 12:59, Ralph Castain wrote:
> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery capabilities
> WHERE: Impacts a number of ORTE files and a small number of OMPI files
> TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 27-28
> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
> Josh and Ralph have been working on a private branch off of the trunk on extended fault recovery procedures, mostly impacting ORTE. The new code optionally allows ORTE to recover from failed nodes, moving processes to other nodes in order to maintain operation. In addition, the code provides better support for recovering from individual process failures.
> Not all of the work done on the private branch will be brought over in this commit. Some of the MPI-specific code that allows recovery from process failure on-the-fly will be committed separately at a later date.
> This commit will include the infrastructure to support those advanced recovery operations. Among other things, this commit will introduce a new "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies for responding to failures. The default module, called "ignore", will stabilize the runtime environment for other RecoS components. In the absence of other RecoS components it will trigger the default behavior (abort the job) to be executed.
> This branch includes some configure modifications that allow a comma-separated list of options to be passed to the '--with-ft' option. This allows us to enable any combination of 'cr' and 'recos' at build time, specifically so that the RecoS functionality can be enabled independently of the C/R functionality. Most of the changes outside of the ORTE layer are due to symbol cleanup resulting from this modification.
> For example, C/R specific code paths were previously incorrectly marked with:
> #if OPAL_ENABLE_FT == 1
> They are now marked as, where appropriate:
> #if OPAL_ENABLE_FT_CR == 1
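A minimal sketch of how the split guards might behave in a build with fault tolerance enabled but C/R disabled. It assumes that such a build defines OPAL_ENABLE_FT as 1 and OPAL_ENABLE_FT_CR as 0; that mapping is an assumption, not something the RFC states. The helper name sketch_ft_path is hypothetical.

```c
/* Assumption: configure normally defines these macros. The fallback values
 * below mimic a hypothetical build with fault tolerance on but C/R off. */
#ifndef OPAL_ENABLE_FT
#define OPAL_ENABLE_FT 1
#endif
#ifndef OPAL_ENABLE_FT_CR
#define OPAL_ENABLE_FT_CR 0
#endif

/* Hypothetical helper: reports which guarded path was compiled in. */
static const char *sketch_ft_path(void)
{
#if OPAL_ENABLE_FT_CR == 1
    return "cr";             /* C/R-specific code would live here */
#elif OPAL_ENABLE_FT == 1
    return "ft-without-cr";  /* FT enabled, but no C/R code paths */
#else
    return "no-ft";          /* fault tolerance disabled entirely */
#endif
}
```

Under the old single guard, the first branch would have been compiled in whenever OPAL_ENABLE_FT was 1, even in a build that did not ask for C/R.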
> Additionally, C/R specific components have modified configure.m4 files to change:
> AS_IF([test "$ompi_want_ft" = "0"],
> to:
> AS_IF([test "$ompi_want_ft_cr" = "0"],
> We have created a public repo (reference branch, above) with the code to be merged into the trunk. Please feel free to check it out and test it.
> NOTE: the new recovery capability is only active if...
> (a) you configure --with-ft=recos, and
> (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
> Comments, suggestions, and corrections are welcome!