On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:

Ralph, Josh,

We have some comments about the API of the new framework, mostly requests for clarification so we can better understand how it is supposed to be used. We would also like to request a deadline extension, delaying the merge of the Recos branch into the trunk by a week.

We have our own FT branch, which takes a totally different approach from the one described in your RFC. Unfortunately, it diverged from the trunk about a year ago, and merging back has proven to be quite a difficult task. Some of the functionality in the Recos framework is clearly beneficial for what we did, and has the potential to facilitate porting most of the features from our branch back into the trunk. We would like the deadline extension in order to analyze in depth the impact of the Recos framework on our work, and to see how we can fit everything together back in the trunk of Open MPI.

No problem with the extension - feel free to suggest modifications to make the merge easier. This is by no means cast in stone, but rather a starting point.

Here are some comments about the code:

1. The documentation in recos.h is not very clear. Most of the functions use only IN arguments, and are not supposed to return any values. We don't see how the functions are supposed to be used, or what their impact on the ORTE framework data is supposed to be.

I'll try to clarify the comments tonight (I know Josh is occupied right now). The recos APIs are called from two locations:

1. The errmgr calls recos whenever it receives a report of an aborted process (via the errmgr.proc_aborted API). The idea was for recos to determine what (if anything) to do about the failed process. 

2. The rmaps modules can call the recos "suggest_map_targets" API to get a list of suggested nodes for the process that is to be restarted. At the moment, only the resilient mapper module does this. However, Josh and I are looking at reorganizing some functionality currently in that mapper module and making all of the existing mappers be "resilient".

So basically, the recos modules determine the recovery procedure and execute it. For example, in the "orcm" module, we actually update the various proc/job objects to prep them for restart and call plm.spawn from within that module. If instead you use the ignore module, it falls through to the recos base functions which call "abort" to kill the job. Again, the action is taken local to recos, so nothing need be returned.

The functions generally don't return values (other than success/error) because we couldn't think of anything useful to return to the errmgr. Whatever recos does about an aborted proc, the errmgr doesn't do anything further - if you look in that code, you'll see that if recos is enabled, all the errmgr does is call recos and return.
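The call flow described above can be sketched roughly as follows. This is only an illustration, not the actual ORTE code: every type and function name prefixed with `sketch_` is hypothetical, the counters stand in for the real abort and respawn actions, and the fallback taken when recos is disabled is an assumption based on the default behavior (abort the job) mentioned in the RFC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real ORTE types or signatures. */
typedef struct { int jobid; int vpid; } sketch_proc_name_t;

/* A recovery module exposes one entry point and returns only success/error. */
typedef int (*sketch_process_fault_fn_t)(const sketch_proc_name_t *name);

typedef struct {
    bool enabled;                            /* mirrors the "recos enabled" switch */
    sketch_process_fault_fn_t process_fault; /* the active module's handler */
} sketch_recos_t;

enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Counters standing in for the real side effects (abort job, respawn proc). */
static int sketch_abort_calls = 0;
static int sketch_restart_calls = 0;

/* "ignore"-like module: falls through to the default action, aborting the job. */
static int sketch_ignore_process_fault(const sketch_proc_name_t *name)
{
    (void)name;
    sketch_abort_calls++;
    return SKETCH_SUCCESS;
}

/* "orcm"-like module: preps the proc for restart and respawns it locally. */
static int sketch_restart_process_fault(const sketch_proc_name_t *name)
{
    (void)name;
    sketch_restart_calls++;
    return SKETCH_SUCCESS;
}

/* errmgr side: if recovery is enabled, hand the failure to the module and
 * return its status. Nothing else comes back, because the module acts locally. */
static int sketch_errmgr_proc_aborted(const sketch_recos_t *recos,
                                      const sketch_proc_name_t *name)
{
    assert(name != NULL);
    if (recos != NULL && recos->enabled && recos->process_fault != NULL) {
        return recos->process_fault(name);
    }
    /* Assumption: with recovery disabled, the job is simply aborted. */
    sketch_abort_calls++;
    return SKETCH_SUCCESS;
}
```

Because the handler performs the recovery itself, the caller needs nothing beyond the status code, which is why the real functions take only IN arguments.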

Again, this can be changed if desired.

2. Why do we have all the char***? Why are they only declared as IN arguments?

I take it you mean in the predicted fault API? I believe Josh was including that strictly as a placeholder. As you undoubtedly recall, I removed the fddp framework from the trunk (devel continues off-line), so Josh wasn't sure what I might want to input here. If you look at the modules themselves, you will see the implementation is essentially empty at this time.

We had discussed simply removing that API for now until we determined if/when fault prediction would return to the OMPI trunk. It was kind of a tossup, so we left it for now. It could just as easily be removed until a later date; either way is fine with us.

3. The orte_recos_base_process_fault_fn_t function uses the node_list as an IN/OUT argument. Why? If the list is modified, then we have a scalability problem, as the list will have to be rebuilt before each call.


typedef int (*orte_recos_base_process_fault_fn_t)
    (orte_job_t *jdata, orte_process_name_t *proc_name, orte_proc_state_t state, int *stack_state);

There is no node list, or list of any type, going in or out of that function. I suspect you meant the one below it:

typedef int (*orte_recos_base_suggest_map_targets_fn_t)
    (orte_proc_t *proc, orte_node_t *oldnode, opal_list_t *node_list);

I concur with your concern about scalability here. However, I believe the idea was that we would pass in the proc that failed and is to be restarted, a pointer to the node it was last on, and return a list of candidate nodes where it could be restarted. Essentially, this is the equivalent of building the target node list that we do in the mappers whenever we map a job.

So in the implementation, we use the rmaps base function to assemble the target node list for the app, and then go through some logic (e.g., remove the old node, look at fault groups and load balancing) to prune the list down. We then pass the resulting list back to the caller.

If we are going to have frequent process failures, then rebuilding the candidate node list every time would indeed be a problem. I suspect we'll have to revisit that implementation at some point.
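To make the pruning step concrete, here is a minimal sketch that builds a candidate list by copying every known node except the one the process last ran on. It uses plain arrays and hypothetical `sketch_` names rather than opal_list_t and the real rmaps base function, and it leaves out the fault-group and load-balancing logic entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_NODES 8

/* Hypothetical node record; the real code passes orte_node_t in an opal_list_t. */
typedef struct {
    const char *name;
    int slots_in_use;
} sketch_node_t;

/* Fill `candidates` with every node from `all` except `oldnode`, up to `max`
 * entries, and return how many were written. The list is rebuilt from scratch
 * on each call, which is the scalability concern raised above. */
static size_t sketch_suggest_map_targets(const sketch_node_t *all, size_t n_all,
                                         const char *oldnode,
                                         sketch_node_t *candidates, size_t max)
{
    size_t n = 0;

    assert(all != NULL && oldnode != NULL && candidates != NULL);
    for (size_t i = 0; i < n_all && n < max; i++) {
        if (strcmp(all[i].name, oldnode) == 0) {
            continue; /* never suggest the node the proc just failed on */
        }
        candidates[n++] = all[i];
    }
    return n;
}
```

Each call walks the full node array, so the cost grows with the number of nodes times the number of failures handled. Caching the base list between calls would be one way to address that, though the thread does not settle on an approach.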



On Feb 19, 2010, at 12:59 , Ralph Castain wrote:

WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk

WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery capabilities

WHERE: Impacts a number of ORTE files and a small number of OMPI files

TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 27-28

REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/



Josh and Ralph have been working on a private branch off of the trunk on extended fault recovery procedures, mostly impacting ORTE. The new code optionally allows ORTE to recover from failed nodes, moving processes to other nodes in order to maintain operation. In addition, the code provides better support for recovering from individual process failures.

Not all of the work done on the private branch will be brought over in this commit. Some of the MPI-specific code that allows recovery from process failure on-the-fly will be committed separately at a later date.

This commit will include the infrastructure to support those advanced recovery operations. Among other things, this commit will introduce a new "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies for responding to failures. The default module, called "ignore", will stabilize the runtime environment for other RecoS components. In the absence of other RecoS components, it will trigger the default behavior (aborting the job).

This branch includes some configure modifications that allow a comma-separated list of options to be passed to the '--with-ft' option. This allows us to enable any combination of 'cr' and 'recos' at build time, specifically so that the RecoS functionality can be enabled independently of the C/R functionality. Most of the changes outside of the ORTE layer are due to symbol cleanup resulting from this modification.

For example, C/R specific code paths were previously incorrectly marked with:
They are now marked as, where appropriate:

Additionally, C/R specific components have modified configure.m4 files to change:
AS_IF([test "$ompi_want_ft" = "0"],
AS_IF([test "$ompi_want_ft_cr" = "0"],

We have created a public repo (reference branch, above) with the code to be merged into the trunk. Please feel free to check it out and test it.

NOTE: the new recovery capability is only active if...
 (a) you configure --with-ft=recos, and
 (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!

Comments, suggestions, and corrections are welcome!

devel mailing list
