Sorry that it has been a while since the last update. As you know, we had tentatively planned an initial release of ORCM in April. Obviously, that isn't happening. :-/
However, we have made considerable progress towards that goal - enough that we can probably do a 0.x release in June, with a 1.0 release in fall of this year. The 0.x release will include a fully resilient system, which is what has been occupying our time over the last month. This includes:
* recovery of application processes that abnormally terminate. The system will restart the process A times on its current node, where A is a parameter that can be set in two ways: (a) as a system-wide default (via MCA parameter), and (b) as a per-job value given at job start. Once the process hits that limit, the local daemon will request that it be rescheduled to another location. This step can be done B times, where (you guessed it!) B can likewise be set as a system-wide default or given as a per-job value. Once the process hits both limits, it will be marked as "un-runnable" and no further attempts to restart it will be made. The user and/or sys admin will be notified of this condition via a selectable method - again, this notification is controllable by system default and per-job.
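The restart/relocate/give-up logic above can be sketched as a small state machine. This is only an illustration of the policy described, not ORCM code; the class and parameter names are made up, and it assumes the local restart budget (A) resets when the process lands on a new node:

```python
class RestartPolicy:
    """Hypothetical per-process restart tracker mirroring the policy above."""

    def __init__(self, max_local_restarts, max_relocations):
        self.max_local_restarts = max_local_restarts  # "A" in the text
        self.max_relocations = max_relocations        # "B" in the text
        self.local_restarts = 0
        self.relocations = 0

    def on_failure(self):
        """Decide what to do when the process abnormally terminates."""
        if self.local_restarts < self.max_local_restarts:
            self.local_restarts += 1
            return "restart_on_current_node"
        if self.relocations < self.max_relocations:
            self.relocations += 1
            self.local_restarts = 0  # assumption: fresh local budget on the new node
            return "relocate_to_another_node"
        # Both limits exhausted: stop trying and notify the user/sys admin.
        return "mark_unrunnable_and_notify"
```

With A=2 and B=1, for example, a process gets two local restarts, one relocation, two more local restarts on the new node, and is then marked un-runnable.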
* recovery from failed nodes. All processes on the failed node are re-located and restarted. This restart does -not- count against the process limits described above. When the node returns (or is replaced), the system may shift the processes back to it - again, this is a selectable option.
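The node-failure case can be sketched separately; the key point is that relocating displaced processes does not touch their per-process restart counters. This is an illustrative sketch only (the function and data structures are invented, not ORCM internals):

```python
def handle_node_failure(failed_node, process_map, pick_target):
    """Relocate every process from a failed node to live nodes.

    process_map: dict mapping node name -> list of processes on that node.
    pick_target: callable choosing a live node from the remaining process_map.
    """
    displaced = process_map.pop(failed_node, [])
    for proc in displaced:
        target = pick_target(process_map)
        process_map[target].append(proc)
        # Deliberately no counter updates here: unlike an application crash,
        # a node failure does not consume the process's restart budget.
    return displaced
```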
Much of the recent work has focused on conversion from a master-slave arrangement (i.e., where ORCM had a "master" process responsible for scheduling, and "slave" daemons on each node that simply started and monitored local processes) to a peer-based system by removing the "master" process. This has been accomplished by distributing the "master's" responsibilities. Thus, there is no system failure resulting from the loss of any one or more ORCM processes. The system will continue to recover from application failures up until the final node loses power.
It looks like we will not have an interface to a resource manager (e.g., Moab) prior to the initial release. What we will have in its place is the ability for the system to automatically wire itself together as nodes boot, with the ORCM daemon on each node automatically starting as part of the boot procedure. This creates one large system, though it can support multiple simultaneous jobs, each using a portion of the resources. The only limitation is that the resources have to be manually assigned to each job - obviously something that needs to be resolved longer term.
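To make the interim "no resource manager" mode concrete, here is a rough sketch of what manual resource assignment amounts to: jobs run side by side on hand-picked, non-overlapping subsets of the nodes, with no scheduler arbitrating between them. The function is hypothetical, purely to illustrate the limitation:

```python
def assign_resources(available_nodes, job_requests):
    """Validate hand-assigned node lists: each job gets a disjoint subset.

    job_requests: dict mapping job name -> list of nodes assigned by the admin.
    Raises ValueError on overlapping or unknown nodes.
    """
    assigned = {}
    used = set()
    for job, nodes in job_requests.items():
        overlap = used.intersection(nodes)
        if overlap:
            raise ValueError(f"nodes {sorted(overlap)} already assigned to another job")
        missing = set(nodes) - set(available_nodes)
        if missing:
            raise ValueError(f"unknown nodes: {sorted(missing)}")
        assigned[job] = list(nodes)
        used.update(nodes)
    return assigned
```

A real resource manager interface would compute these subsets itself; here the admin supplies them, which is exactly the longer-term gap noted above.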
I have been tasked with documenting ORCM over the next couple of months. I have started entering information on the project web site (under the Open MPI web site), and will continue to do so. This will include video-on-demand presentations that describe various aspects of ORCM architecture and operation.
Please feel free to ask questions and/or contribute ideas. Anyone interested in contributing to the effort is welcome!