I know this list has been rather quiet since its inception over the holidays. However, I can assure you that a great deal of work has been done! So I thought it might be helpful if I provided a little status update - perhaps enough to get the conversation started.
Note that the basic cluster manager application (orcm) has been working for quite some time now. It reliably detects process failure, remaps the process to an appropriate node, and restarts it. If the application is using the ORCM communication layer, then it is automatically rewired.
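For anyone new to the project, the detect/remap/restart cycle described above can be pictured with a small sketch. This is illustrative Python only, not the orcm code: the `Node` class, `pick_node`, `handle_proc_failure`, and the "least-loaded healthy node, preferring a different node than the one where the failure occurred" policy are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a cluster node (not an ORCM data structure)."""
    name: str
    slots: int
    procs: set = field(default_factory=set)
    up: bool = True


def pick_node(nodes, exclude=None):
    """Return the least-loaded healthy node with a free slot, or None."""
    candidates = [
        n for n in nodes
        if n.up and n is not exclude and len(n.procs) < n.slots
    ]
    return min(candidates, key=lambda n: len(n.procs), default=None)


def handle_proc_failure(nodes, proc_id):
    """Remove a failed process from its node and remap it elsewhere.

    Returns the name of the node chosen for the restart, or None if no
    node has a free slot. The actual restart is out of scope here.
    """
    failed_on = next((n for n in nodes if proc_id in n.procs), None)
    if failed_on is None:
        raise KeyError(f"unknown process: {proc_id}")
    failed_on.procs.discard(proc_id)

    # Assumed policy: prefer a different node, fall back to the same one.
    target = pick_node(nodes, exclude=failed_on) or pick_node(nodes)
    if target is None:
        return None
    target.procs.add(proc_id)
    return target.name
```

With three two-slot nodes holding {a, b}, {c}, and nothing, a failure of "a" is remapped to the empty node under this assumed policy.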
What a couple of us have been doing over the last few weeks is using a particular application to drive development of the ORCM communication layer. This layer is important as it will provide the cluster manager and its daemons with a method for communicating in the face of failures, so we felt it critical that we nail the subsystem down. Most recently, I have added thread safety to that layer, and provided the ability to send and recv messages in parallel.
I encourage you to look at the wiki's "to-do" list (https://svn.open-mpi.org/trac/orcm/wiki/ActionItems) to see where we plan to go next. There are three areas that are actively under development:
1. making the underlying "reliable multicast" subsystem actually reliable. ORCM relies on ORTE's rmcast framework to provide this service - I will be working on that this week.
2. detecting and responding to node failures. At the moment, ORCM only reliably responds to process failures - code to respond to failure of an entire node has been written into the orcm application, but has not been well tested or debugged.
3. making the ORTE layer thread safe so we can enable the OPAL progress thread. This will allow async progress to occur and greatly improve ORCM's messaging capability. However, it is a significant challenge as well.
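To give a feel for what "actually reliable" means in item 1, here is a minimal sketch of one common approach: per-sender sequence numbers, in-order delivery, and gap detection so the receiver can ask for retransmission. It is illustrative Python and makes no claim about how ORTE's rmcast framework is or will be implemented; the `ReliableReceiver` class and its method names are hypothetical.

```python
class ReliableReceiver:
    """Receiver side of a sequence-numbered stream from one sender.

    Hypothetical sketch: buffers out-of-order messages, delivers them
    in order, and reports which sequence numbers are still missing.
    """

    def __init__(self):
        self.next_seq = 0      # next sequence number we expect to deliver
        self.pending = {}      # out-of-order messages, keyed by seq
        self.delivered = []    # payloads handed to the application, in order

    def missing(self):
        """Sequence numbers below the highest buffered one not yet seen."""
        if not self.pending:
            return []
        return [
            s for s in range(self.next_seq, max(self.pending))
            if s not in self.pending
        ]

    def on_message(self, seq, payload):
        """Accept one message; return the seq numbers to request again."""
        # Ignore duplicates of delivered or already buffered messages.
        if seq >= self.next_seq and seq not in self.pending:
            self.pending[seq] = payload

        # Deliver everything that is now contiguous.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

        return self.missing()
```

If messages 0 and 2 arrive, the receiver delivers 0, holds 2, and reports 1 as missing; once 1 arrives, both 1 and 2 are delivered in order. A real implementation would also need timeouts for a lost final message and bounded buffers, which this sketch leaves out.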
Some additional work is being done off to the side so that the developers involved can publish it, use it in a thesis, etc. In addition, there is strong overlap with the work being done in the OMPI community on ORTE, so some of the discussion will take place on that community's developer mailing list where it impacts ORTE specifically. I will try to keep this list informed of those discussions so you don't have to subscribe to two lists!
Obviously, the action item list on the wiki is far from complete. Anyone interested in contributing, or with comments or suggestions on the above areas, anything on the wiki, or any other area, is more than welcome to speak up. Either send to the list, or feel free to write to me directly.
I believe we are still on track for a formal "1.0" release sometime late this quarter, or next quarter at the latest. Not everything will be completed by then, but we would like to see at least the major elements (node recovery, reliable messaging, thread safety) in place. I'm hoping this community will assist in deciding when to release.
I apologize for not having gotten the wiki built up with design info, presentations, etc. I am committed to getting that info put together on the wiki as quickly as possible. As I do, I will send a notice out to this list.