On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
> It probably should be done at a lower level, but it raises a different question. For example, I've created the capability in the new cluster manager to detect lost interfaces, ride through the problem by moving the affected procs to other nodes (reconnecting ORTE-level comm), and move the procs back if/when the nodes reappear. So someone can remove a node "on-the-fly" and replace that hardware with another node without having to stop and restart the job. A lot of that infrastructure is now down inside ORTE, though a few key pieces remain in the ORCM code base (and most likely will stay there).
> Works great - unless it is an MPI job. If we can figure out a way for the MPI procs to (a) be properly restarted on the "new" node, and (b) update the BTL connection info on the other MPI procs in the job, then we would be good to go...
> Trivial problem, I am sure :-)
...actually, the groundwork is there with Josh's work, isn't it? I think the real issue is handling ungraceful BTL failures properly. I'm guessing that's the biggest piece that isn't done...?