As far as I know, what Josh did is slightly different: in the case of a complete restart (where all processes are restarted from a checkpoint), he sets up and rewires a new set of BTLs.
However, we do have some code to rewire the MPI processes in case of failure(s) in one of the UTK projects. I'll have to talk with the team here to see whether there is something we can contribute on this matter at this point.
On Dec 15, 2009, at 21:08 , Ralph Castain wrote:
> On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:
>> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>>> It probably should be done at a lower level, but it begs a different question. For example, I've created the capability in the new cluster manager to detect interfaces that are lost, ride through the problem by moving affected procs to other nodes (reconnecting ORTE-level comm), and move procs back if/when nodes reappear. So someone can remove a node "on-the-fly" and replace that hardware with another node without having to stop and restart the job, etc. A lot of that infrastructure is now down inside ORTE, though a few key pieces remain in the ORCM code base (and most likely will stay there).
>>> Works great - unless it is an MPI job. If we can figure out a way for the MPI procs to (a) be properly restarted on the "new" node, and (b) update the BTL connection info on the other MPI procs in the job, then we would be good to go...
>>> Trivial problem, I am sure :-)
>> ...actually, the groundwork is there with Josh's work, isn't it? I think the real issue is handling un-graceful BTL failures properly. I'm guessing that's the biggest piece that isn't done...?
> I think so... I'm not sure how to update the BTLs with the new info, but perhaps Josh has already solved that problem.
>> Jeff Squyres
>> devel mailing list