This page is part of a frozen web archive of this mailing list.
You can still navigate around this archive, but know that no new mails
have been added to it since July of 2016.
As far as I know, what Josh did is slightly different. In the case of a complete restart (where all processes are restarted from a checkpoint), he sets up and rewires a new set of BTLs.
However, it happens that we do have some code to rewire the MPI processes in case of failure(s) in one of the UTK projects. I'll have to talk with the team here to see if, at this point, there is something we can contribute regarding this matter.
On Dec 15, 2009, at 21:08, Ralph Castain wrote:
> On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:
>> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>>> It probably should be done at a lower level, but it begs a different question. For example, I've created the capability in the new cluster manager to detect interfaces that are lost, ride through the problem by moving affected procs to other nodes (reconnecting ORTE-level comm), and move procs back if/when nodes reappear. So someone can remove a node "on-the-fly" and replace that hardware with another node without having to stop and restart the job, etc. A lot of that infrastructure is now down inside ORTE, though a few key pieces remain in the ORCM code base (and most likely will stay there).
>>> Works great - unless it is an MPI job. If we can figure out a way for the MPI procs to (a) be properly restarted on the "new" node, and (b) update the BTL connection info on the other MPI procs in the job, then we would be good to go...
>>> Trivial problem, I am sure :-)
>> ...actually, the groundwork is there with Josh's work, isn't it? I think the real issue is handling un-graceful BTL failures properly. I'm guessing that's the biggest piece that isn't done...?
> Think so... not sure how to update the BTLs with the new info, but perhaps Josh has already solved that problem.
>> Jeff Squyres
>> devel mailing list