I believe you are thinking parallel to what Josh and I have been doing, and slightly different from the UTK approach. The "orcm" method follows what you describe: we keep the surviving nodes operating, see if we can use a new node to replace the failed one, and redistribute the affected procs (those on the failed node) either to existing nodes or to new ones.
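As a rough illustration of that redistribution step (all names here are invented for the example; this is not actual orcm/ORTE code), the policy might be sketched as:

```python
# Hypothetical sketch of the "orcm"-style redistribution policy described
# above: prefer replacing the failed node with a fresh one wholesale;
# otherwise spread its procs across the surviving nodes. Not ORTE code.

def redistribute(failed_node, procs_by_node, spare_nodes):
    """Return the new proc placement after failed_node dies."""
    affected = procs_by_node.pop(failed_node)
    if spare_nodes:
        # A new node is available: move the affected procs as a group.
        replacement = spare_nodes.pop(0)
        procs_by_node[replacement] = affected
    else:
        # No spare: redistribute across existing nodes, least-loaded first.
        for proc in affected:
            target = min(procs_by_node, key=lambda n: len(procs_by_node[n]))
            procs_by_node[target].append(proc)
    return procs_by_node
```

With spares present the procs move as a group to the replacement node; without spares they are balanced over the survivors.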
I believe UTK's approach focuses on retaining operation of the existing nodes, redistributing procs across them. I suspect we will eventually integrate some of these operations so that users can exploit the best of both methods.
Josh hasn't exposed his MPI recovery work yet. As he mentioned in his response, he has done some things in this area that are complementary to the UTK method. He just needs to finish his thesis before making them public. :-)
>> Hum... I'm really afraid of this. I understand your choice, since it is a good solution for fail/stop/restart behaviour, but looking at it from the fail/recovery side, can you envision some alternative for reconfiguring the orteds on the fly?
> I don't see why the current code prohibits such behavior. However, I don't yet see in this branch how the remaining daemons (and MPI processes) reconstruct the communication topology, but that is just a technicality. When you say MPI layer, what exactly do you mean? The MPI interface, or the network stack that supports MPI communication (BTL, PML, etc.)?
> Anyway, this is the code that UT will bring in. All our work focuses on keeping the existing environment up and running instead of restarting everything. The orted will auto-heal (i.e., reshape the underlying topology, recreate the connections, and so on), and the fault is propagated to the MPI layer, which will decide what to do next.
In my mind, an orted failure (taking down all procs running under that daemon) is an environment failure which leads to job failures. Thus, to use a fail/recovery strategy, the daemon should first be recovered (possibly by relaunching it and updating its procs/jobs structures), and then all failed procs originally running under it should be recovered as well (perhaps from a checkpoint, optionally with a log). Of course, if available, a spare orted could be used.
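A minimal sketch of that fail/recovery sequence, with invented helper names and assuming a checkpoint exists for each proc (this is an illustration of the idea, not ORTE code):

```python
# Hypothetical sketch of the daemon fail/recovery sequence outlined above:
# bring a daemon back (preferring a spare orted), restore its proc/job
# bookkeeping, then restart its procs from their checkpoints.
# All helper names are invented for illustration.

def recover_daemon(failed, job_state, spare_daemons, restart_from_checkpoint):
    # Step 1: recover the daemon, using a spare if one is available,
    # otherwise relaunching it in place.
    daemon = spare_daemons.pop(0) if spare_daemons else failed
    # Step 2: update the procs/jobs structures to point at the new daemon.
    procs = job_state.pop(failed)
    job_state[daemon] = procs
    # Step 3: recover every proc that was running under the failed daemon.
    return [restart_from_checkpoint(p, daemon) for p in procs]
```

After the daemon is back, each affected proc is restarted under it, which is where the checkpoint (and optional message log) comes in.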
Regarding the MPI application, this 'environment reconfiguration' probably requires updates or reconfiguration of the communication stack that supports MPI communication (BTL, PML, etc.).
Are we thinking in the same direction, or have I missed something along the way?
devel mailing list