Hi Ralph and Josh,
>>> Regarding the schema shown in the picture, I didn't understand RecoS's behaviour in a node-failure situation.
>>> In this case, will mpirun treat the daemon failure as a normal proc failure? If so, should mpirun update the global proc state for all jobs running under the failed daemon?
>> I haven't included the node failure case yet - still on my "to-do" list. In brief, the answer is yes/no. :-)
>> Daemon failure follows the same code path as shown in the flow chart. However, it is up to the individual modules to determine a response to that failure. The "orcm" RecoS module response is to (a) mark all procs on that node as having failed, (b) mark that node as "down" so it won't get reused, and (c) remap and restart all such procs on the remaining available nodes, starting new daemon(s) as required.
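>> The three-step "orcm" response above can be sketched roughly as follows. This is a hypothetical, self-contained simulation, not the actual ORTE/orcm code: the struct layouts, names (`handle_daemon_failure`, `node_t`, `proc_t`), and the round-robin remapping policy are all assumptions made for illustration.
>>
>> ```c
>> #include <stdio.h>
>> #include <assert.h>
>>
>> #define NUM_NODES 3
>> #define NUM_PROCS 4
>>
>> typedef enum { NODE_UP, NODE_DOWN } node_state_t;
>> typedef enum { PROC_RUNNING, PROC_FAILED } proc_state_t;
>>
>> typedef struct { node_state_t state; } node_t;
>> typedef struct {
>>     int node;           /* index of the node hosting this proc */
>>     proc_state_t state;
>> } proc_t;
>>
>> /* Hypothetical sketch of the "orcm" RecoS response to a daemon failure:
>>  * (a) mark all procs on the failed node as failed,
>>  * (b) mark that node "down" so it won't get reused,
>>  * (c) remap and restart the failed procs on the remaining "up" nodes. */
>> static void handle_daemon_failure(node_t *nodes, int nnodes,
>>                                   proc_t *procs, int nprocs,
>>                                   int failed_node)
>> {
>>     /* (a) + (b) */
>>     nodes[failed_node].state = NODE_DOWN;
>>     for (int p = 0; p < nprocs; p++)
>>         if (procs[p].node == failed_node)
>>             procs[p].state = PROC_FAILED;
>>
>>     /* (c) round-robin remap onto surviving nodes, then "restart" */
>>     int target = 0;
>>     for (int p = 0; p < nprocs; p++) {
>>         if (procs[p].state != PROC_FAILED)
>>             continue;
>>         while (nodes[target].state != NODE_UP)
>>             target = (target + 1) % nnodes;
>>         procs[p].node = target;
>>         procs[p].state = PROC_RUNNING;  /* restarted on its new node */
>>         target = (target + 1) % nnodes;
>>     }
>> }
>>
>> int main(void)
>> {
>>     node_t nodes[NUM_NODES] = { {NODE_UP}, {NODE_UP}, {NODE_UP} };
>>     proc_t procs[NUM_PROCS] = { {0, PROC_RUNNING}, {1, PROC_RUNNING},
>>                                 {1, PROC_RUNNING}, {2, PROC_RUNNING} };
>>
>>     handle_daemon_failure(nodes, NUM_NODES, procs, NUM_PROCS, 1);
>>
>>     assert(nodes[1].state == NODE_DOWN);
>>     for (int p = 0; p < NUM_PROCS; p++) {
>>         assert(procs[p].state == PROC_RUNNING);
>>         assert(procs[p].node != 1);  /* nothing left on the down node */
>>     }
>>     printf("recovery ok\n");
>>     return 0;
>> }
>> ```
>>
>> The real code additionally has to launch replacement daemons when the surviving nodes don't yet host one; that step is omitted here.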
>> In the orcm environment, nodes that are replaced or rebooted automatically start their own daemon. This is detected by orcm, and the node state (if the node is rebooted) will automatically be updated to "up" - if it is a new node, it is automatically added to the available resources. This allows the node to be reused once the problem has been corrected. In other environments (ssh, slurm, etc), the node is simply left as "down" as there is no way to know if/when the node becomes available again.
>> If you aren't using the "orcm" module, then the default behavior is to abort the job.
> Just to echo this response. The orted and process failures use the same error path, but can be easily differentiated by their jobids. The 'orcm' component is a good example of differentiating these two fault scenarios to correctly recover the ORTE job. Soon we may/should/will have the same ability with certain MPI jobs. :)
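> To make the jobid-based differentiation concrete, here is a minimal hedged sketch. It assumes daemons run under a reserved jobid distinct from application jobs (in ORTE the orteds belong to a different job than the app procs); the names (`name_t`, `is_daemon_failure`) and the literal value `0` are illustrative assumptions, not the real ORTE types.
>
> ```c
> #include <stdio.h>
> #include <assert.h>
> #include <stdbool.h>
>
> /* Assumption for illustration: daemons carry a reserved jobid (0 here),
>  * while application procs get jobids >= 1. */
> #define DAEMON_JOBID 0
>
> typedef struct { int jobid; int vpid; } name_t;
>
> /* A single error path can branch on the jobid of the failed entity. */
> static bool is_daemon_failure(const name_t *who)
> {
>     return who->jobid == DAEMON_JOBID;
> }
>
> int main(void)
> {
>     name_t orted = { DAEMON_JOBID, 3 };  /* a failed daemon */
>     name_t app   = { 1, 7 };             /* a failed app proc */
>
>     assert(is_daemon_failure(&orted));
>     assert(!is_daemon_failure(&app));
>     printf("classified ok\n");
>     return 0;
> }
> ```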
Hmm... I'm really concerned about this. I understand your choice, since it is a good solution for fail/stop/restart behaviour, but looking at it from the fail/recovery side, can you envision an alternative that reconfigures the orteds on the fly?