I've been doing some work on fault response within the system, and finally
realized something I should probably have seen a while back. Perhaps I am
misunderstanding somewhere, so forgive the ignorance if so.
When we designed ORTE some time in the deep, dark past, we envisioned
that people might want multiple ways of responding to process faults and/or
abnormal terminations. You might want to just abort the job, attempt to
restart just that proc, attempt to restart the job, etc. To support these
multiple options, and to provide a means for people to simply try new ones,
we created the errmgr framework.
Our thought was that a process and/or daemon would call the errmgr when it
detected something abnormal happening, and that the selected errmgr
component could then do whatever fault response was desired.
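
To make that intent concrete, here is a minimal sketch in C of the kind of
pluggable dispatch I have in mind. Every name in it (fault_response_t,
errmgr_component_t, errmgr_select, errmgr_proc_aborted, and the two
policies) is hypothetical and made up for illustration; this is not the
actual errmgr API, just the general shape of "detection code calls the
framework, the selected component decides the response".

```c
#include <string.h>

/* Hypothetical fault responses; NOT the real ORTE errmgr interface. */
typedef enum {
    RESPONSE_ABORT_JOB,
    RESPONSE_RESTART_PROC,
    RESPONSE_RESTART_JOB
} fault_response_t;

/* A component is a named policy: given a failed proc, pick a response. */
typedef struct {
    const char *name;
    fault_response_t (*proc_aborted)(int rank, int exit_code);
} errmgr_component_t;

/* Policy 1: always take down the whole job. */
static fault_response_t abort_policy(int rank, int exit_code)
{
    (void)rank; (void)exit_code;
    return RESPONSE_ABORT_JOB;
}

/* Policy 2: try to restart only the proc that failed. */
static fault_response_t restart_proc_policy(int rank, int exit_code)
{
    (void)rank; (void)exit_code;
    return RESPONSE_RESTART_PROC;
}

/* Table of available components (assumed to be built in for this sketch). */
static const errmgr_component_t components[] = {
    { "abort",        abort_policy },
    { "restart_proc", restart_proc_policy },
};

/* Select a component by name; fall back to "abort" if nothing matches. */
const errmgr_component_t *errmgr_select(const char *name)
{
    size_t n = sizeof(components) / sizeof(components[0]);
    for (size_t i = 0; i < n; i++) {
        if (name != NULL && strcmp(components[i].name, name) == 0) {
            return &components[i];
        }
    }
    return &components[0];
}

/* Detection code calls this entry point and never names a policy itself. */
fault_response_t errmgr_proc_aborted(const errmgr_component_t *selected,
                                     int rank, int exit_code)
{
    return selected->proc_aborted(rank, exit_code);
}
```

With something of this shape, trying a new fault response means adding one
entry to the table and selecting it by name, rather than editing the code
that detects the failure.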
However, I now see that the fault tolerance mechanisms inside of OMPI do not
seem to be using that methodology. Instead, we have hard-coded a particular
response into the system.
If we configure without FT, we just abort the entire job, since abort is
the only errmgr component that exists.
If we configure with FT, then we execute the hard-coded C/R methodology.
This is built directly into the code, so there is no option as to what
response is taken.
Is there a reason why the errmgr framework was not used? Did the FT team
decide that this was not a useful tool to support multiple FT strategies?
Can we modify it to better serve those needs, or is it simply not feasible?
If it isn't going to be used for that purpose, then I might as well remove
it. As things stand, there really is no purpose served by the errmgr
framework - might as well replace it with just a function call.
I'd appreciate any insights.