I'm trying to implement checkpointing on our cluster, and I have a basic question.
I guess this has been implemented many times by other users, so I would appreciate it if someone could share their experience with me.
For serial and multithreaded jobs it is fairly clear. But what about parallel jobs?
We have "fat" 16-core nodes, so users run both OpenMP and MPI programs.
Should I just perform some checks in my checkpointing script and call ompi-checkpoint if the checks show that it is an MPI job?
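To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It is only an illustration under several assumptions: the function name and the list of launcher names are made up for this example, I assume the script already knows the path and PID of the job's top-level process, and I use BLCR's cr_checkpoint as a stand-in for whatever is used for non-MPI jobs.

```python
import os

# Assumption: an MPI job can be recognised by the name of its top-level
# process. This list is illustrative, not exhaustive.
MPI_LAUNCHERS = {"mpirun", "mpiexec", "orterun"}


def choose_checkpoint_command(launcher_path, pid):
    """Return the checkpoint command (argv list) for a job's top-level process.

    launcher_path: path or name of the job's top-level executable
    pid:           PID of that process
    """
    name = os.path.basename(launcher_path)
    if name in MPI_LAUNCHERS:
        # MPI job: ompi-checkpoint is given the PID of mpirun.
        return ["ompi-checkpoint", str(pid)]
    # Serial or OpenMP job: assumed here to be handled by a
    # single-process checkpointer (cr_checkpoint is a placeholder choice).
    return ["cr_checkpoint", str(pid)]


if __name__ == "__main__":
    # Only prints the commands; nothing is actually checkpointed.
    print(choose_checkpoint_command("/usr/bin/mpirun", 1234))
    print(choose_checkpoint_command("./my_openmp_app", 5678))
```

The sketch only decides which command to run; actually executing it (and handling failures) would be a separate step in the script.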
What is the "usual" way to do this?