Dear all, I installed the developer version (r14519) and was able to
get it running. I successfully checkpointed a parallel job and
restarted it. My question is: how can I checkpoint the restarted job?
The problem is that once the original job has been terminated and later
restarted, the mpirun process no longer exists (checked with
ps -efa | grep mpirun), so I do not know which PID to pass to
ompi-checkpoint for the restarted job. Any help would be greatly appreciated.