When you use 'ompi-restart' to restart a job, it fork/execs a
completely new job, using the restarted processes for the ranks.
However, instead of calling 'mpirun', ompi-restart currently
calls 'orterun'. The two programs are exactly the same (mpirun is a
symbolic link to orterun), so you can look up the PID of 'orterun'
and pass that to ompi-checkpoint to checkpoint the restarted job.
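
As a rough sketch of that lookup (the helper name 'launcher_pids' is made up
for illustration, and I am assuming a Linux-style 'ps' that prints the
executable name in the 'comm' column), something like this filters a process
listing for either launcher name:

```shell
# Hypothetical helper: reads "PID COMMAND" pairs on stdin, as produced by
# `ps -A -o pid= -o comm=`, and prints the PIDs of processes whose
# executable is named mpirun or orterun.
launcher_pids() {
    awk '{
        name = $2
        sub(/.*\//, "", name)   # drop any leading directory path
    }
    name == "mpirun" || name == "orterun" { print $1 }'
}
```

Running 'ps -A -o pid= -o comm= | launcher_pids' should list the candidate
PIDs, and the one belonging to your restarted job is what you hand to
ompi-checkpoint.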
It is confusing that Open MPI makes this switch, though, so I
committed a fix in r18208 that uses the 'mpirun' binary name
instead of the 'orterun' binary name. This fits the typical use
case of checkpoint/restart in Open MPI, in which users expect to find
the 'mpirun' process on restart instead of the lesser-known 'orterun'.
Sorry for the confusion.
On Apr 18, 2008, at 1:14 AM, Tamer wrote:
> Dear all, I installed the developer's version r14519 and was able to
> get it running. I successfully checkpointed a parallel job and
> restarted it. My question is how can I checkpoint the restarted job?
> The problem is once the original job is terminated and restarted later
> on, the mpirun does not exist anymore (ps -efa|grep mpirun) and hence
> I do not know which PID I should use when I run the ompi-checkpoint on
> the restarted job. Any help would be greatly appreciated.