I am using Open MPI v1.3.4 with BLCR 0.8.2. I have been testing my
Open MPI based program on a 3-node cluster (each node is an Intel
Nehalem based dual quad-core machine), and I have been able to
checkpoint and restart the program successfully multiple times.
Recently I moved to a 15-node cluster with the same configuration and
started seeing a problem with ompi-restart.
ompi-checkpoint completes successfully, and I terminate the program
after that. I have verified that no MPI processes are left running
before I restart. When I restart using ompi-restart, a few of the MPI
processes fail to restart with the error message "found pid 4185 in
use; Restart failed: Device or Resource busy" (with different pid
numbers, of course). What I found is that when such an MPI process was
restarted, it was restarted on a different node from the one it was
running on before termination, and it could not reuse its original pid
there.
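To confirm that a given pid really is taken on the node where the restart lands, a check along these lines can be run on that node. This is only a minimal sketch: the function name pid_in_use is made up for illustration, and 4185 is just the pid from the error message above.

```shell
#!/bin/sh
# pid_in_use PID
# Hypothetical helper: succeeds (exit status 0) if a process with the
# given PID currently exists on this node, fails otherwise.
pid_in_use() {
    # "kill -0" delivers no signal; it only checks that the PID exists
    # and that we are allowed to signal it.
    if kill -0 "$1" 2>/dev/null; then
        return 0
    fi
    # "kill -0" also fails for a PID owned by another user, so fall back
    # to ps, which can see processes we are not allowed to signal.
    ps -p "$1" >/dev/null 2>&1
}
```

For example, `pid_in_use 4185 && echo "pid 4185 is already taken on this node"` run on the target node shows whether the restart would collide.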
Unlike cr_restart (BLCR), ompi-restart has no option, such as
"--no-restore-pid", to tell it not to reuse the original pid. Since
ompi-restart in turn calls cr_restart, I tried aliasing cr_restart to
"cr_restart --no-restore-pid". This made the "pid in use" problem go
away, and the process completes successfully. However, if I call
ompi-checkpoint on the restarted Open MPI job, both the job (all MPI
processes) and the checkpoint command hang forever. My guess is that
this happens because ompi-restart has a different set of pids from the
pids that are actually running.
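For anyone who wants to reproduce the workaround, one way to get the same effect as the alias is a small wrapper script named cr_restart placed ahead of the real binary in PATH. The sketch below is an assumption about how to set that up, not the exact setup used here: the function name install_cr_restart_wrapper, the wrapper directory, and the path of the real cr_restart are all placeholders.

```shell
#!/bin/sh
# install_cr_restart_wrapper WRAPPER_DIR REAL_CR_RESTART
# Hypothetical helper: writes an executable script WRAPPER_DIR/cr_restart
# that forwards every argument to the real cr_restart and always adds
# --no-restore-pid.
install_cr_restart_wrapper() {
    wrapper_dir=$1
    real_cr_restart=$2
    mkdir -p "$wrapper_dir" || return 1
    # The here-document expands $real_cr_restart now, but \$@ is escaped
    # so the generated script receives a literal "$@".
    cat > "$wrapper_dir/cr_restart" <<EOF
#!/bin/sh
exec "$real_cr_restart" --no-restore-pid "\$@"
EOF
    chmod +x "$wrapper_dir/cr_restart"
}
```

A possible use would be `install_cr_restart_wrapper "$HOME/cr-wrapper" /usr/bin/cr_restart`, followed by putting `$HOME/cr-wrapper` first in PATH on every node before calling ompi-restart. This only reproduces the workaround; it does not address the ompi-checkpoint hang described above.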
Long story short, I am stuck with this problem because I cannot get
the original pids back during restart.
I would really appreciate any other options you can share that I could
use to overcome this problem.