I am using Open MPI v1.3.4 with BLCR 0.8.2. I have been testing my Open MPI based program on a 3-node cluster (each node is an Intel Nehalem based dual quad-core machine), and I have been able to checkpoint and restart the program successfully multiple times.
Recently I moved to a 15-node cluster with the same configuration, and I started seeing a problem with ompi-restart.
ompi-checkpoint completes successfully, and I terminate the program after that. I have ensured that no MPI processes are left running before I restart. When I restart using ompi-restart, a few of the MPI processes fail to restart with the error message “found pid 4185 in use; Restart failed: Device or Resource busy” (with different pid numbers each time, of course). What I found was that the failing MPI process was being restarted on a different node from the one it was running on before termination, and its original pid was already in use on that node.
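To make the failure condition concrete, here is a minimal Python sketch of one way to test whether a given pid is already taken on the local node. This is only an illustration of the condition behind the “pid in use” message; it is not how BLCR itself performs the check, and the function name `pid_in_use` is my own.

```python
import os


def pid_in_use(pid: int) -> bool:
    """Return True if a process with this pid exists on the local node.

    Illustrative only: this is NOT BLCR's own check, just one common
    way to probe for a pid from user space.
    """
    if pid <= 0:
        # 0 and negative values address process groups, not one process.
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)  # signal 0 performs the lookup without sending anything
    except ProcessLookupError:
        return False     # no such process: the pid is free
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


if __name__ == "__main__":
    # The current process always exists, so its pid is in use.
    print(pid_in_use(os.getpid()))
```

Running a check like this on the target node for each checkpointed pid before restarting would show which ranks are going to collide.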
Unlike cr_restart (BLCR), ompi-restart has no option such as “--no-restore-pid” to tell it not to reuse the original pid. Since ompi-restart in turn calls cr_restart, I tried aliasing cr_restart to “cr_restart --no-restore-pid”. This made the “pid in use” problem go away, and the restarted job runs to completion. However, if I call ompi-checkpoint on the restarted Open MPI job, both the job (all MPI processes) and the checkpoint command hang forever. My guess is that this happens because ompi-restart has a different set of pids recorded than the pids of the processes that are actually running.
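For anyone who wants to reproduce the workaround without relying on a shell alias, the same effect could be sketched as a small wrapper installed under the name cr_restart ahead of the real binary in PATH. The path in `REAL_CR_RESTART` and the helper names are hypothetical; only the “--no-restore-pid” option comes from cr_restart itself. Note that this reproduces the workaround only, so the ompi-checkpoint hang described above would still be expected.

```python
import os

# Hypothetical location of the real BLCR binary after renaming it;
# adjust to wherever cr_restart actually lives on the cluster.
REAL_CR_RESTART = "/usr/local/bin/cr_restart.real"


def build_argv(args):
    """Return the argument vector for the real cr_restart,
    adding --no-restore-pid unless the caller already passed it."""
    argv = [REAL_CR_RESTART]
    if "--no-restore-pid" not in args:
        argv.append("--no-restore-pid")
    argv.extend(args)
    return argv


def exec_restart(args):
    """Replace the current process with the real cr_restart.
    A wrapper script would call this with sys.argv[1:]."""
    os.execv(REAL_CR_RESTART, build_argv(args))
```

The argument list is built in a separate function so it can be inspected without actually launching a restart.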
Long story short, I am stuck with this problem because I cannot get the original pids back during restart.
I would really appreciate any other options you can share that I could use to overcome this problem.