Checkpointing and restarting Open MPI applications does not work for me.
I have a Red Hat 5U6 system with BLCR version 0.8.4
and Open MPI version 1.6.3.
I have a simple parallel application that I want to checkpoint and restart.
lsmod shows that the BLCR kernel modules are loaded.
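These are the steps I run. First I start the job:
    mpirun -np 1 -hostfile hostfile -am ft-enable-cr EXECUTABLE
Then I checkpoint it:
    ompi-checkpoint -v -s <PID of mpirun>
Then I kill mpirun and try to restart from the snapshot:
    ompi-restart -v ompi_global_snapshot_<PID>.ckpt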
Here are my results:
Error: Unable to obtain the proper restart command to restart from the
checkpoint file (opal_snapshot_0.ckpt). Returned -1.
Check the installation of the none checkpoint/restart service
on all of the machines in your system.
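
The sequence above can be sketched as a small shell script. This is only an illustration: the function names snapshot_name and run_sequence, the 10 second delay, and the EXECUTABLE and hostfile arguments are placeholders and not part of Open MPI. The commands and flags are the same ones listed in the steps.

```shell
#!/bin/sh
# Sketch of the checkpoint/restart sequence described in this message.
# snapshot_name, run_sequence and the sleep duration are illustrative
# assumptions; only the mpirun/ompi-* command lines come from the steps above.

# Build the global snapshot name for a given mpirun PID, following the
# ompi_global_snapshot_<PID>.ckpt pattern used in the restart command.
snapshot_name() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

# Start the job, checkpoint it, kill mpirun, then attempt a restart.
run_sequence() {
    exe=$1
    hostfile=$2

    # Start the job in the background with checkpoint/restart enabled.
    mpirun -np 1 -hostfile "$hostfile" -am ft-enable-cr "$exe" &
    mpirun_pid=$!

    # Arbitrary delay so the application is running before the checkpoint.
    sleep 10

    # Checkpoint the running job; stop here if the checkpoint itself fails.
    ompi-checkpoint -v -s "$mpirun_pid" || return 1

    # Kill mpirun, as in the manual steps, and reap the background process.
    kill "$mpirun_pid"
    wait "$mpirun_pid" 2>/dev/null

    # Try to restart from the global snapshot.
    ompi-restart -v "$(snapshot_name "$mpirun_pid")"
}
```

It would be invoked as `run_sequence ./EXECUTABLE hostfile`, and the last command is the one that produces the error shown above.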
If I use the BLCR utilities directly (cr_run, cr_checkpoint, cr_restart), checkpoint and restart work on the local machine, but not on more than one machine.
Please help me with this.
With Blessings, always,
IT Infrastructure Branch | Division of Information Systems
Weizmann Institute of Science
Rehovot 76100, Israel