Unfortunately, our checkpoint/restart (c/r) support is deprecated because we lost the person who wrote and maintained it. I'm afraid we can't really help with it, and c/r support will not be included in future releases unless someone becomes available to support it again.

On Jan 13, 2013, at 4:37 AM, Jerry Mersel <jerry.mersel@weizmann.ac.il> wrote:

Checkpointing and restarting Open MPI applications doesn't work for me.

I have a Red Hat 5U6 system with BLCR version 0.8.4
and Open MPI version 1.6.3.

I have a simple parallel application that I want to checkpoint and restart.

I see that the blcr modules are loaded (with lsmod).

I run:
mpirun  -np 1 -hostfile hostfile -am ft-enable-cr  EXECUTABLE
ompi-checkpoint -v -s <PID of mpirun>

Then I kill mpirun and run:

ompi-restart -v ompi_global_snapshot_<PID>.ckpt
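
The sequence above can be collected into a small script. This is only a sketch: the function names snapshot_name and run_cr_cycle, the 10-second startup delay, and the assumption that the snapshot is named after the PID of mpirun (as in the ompi-restart command above) are illustrative, and it uses only the flags already shown in this message.

```shell
#!/bin/sh
# Sketch of the checkpoint/restart sequence described above.
# snapshot_name and run_cr_cycle are hypothetical helper names.

# Build the global snapshot reference from the PID of mpirun,
# following the naming used in the ompi-restart command above.
snapshot_name() {
    echo "ompi_global_snapshot_$1.ckpt"
}

# Run one checkpoint/kill/restart cycle for the given executable.
run_cr_cycle() {
    exe=$1
    hostfile=$2

    # Start the job with checkpoint/restart support enabled.
    mpirun -np 1 -hostfile "$hostfile" -am ft-enable-cr "$exe" &
    pid=$!

    # Give the application time to start (arbitrary delay, an assumption).
    sleep 10

    # Take a checkpoint; stop here if it fails.
    ompi-checkpoint -v -s "$pid" || return 1

    # Kill mpirun, then restart from the snapshot.
    kill "$pid"
    wait "$pid" 2>/dev/null
    ompi-restart -v "$(snapshot_name "$pid")"
}

# Example: run_cr_cycle ./EXECUTABLE hostfile
```

Nothing runs until run_cr_cycle is called, so the file can be sourced and the steps tried one at a time.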

Here are my results:

Error: Unable to obtain the proper restart command to restart from the 
       checkpoint file (opal_snapshot_0.ckpt). Returned -1.
       Check the installation of the none checkpoint/restart service
       on all of the machines in your system.
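
The reference to the "none" service in this message may mean that the checkpoint was taken without a BLCR-enabled checkpoint/restart (crs) component, although that is only a guess from the wording. One way to check is to look for a blcr crs entry in the ompi_info output. The helper below is a sketch: has_blcr_crs is a hypothetical name, and the pattern assumes ompi_info prints component lines of the form "MCA crs: blcr (...)".

```shell
# has_blcr_crs is a hypothetical helper: it reads ompi_info output on stdin
# and succeeds only if a crs component named blcr is listed.
# Assumes lines of the form "MCA crs: blcr (...)".
has_blcr_crs() {
    grep -E 'crs:[[:space:]]*blcr' >/dev/null
}

# Example: ompi_info | has_blcr_crs && echo "blcr crs component found"
```

If the check fails, this Open MPI build may not include BLCR support at all.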

If I use the BLCR utilities directly (cr_run, cr_checkpoint, cr_restart), it works on the local machine, but not on more than one machine.

Please help me with this.

Thank you.

