Our checkpoint/restart (c/r) support is unfortunately deprecated, because the person who wrote and supported it is no longer available. I'm afraid we can't really help with it, and c/r support will not be included in future releases unless someone becomes available to support it again.


On Jan 13, 2013, at 4:37 AM, Jerry Mersel <jerry.mersel@weizmann.ac.il> wrote:


Checkpointing and restarting Open MPI applications doesn't work for me.

I have a Red Hat 5U6 system with BLCR checkpointing version 0.8.4
and Open MPI version 1.6.3.

I have a simple parallel application that I want to checkpoint and restart.

I see that the blcr modules are loaded (with lsmod).

I run:
mpirun  -np 1 -hostfile hostfile -am ft-enable-cr  EXECUTABLE
ompi-checkpoint -v -s <PID of mpirun>

Then I kill mpirun.

Then:
ompi-restart -v ompi_global_snapshot_<PID>.ckpt
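
The steps above can be sketched as a small shell script. This is only a rough sketch that reuses the commands exactly as quoted; the function names `snapshot_ref` and `cr_cycle` and the 10-second delay are assumptions made for illustration, and it presumes `mpirun`, `ompi-checkpoint` and `ompi-restart` are on the PATH and that a file named `hostfile` exists in the current directory.

```shell
#!/bin/sh
# Sketch only: wraps the commands quoted above; not an official recipe.

# Build the global snapshot reference that is passed to ompi-restart above.
snapshot_ref() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

# Hypothetical helper: one checkpoint / kill / restart cycle for executable $1.
cr_cycle() {
    exe=$1
    # Start the job in the background with checkpoint/restart enabled.
    mpirun -np 1 -hostfile hostfile -am ft-enable-cr "$exe" &
    pid=$!                      # PID of mpirun
    sleep 10                    # arbitrary delay (assumption) so the job is running
    # Checkpoint the running job; stop here if the checkpoint fails.
    ompi-checkpoint -v -s "$pid" || return 1
    # Kill mpirun and reap it before restarting.
    kill "$pid"
    wait "$pid" 2>/dev/null || true
    # Restart from the global snapshot named after mpirun's PID.
    ompi-restart -v "$(snapshot_ref "$pid")"
}
```

Calling `cr_cycle EXECUTABLE` would run the same sequence in one go, which can make it easier to reproduce the failure consistently.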

Here are my results:

Error: Unable to obtain the proper restart command to restart from the 
       checkpoint file (opal_snapshot_0.ckpt). Returned -1.
       Check the installation of the none checkpoint/restart service
       on all of the machines in your system.
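
The message names "none" as the checkpoint/restart service, which may mean that the snapshot, or the Open MPI installation itself, does not reference the BLCR component. One possible check, sketched below, is to look at which crs components `ompi_info` reports. The helper names `crs_components` and `has_blcr_crs` are hypothetical, and the sketch assumes `ompi_info` prints component lines of the form `MCA crs: <name> (...)`; adjust the pattern if the output on your system looks different.

```shell
#!/bin/sh
# Sketch only: the assumed line format is "MCA crs: <name> (...)".

# Print the crs component names found in ompi_info-style text on stdin.
crs_components() {
    sed -n 's/^[[:space:]]*MCA crs:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Succeed only if a crs component named exactly "blcr" is listed on stdin.
has_blcr_crs() {
    crs_components | grep -qx 'blcr'
}
```

It could be used as `ompi_info | has_blcr_crs && echo "blcr crs component found"`, run on each machine in the hostfile. If only "none" is listed, the installation was probably built without BLCR support.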



If I use the BLCR utilities (cr_run, cr_checkpoint, cr_restart) directly, it works on the local machine, but it doesn't work on more than one machine.

Please help me with this.

Thank you.





With Blessings, always,

   Jerry Mersel

   System Administrator
   IT Infrastructure Branch | Division of Information Systems
    Weizmann Institute of Science
    Rehovot 76100, Israel
  
   Tel:  +972-8-9342363
   
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users