Open MPI User's Mailing List Archives

Subject: [OMPI users] restarting checkpointed applications
From: Jerry Mersel (jerry.mersel_at_[hidden])
Date: 2013-01-13 07:37:27


Checkpointing and restarting Open MPI applications does not work for me.

I have a Red Hat 5U6 system with BLCR version 0.8.4
and Open MPI version 1.6.3.

I have a simple parallel application that I want to checkpoint and restart.

I see that the blcr modules are loaded (with lsmod).

I run:
mpirun -np 1 -hostfile hostfile -am ft-enable-cr EXECUTABLE
ompi-checkpoint -v -s <PID of mpirun>

Then I kill mpirun.

Then I run:
ompi-restart -v ompi_global_snapshot_<PID>.ckpt
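
The sequence above can be sketched as one small shell script. This is only an illustration of the steps as described, not a tested recipe: the helper names snapshot_name and checkpoint_and_restart are hypothetical, the 5-second delay is an assumption, and EXECUTABLE and hostfile are the same placeholders used above.

```shell
#!/bin/sh
# Sketch of the checkpoint/restart sequence described in this post.
# Helper names are hypothetical; only the mpirun, ompi-checkpoint and
# ompi-restart invocations come from the commands listed above.

# Build the global snapshot name from the PID of mpirun.
snapshot_name() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

# Start the job, checkpoint it, kill mpirun, then restart from the snapshot.
# $1 is the path to the MPI executable.
checkpoint_and_restart() {
    mpirun -np 1 -hostfile hostfile -am ft-enable-cr "$1" &
    mpirun_pid=$!
    sleep 5    # assumed delay so the job is running before the checkpoint
    ompi-checkpoint -v -s "$mpirun_pid" || return 1
    kill "$mpirun_pid"
    wait "$mpirun_pid" 2>/dev/null
    ompi-restart -v "$(snapshot_name "$mpirun_pid")"
}
```

After sourcing the script, the whole sequence would be run as `checkpoint_and_restart ./EXECUTABLE`.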

Here are my results:

Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_0.ckpt). Returned -1.
       Check the installation of the none checkpoint/restart service
       on all of the machines in your system.

If I use the BLCR utilities directly (cr_run, cr_checkpoint, cr_restart), it works on the local machine, but not on more than one machine.

Please help me with this.

Thank you.

With Blessings, always,

   Jerry Mersel

   System Administrator
   IT Infrastructure Branch | Division of Information Systems
    Weizmann Institute of Science
    Rehovot 76100, Israel

   Tel: +972-8-9342363