Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] restarting checkpointed applications
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-01-13 11:27:58


Our checkpoint/restart (c/r) support is unfortunately deprecated because we lost the person who wrote and supported it. I'm afraid we can't really help with it, and c/r support will not be included in future releases unless someone becomes available to support it again.

On Jan 13, 2013, at 4:37 AM, Jerry Mersel <jerry.mersel_at_[hidden]> wrote:

>
> Checkpointing and restarting Open MPI applications doesn't work for me.
>
> I have a Red Hat 5U6 system with BLCR checkpointing version 0.8.4
> and Open MPI version 1.6.3.
>
> I have a simple parallel application that I want to checkpoint and restart.
>
> I see that the blcr modules are loaded (with lsmod).
>
> I run:
> mpirun -np 1 -hostfile hostfile -am ft-enable-cr EXECUTABLE
> ompi-checkpoint -v -s <PID of mpirun>
>
> Then I kill mpirun.
>
> Then:
> ompi-restart -v ompi_global_snapshot_<PID>.ckpt
>
> Here are my results:
>
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_0.ckpt). Returned -1.
> Check the installation of the none checkpoint/restart service
> on all of the machines in your system.
>
> If I use the BLCR utilities directly (cr_run, cr_checkpoint, cr_restart), it works on the local machine, but not on more than one machine.
>
> Please help me with this.
>
> Thank you.
>
> With Blessings, always,
>
> Jerry Mersel
>
> System Administrator
> IT Infrastructure Branch | Division of Information Systems
> Weizmann Institute of Science
> Rehovot 76100, Israel
>
> Tel: +972-8-9342363
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
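
For reference, the sequence described in the quoted message can be laid out as a small shell sketch. It only prints the three commands in order rather than running them. The PID `12345` is a made-up value for illustration, `EXECUTABLE` and `hostfile` are the placeholders used in the message, and the helper names `snapshot_name` and `print_cr_steps` are hypothetical. Nothing here goes beyond the flags already shown in the message, and given the deprecation noted in the reply it should not be read as a supported procedure.

```shell
#!/bin/sh
# Sketch only: prints the commands from the quoted message, does not run them.
# EXECUTABLE, hostfile and the PID are placeholders, not real values.

# Build the global snapshot name following the pattern
# ompi_global_snapshot_<PID>.ckpt that appears in the message.
snapshot_name() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

# Print the launch, checkpoint and restart steps for a given mpirun PID.
print_cr_steps() {
    pid="$1"
    echo "mpirun -np 1 -hostfile hostfile -am ft-enable-cr EXECUTABLE"
    echo "ompi-checkpoint -v -s $pid"
    echo "ompi-restart -v $(snapshot_name "$pid")"
}

# Example with a made-up PID.
print_cr_steps 12345
```

Printing the commands first makes it easy to confirm that the snapshot name passed to ompi-restart matches the PID given to ompi-checkpoint before anything is actually run.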