Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing a restarted app fails
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-09-17 07:27:30


On Sep 16, 2008, at 11:18 PM, Matthias Hovestadt wrote:

> Hi!
>
> Since I am interested in fault tolerance, checkpointing and
> restart of OMPI is an interesting feature for me. So I installed
> BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI
> I followed the instructions in the "Fault Tolerance Guide"
> in the OMPI wiki:
>
> ./autogen.sh
> ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
> make -s
>
> This gave me an OMPI version with checkpointing support, so I
> started testing. The good news is: I am able to checkpoint and
> restart applications. The bad news is: checkpointing a restarted
> application fails.
>
> In detail:
>
> 1) Starting the application
>
> ccs_at_grid-demo-1:~$ ompi-clean
> ccs_at_grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml
>
> This starts my MPI-enabled application without any problems.
>
>
> 2) Checkpointing the application
>
> First I queried the PID of the mpirun process:
>
> ccs_at_grid-demo-1:~$ ps auxww | grep mpirun
> ccs 13897 0.4 0.2 63992 2704 pts/0 S+ 04:59 0:00
> mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml
>
> Then I checkpointed the job, terminating it at the same time:
>
> ccs_at_grid-demo-1:~$ ompi-checkpoint --term 13897
> Snapshot Ref.: 0 ompi_global_snapshot_13897.ckpt
> ccs_at_grid-demo-1:~$
>
> The application indeed terminated:
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 13898 on node
> grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> 2 total processes killed (some possibly by mpirun during cleanup)
>
> The checkpoint command generated a checkpoint dataset
> 367MB in size:
>
> ccs_at_grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
> 367M ompi_global_snapshot_13897.ckpt/
> ccs_at_grid-demo-1:~$
>
>
>
> 3) Restarting the application
>
> To restart the application, I first executed ompi-clean and
> then restarted the job, preloading all files:
>
> ccs_at_grid-demo-1:~$ ompi-clean
> ccs_at_grid-demo-1:~$ ompi-restart --preload ompi_global_snapshot_13897.ckpt/
>
> Restarting works fine. The job restarts from the
> checkpointed state and continues to execute. If not interrupted,
> it runs to completion and returns a correct result.
>
> However, I observed one weird thing: restarting the application
> seems to have changed the checkpoint dataset. Moreover, two new
> directories were created at restart time:
>
> 4 drwx------ 3 ccs ccs 4096 Sep 17 05:09 ompi_global_snapshot_13897.ckpt
> 4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_0.ckpt
> 4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_1.ckpt
>
>

The 'opal_snapshot_*.ckpt' directories are an artifact of the
--preload option. This option copies the individual per-process
checkpoints to the remote machine(s) before executing the restart.
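
For example, you can confirm that these are just the staged per-rank
copies by comparing their sizes with the global snapshot you restarted
from (a rough sketch using the names from your output; your snapshot
reference will differ):

   # per-rank directories staged in $HOME by --preload
   ls -ld opal_snapshot_*.ckpt
   du -sh opal_snapshot_*.ckpt

   # the global snapshot directory they were taken from
   du -sh ompi_global_snapshot_13897.ckpt/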

>
> 4) Checkpointing again
>
> Again I first looked for the PID of the running mpirun process:
>
> ccs_at_grid-demo-1:~$ ps auxww | grep mpirun
> ccs 14005 0.0 0.2 63992 2736 pts/1 S+ 05:09 0:00
> mpirun -am ft-enable-cr --app /home/ccs/ompi_global_snapshot_13897.ckpt/restart-appfile
>
>
> Then I checkpointed it:
>
> ccs_at_grid-demo-1:~$ ompi-checkpoint 14005
>
>
> When I execute this checkpoint command, the running application
> aborts immediately, even though I did not specify the "--term" option:
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 14050 on node
> grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
> --------------------------------------------------------------------------
> ccs_at_grid-demo-1:~$

Interesting. This looks like a bug with the restart mechanism in Open
MPI. This was working fine, but something must have changed in the
trunk to break it.

A useful piece of debugging information for me would be a stack trace
from the failed process. You should be able to get this from a core
file it left, or by setting the following MCA variable in
$HOME/.openmpi/mca-params.conf:
   opal_cr_debug_sigpipe=1
This will cause the Open MPI app to wait in a sleep loop when it
detects a SIGPIPE (Broken pipe) signal. You should then be able to
attach a debugger and retrieve a stack trace.
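
Concretely, the steps would look something like this (just a sketch;
substitute the PID of whichever process actually received the SIGPIPE,
e.g. 14050 from the output you posted):

   # enable the debug hook (remove this line again when you are done)
   echo "opal_cr_debug_sigpipe=1" >> $HOME/.openmpi/mca-params.conf

   # reproduce the failure, then attach to the process sitting in the
   # sleep loop and grab a backtrace from every thread
   gdb -p 14050
   (gdb) thread apply all bt
   (gdb) detach
   (gdb) quit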

>
>
> The "ompi-checkpoint 14005" command however does not return.
>
>
>
> Is anybody here using checkpoint/restart capabilities of OMPI?
> Did anybody encounter similar problems? Or is there something
> wrong about my way of using ompi-checkpoint/ompi-restart?

I work with the checkpoint/restart functionality on a daily basis,
but I must admit that I haven't worked on the trunk in a few weeks.
I'll take a look and let you know what I find. I suspect that Open
MPI is not resetting properly after a checkpoint.
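
In the meantime, one data point that would help narrow this down
(just a suggestion, reusing the commands you already ran): check
whether a second checkpoint also fails for a job that was started
normally, i.e. without ompi-restart:

   mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml &
   ompi-checkpoint <PID of mpirun>   # first checkpoint, without --term
   ompi-checkpoint <PID of mpirun>   # second checkpoint of the same job

If the second checkpoint succeeds there but fails only after a
restart, that points squarely at the restart path.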

>
>
> Any hint is greatly appreciated! :-)
>
>
>
> Best,
> Matthias
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users