Open MPI User's Mailing List Archives

Subject: [OMPI users] Checkpointing a restarted app fails
From: Matthias Hovestadt (maho_at_[hidden])
Date: 2008-09-16 23:18:20


Hi!

Since I am interested in fault tolerance, checkpointing and
restart of OMPI is an interesting feature for me. So I installed
BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI
I followed the instructions in the "Fault Tolerance Guide"
in the OMPI wiki:

./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s

This gave me an OMPI version with checkpointing support, so I
started testing. The good news is: I am able to checkpoint and
restart applications. The bad news is: checkpointing a restarted
application fails.

In detail:

1) Starting the application

ccs_at_grid-demo-1:~$ ompi-clean
ccs_at_grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml

This starts my MPI-enabled application without any problems.

2) Checkpointing the application

First I queried the PID of the mpirun process:

ccs_at_grid-demo-1:~$ ps auxww | grep mpirun
ccs 13897 0.4 0.2 63992 2704 pts/0 S+ 04:59 0:00 mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml

Then I checkpointed the job, terminating it directly:

ccs_at_grid-demo-1:~$ ompi-checkpoint --term 13897
Snapshot Ref.: 0 ompi_global_snapshot_13897.ckpt
ccs_at_grid-demo-1:~$
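
For scripting the restart step, the snapshot name can be pulled out of the ompi-checkpoint output. This is only a minimal sketch: it assumes the output always contains a line laid out like the one above ("Snapshot Ref.: <sequence> <name>"), and the helper name snapshot_ref is hypothetical, not part of OMPI.

```shell
#!/bin/sh
# snapshot_ref: read ompi-checkpoint output on stdin and print the
# snapshot name taken from the "Snapshot Ref.:" line.
# Assumes the layout "Snapshot Ref.: <sequence> <name>" seen above;
# prints nothing if no such line is present.
snapshot_ref() {
    awk '$1 == "Snapshot" && $2 == "Ref.:" { print $4; exit }'
}
```

It could be used along the lines of snap=$(ompi-checkpoint --term 13897 | snapshot_ref), after which "$snap" would be handed to ompi-restart.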

The application indeed terminated:
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13898 on node
grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)

The checkpoint command generated a checkpoint dataset
of 367 MB:

ccs_at_grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
367M ompi_global_snapshot_13897.ckpt/
ccs_at_grid-demo-1:~$

3) Restarting the application

To restart the application, I first ran ompi-clean and
then restarted the job, preloading all files:

ccs_at_grid-demo-1:~$ ompi-clean
ccs_at_grid-demo-1:~$ ompi-restart --preload ompi_global_snapshot_13897.ckpt/

Restarting works fine. The job restarts from the
checkpointed state and continues to execute. If not interrupted,
it runs to completion and returns a correct result.

However, I observed one odd thing: restarting the application
seems to have changed the checkpoint dataset. Moreover, two new
directories were created at restart time:

   4 drwx------ 3 ccs ccs 4096 Sep 17 05:09 ompi_global_snapshot_13897.ckpt
   4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_0.ckpt
   4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_1.ckpt
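
Whether the restart really modifies the checkpoint dataset can be checked by recording a checksum manifest before and after ompi-restart. The following is a minimal sketch using only find, sort and cksum; the helper name snapshot_manifest and the manifest file names are hypothetical.

```shell
#!/bin/sh
# snapshot_manifest DIR: print one "checksum size path" line for every
# regular file below DIR, sorted by path, so that two manifests taken
# at different times can be compared with diff.
snapshot_manifest() {
    (cd "$1" && find . -type f -exec cksum {} + | sort -k 3)
}

# Intended use (not executed here):
#   snapshot_manifest ompi_global_snapshot_13897.ckpt > before.txt
#   ... run ompi-restart ...
#   snapshot_manifest ompi_global_snapshot_13897.ckpt > after.txt
#   diff before.txt after.txt
```

Any line that differs between the two manifests points at a file that was rewritten, added or removed during the restart.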

4) Checkpointing again

Again I first looked for the PID of the running mpirun process:

ccs_at_grid-demo-1:~$ ps auxww | grep mpirun
ccs 14005 0.0 0.2 63992 2736 pts/1 S+ 05:09 0:00 mpirun -am ft-enable-cr --app /home/ccs/ompi_global_snapshot_13897.ckpt/restart-appfile

Then I checkpointed it:

ccs_at_grid-demo-1:~$ ompi-checkpoint 14005

When I execute this checkpoint command, the running application
aborts immediately, even though I did not specify the "--term" option:

--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 14050 on node
grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
--------------------------------------------------------------------------
ccs_at_grid-demo-1:~$

The "ompi-checkpoint 14005" command, however, never returns.
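
A hanging ompi-checkpoint can at least be kept from blocking a test script by putting a deadline on it. This is a minimal sketch, assuming the timeout utility from GNU coreutils is installed; the helper name run_with_deadline and the 60-second limit in the usage comment are arbitrary choices, and the wrapper does nothing to fix the underlying problem.

```shell
#!/bin/sh
# run_with_deadline SECONDS CMD [ARGS...]: run CMD and send it SIGTERM
# if it is still running after SECONDS.
# Returns CMD's own exit status, or 124 if the deadline was hit
# (124 is the status GNU timeout reports on expiry).
run_with_deadline() {
    secs=$1
    shift
    timeout "$secs" "$@"
}

# Intended use (not executed here):
#   run_with_deadline 60 ompi-checkpoint 14005 || echo "status $?"
```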

Is anybody here using the checkpoint/restart capabilities of OMPI?
Did anybody encounter similar problems? Or is there something
wrong with the way I am using ompi-checkpoint/ompi-restart?

Any hint is greatly appreciated! :-)

Best,
Matthias