
Open MPI User's Mailing List Archives


Subject: [OMPI users] High Checkpoint Overhead Ratio
From: 陈文浩 (whchen_at_[hidden])
Date: 2010-08-30 23:08:47

Dear OMPI Users,


I'm now using BLCR-0.8.2 and OpenMPI-1.5rc5. The problem is that it takes a
very long time to checkpoint.


BLCR configuration:

./configure --prefix=/opt/blcr --enable-static

Open MPI configuration:

./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr \
    --enable-static --enable-ft-thread --enable-mpi-threads


Our blades use NFS. $HOME and /opt are shared.


In $HOME/.openmpi/mca-params.conf:






Now I run CG NPB (NPROCS=16, CLASS=C) on two nodes (blade02, blade04).

Without a checkpoint, 'Time in seconds' is about 100 s, which is normal.

But when I take a single checkpoint, 'Time in seconds' goes up to 300 s. The
overhead ratio is over 200%! Why? How can I reduce it?
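
To make the figure above concrete, here is a minimal sketch of how the overhead
ratio is computed, assuming it means the extra run time divided by the
checkpoint-free run time. The function name overhead_ratio is only illustrative,
and the inputs are the rounded timings quoted above, not exact measurements.

```python
def overhead_ratio(baseline_s: float, checkpointed_s: float) -> float:
    """Extra run time caused by checkpointing, relative to the baseline run.

    Assumed definition: (T_with_checkpoint - T_baseline) / T_baseline.
    """
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (checkpointed_s - baseline_s) / baseline_s


# Rounded timings from the CG class C runs described above (seconds).
baseline = 100.0   # no checkpoint taken
with_ckpt = 300.0  # one checkpoint taken during the run

ratio = overhead_ratio(baseline, with_ckpt)
print(f"overhead ratio: {ratio:.0%}")  # prints "overhead ratio: 200%"
```

With these rounded numbers, the single checkpoint accounts for roughly 200 s of
additional wall-clock time.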


blade02:~> ompi-checkpoint --status 27115

[blade02:27130] [ 0.00 / 0.25] Requested - ...

[blade02:27130] [ 0.00 / 0.25] Pending - ...

[blade02:27130] [ 0.21 / 0.46] Running - ...

[blade02:27130] [221.25 / 221.71] Finished -

Snapshot Ref.: 0 ompi_global_snapshot_27115.ckpt


As you can see, it takes 200+ seconds to checkpoint. By the way, what do the
first and second numbers inside [ / ] represent?