Dear OMPI Users,


I'm now using BLCR-0.8.2 and OpenMPI-1.5rc5. The problem is that taking a checkpoint takes a very long time.


BLCR configuration:

./configure --prefix=/opt/blcr --enable-static

Open MPI configuration:

./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr --enable-static --enable-ft-thread --enable-mpi-threads


Our blades use NFS. $HOME and /opt are shared.


In $HOME/.openmpi/mca-params.conf:






I run the NPB CG benchmark (NPROCS=16, CLASS=C) on two nodes (blade02, blade04).

Without a checkpoint, 'Time in seconds' is about 100s, which is normal.

But when I take a single checkpoint, 'Time in seconds' rises to about 300s. That is an overhead of around 200%! Why is it so high, and how can I reduce it?
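To be explicit about how I get that overhead figure, here is the arithmetic as a quick shell calculation. It uses only the rounded timings quoted above; the variable names are mine and nothing here is specific to Open MPI or BLCR.

```shell
# Rounded timings from the two runs described above (seconds).
baseline=100     # CG, CLASS=C, 16 processes, no checkpoint
with_ckpt=300    # same run with a single checkpoint taken

# Extra run time relative to the baseline, as a percentage.
overhead=$(( (with_ckpt - baseline) * 100 / baseline ))
echo "checkpoint overhead: ${overhead}%"
```

So a single checkpoint adds roughly twice the run time of the whole benchmark.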


blade02:~> ompi-checkpoint --status 27115
[blade02:27130] [  0.00 /   0.25]                 Requested - ...
[blade02:27130] [  0.00 /   0.25]                   Pending - ...
[blade02:27130] [  0.21 /   0.46]                   Running - ...
[blade02:27130] [221.25 / 221.71]                  Finished - ompi_global_snapshot_27115.ckpt
Snapshot Ref.:   0 ompi_global_snapshot_27115.ckpt


As you can see, it takes 200+ seconds to checkpoint. By the way, what do the first and second numbers in [ , ] represent?