Dear OMPI Users,
I'm now using BLCR-0.8.2 and OpenMPI-1.5rc5. The problem is that it takes a
very long time to checkpoint.
BLCR configuration:
./configure --prefix=/opt/blcr --enable-static
OpenMPI configuration:
./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr \
    --enable-static --enable-ft-thread --enable-mpi-threads
Our blades use NFS. $HOME and /opt are shared.
In $HOME/.openmpi/mca-params.conf:
crs_base_snapshot_dir=/tmp/
snapc_base_global_snapshot_dir=/home/chenwh
snapc_base_store_in_place=0
Now I run CG NPB (NPROCS=16, CLASS=C) on two nodes (blade02, blade04).
With no checkpoint, 'Time in seconds' is about 100s. It's normal.
But when I take a single checkpoint, 'Time in seconds' goes up to 300s. The
overhead ratio is over 200%! Why? How can I improve it?
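For reference, here is how I compute that ratio, as a small Python sketch. The 100 and 300 are my rounded timings from above (assumed values, not exact measurements), and the function name is just something I made up for illustration. With these rounded figures it comes out to 200%.

```python
def overhead_ratio(baseline_s: float, with_ckpt_s: float) -> float:
    """Extra run time caused by checkpointing, relative to the baseline run."""
    return (with_ckpt_s - baseline_s) / baseline_s


# Rounded 'Time in seconds' values reported by NPB CG (assumed, in seconds).
baseline_s = 100.0   # no checkpoint taken
with_ckpt_s = 300.0  # one checkpoint taken during the run

extra_s = with_ckpt_s - baseline_s
ratio = overhead_ratio(baseline_s, with_ckpt_s)

print(f"extra time: {extra_s:.0f} s")
print(f"overhead ratio: {ratio:.0%}")
```

The extra 200s is close to the 221s that ompi-checkpoint reports below, so nearly all of the slowdown seems to be the checkpoint itself.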
blade02:~> ompi-checkpoint --status 27115
[blade02:27130] [ 0.00 / 0.25] Requested - ...
[blade02:27130] [ 0.00 / 0.25] Pending - ...
[blade02:27130] [ 0.21 / 0.46] Running - ...
[blade02:27130] [221.25 / 221.71] Finished - ompi_global_snapshot_27115.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_27115.ckpt
As you can see, it takes 200+ seconds to checkpoint. By the way, what do the
first and second numbers in [ , ] represent?
Regards
Whchen