Open MPI User's Mailing List Archives

Subject: [OMPI users] Checkpoint problem with BLCR + OpenMPI
From: 陈文浩 (whchen_at_[hidden])
Date: 2010-08-27 03:52:01

Dear OMPI Users,


I have installed BLCR (0.8.2) and Open MPI (1.4.2) successfully, but I run
into a problem when I take a checkpoint.

I am running the NPB CG benchmark (NPROCS=16, CLASS=C) on two nodes, blade02
and blade04. $HOME and /opt are shared over NFS.


BLCR configure: ./configure --prefix=/opt/blcr --enable-static

Open MPI configure: ./configure --prefix=/opt/ompi --with-ft=cr
--with-blcr=/opt/blcr --enable-static

(I didn't add the '--enable-ft-thread' option because I think it might hurt
performance. Is that right? MPI threads are enabled by default, so I didn't
add the '--enable-mpi-threads' option either.) Can anyone tell me whether
these two options make the checkpoint time shorter or longer?

Our blades use NFS; $HOME and /opt are shared. The checkpoint file is
created in the $HOME directory by default. Could that cause the long
checkpoint time?
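
One way to check whether the shared $HOME is the slow part is to time a plain write there and compare it with a node-local directory. The helper below is only a rough sketch: `write_test` is a hypothetical name, not part of Open MPI or BLCR, and it assumes `dd`, `sync` and `date +%s` are available on the blades.

```shell
# Hypothetical helper (not part of Open MPI or BLCR): time how long it takes
# to write and flush a test file of SIZE_MIB mebibytes into directory DIR.
# Usage: write_test DIR [SIZE_MIB]
write_test() {
    dir=$1
    size_mib=${2:-64}
    file="$dir/ckpt_write_test.$$"
    start=$(date +%s)
    # Write zeros, then sync so buffered data is pushed out to storage.
    dd if=/dev/zero of="$file" bs=1048576 count="$size_mib" 2>/dev/null && sync
    end=$(date +%s)
    # Remove the test file and report the elapsed whole seconds.
    rm -f "$file"
    echo "$dir: $((end - start)) s"
}
```

Running `write_test "$HOME" 512` and `write_test /tmp 512` on the same blade gives two numbers to compare, assuming /tmp is on a local disk. The timer has one-second resolution, so small sizes may report 0 s.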


In $HOME/.openmpi/mca-params.conf:





Then in mpirun terminal:

mpirun -machinefile mf -am ft-enable-cr -n 8 ./cg.C.8


In checkpoint terminal:

ompi-checkpoint --status 11133

[blade02:11171] Requested - Global Snapshot Reference:

[blade02:11171] Pending - Global Snapshot Reference:

[blade02:11171] Running - Global Snapshot Reference:

[blade02:11171] File Transfer - Global Snapshot Reference:


In mpirun terminal:


WARNING: Could not preload specified file: File already exists.


Fileset: $HOME/ompi-cr-file/ompi_global_snapshot_11133.ckpt/0

Host: blade02


Will continue attempting to launch the process.



[blade02:11133] 3 more processes have sent help message help-orte-filem-rsh.txt / orte-filem-rsh:get-file-exists

[blade02:11133] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


How can I disable the 'preload', and how can I solve this problem? Thanks.


Btw, when there is no mca-params.conf and the checkpoint file is placed in
the $HOME directory by default, I can checkpoint successfully. But it takes
a very long time: without a checkpoint CG runs in about 100 s, but with a
checkpoint it takes 300 s, a 200% overhead. Why?
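
The 200% figure follows directly from the two runtimes quoted above. A minimal shell sketch of the arithmetic (the variable names are illustrative only):

```shell
# Runtimes reported above, in seconds.
baseline=100       # CG without a checkpoint
checkpointed=300   # CG with a checkpoint taken

# Relative overhead = (checkpointed - baseline) / baseline, as a percentage.
overhead=$(( (checkpointed - baseline) * 100 / baseline ))
echo "checkpoint overhead: ${overhead}%"
```

This prints `checkpoint overhead: 200%`: the checkpoint adds about 200 s on top of a 100 s run.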