Dear OMPI Users,

 

I have successfully installed BLCR (0.8.2) and Open MPI (1.4.2), but I run into a problem when I take a checkpoint.

I am running the NPB CG benchmark (NPROCS=16, CLASS=C) on two nodes, blade02 and blade04. $HOME and /opt are shared over NFS.

 

BLCR configure: ./configure --prefix=/opt/blcr --enable-static

Open MPI configure: ./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr --enable-static

I didn't add the "--enable-ft-thread" option because I think it might hurt performance. Is that right? And since MPI threads are enabled by default, I didn't add the "--enable-mpi-threads" option either. Can anyone tell me whether these two options make the checkpoint time shorter or longer?

Our blades use NFS: $HOME and /opt are shared, and the checkpoint files are created in $HOME by default. Could that be the cause of the long checkpoint time?
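One way to check whether NFS is the bottleneck would be to compare raw write times for a node-local directory and the NFS-mounted home. The following is only a rough sketch, not something I have measured: time_write is a helper name made up for illustration, the default size of 256 MB is arbitrary, and the bs=1M syntax assumes GNU dd.

```shell
# time_write is a hypothetical helper (not part of Open MPI or BLCR):
# write SIZE_MB of zeros into DIR, flush to storage, print elapsed whole seconds.
time_write() {
    dir=$1
    size_mb=${2:-256}                 # arbitrary default size in MB
    f="$dir/ckpt_write_test.$$"       # scratch file, removed afterwards
    start=$(date +%s)
    dd if=/dev/zero of="$f" bs=1M count="$size_mb" 2>/dev/null
    sync                              # make sure the data has left the page cache
    end=$(date +%s)
    rm -f "$f"
    echo $(( end - start ))
}

# Usage: compare node-local storage with the NFS-mounted home.
# time_write /tmp
# time_write "$HOME"
```

If the $HOME figure is much larger than the /tmp figure, that would suggest the shared file system accounts for much of the checkpoint time.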

 

In $HOME/.openmpi/mca-params.conf:

crs_base_snapshot_dir=/tmp/

snapc_base_global_snapshot_dir=$HOME/ompi-cr-file

snapc_base_store_in_place=0

 

Then, in the mpirun terminal:

mpirun -machinefile mf -am ft-enable-cr -n 8 ./cg.C.8

 

In the checkpoint terminal:

ompi-checkpoint --status 11133

[blade02:11171]                 Requested - Global Snapshot Reference: (null)

[blade02:11171]                   Pending - Global Snapshot Reference: (null)

[blade02:11171]                   Running - Global Snapshot Reference: (null)

[blade02:11171]             File Transfer - Global Snapshot Reference: (null)

 

In the mpirun terminal:

--------------------------------------------------------------------------

WARNING: Could not preload specified file: File already exists.

 

Fileset: $HOME/ompi-cr-file/ompi_global_snapshot_11133.ckpt/0

Host: blade02

 

Will continue attempting to launch the process.

 

--------------------------------------------------------------------------

[blade02:11133] 3 more processes have sent help message help-orte-filem-rsh.txt / orte-filem-rsh:get-file-exists

[blade02:11133] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

 

How can I disable the "preload", and how can I solve this problem? Thanks.

 

By the way, when there is no mca-params.conf and the checkpoint files are placed in $HOME by default, I can checkpoint successfully. However, the checkpoint takes a very long time: without a checkpoint CG runs for about 100 s, but with a checkpoint it runs for 300 s, a 200% overhead. Why?
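For clarity, the 200% figure is the extra runtime divided by the baseline runtime. A minimal shell sketch using the rounded timings above (overhead_pct is just an illustrative name, and the arithmetic is integer-only):

```shell
# Relative overhead in percent: (with_ckpt - baseline) * 100 / baseline
# overhead_pct is a hypothetical helper name, not an Open MPI tool.
overhead_pct() {
    baseline=$1      # runtime in seconds without a checkpoint
    with_ckpt=$2     # runtime in seconds with a checkpoint
    echo $(( (with_ckpt - baseline) * 100 / baseline ))
}

overhead_pct 100 300   # prints 200
```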

 

Regards

 

Whchen