On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian <fernyabc_at_[hidden]> wrote:
> I set up a cluster of 18 nodes using Open MPI and the BLCR library, and the MPI
> program runs well on the cluster,
> but how do I checkpoint the MPI program on this cluster?
> for example:
> here is what I do for a test:
> mpiu@nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr
> the program runs on the cluster
> then I enter:
> mpiu@nimbus: /mirror$ ompi-checkpoint -term $(pidof mpirun)
> but the MPI program is not terminated as it was on a single
> machine, although it created a checkpoint file,
> ompi_global_snapshot_14030.ckpt, in the home directory on the master node.
Are you using Open MPI 1.4 without a shared file system mounted at
$HOME? If yes, then take a look here: