Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
From: Fernando Lemos (fernandotcl_at_[hidden])
Date: 2010-03-23 11:23:50


On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian <fernyabc_at_[hidden]> wrote:
> I set up a cluster of 18 nodes using Open MPI and the BLCR library, and the MPI
> program runs well on the cluster.
> But how do I checkpoint the MPI program on this cluster?
> For example, here is what I do for a test:
> mpiu_at_nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi
> The program then runs on the cluster.
> Then I enter:
> mpiu_at_nimbus: /mirror$ ompi-checkpoint -term $(pidof mpirun)
>
> But the MPI program is not terminated as it was on a single
> machine, although it created a checkpoint file "ompi_global_snapshot_14030.ckpt"
> in the home directory on the master node.

Are you using Open MPI 1.4 without a shared file system mounted at
$HOME? If so, take a look here:

http://www.open-mpi.org/community/lists/users/2010/03/12246.php
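One quick way to answer that question is to look at the filesystem type backing $HOME on each node. The following is only a minimal sketch, assuming GNU coreutils `stat` is installed; the helper name `is_network_fs` and its list of filesystem types are illustrative assumptions, not part of Open MPI or BLCR.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given filesystem type name
# looks like a network/shared filesystem. The list is illustrative.
is_network_fs() {
    case "$1" in
        nfs*|lustre|gpfs|cifs|smb*) return 0 ;;
        *) return 1 ;;
    esac
}

# GNU stat: -f queries the filesystem, %T prints its type by name.
# Falls back to "unknown" if stat is missing or does not support -f -c.
home_fs_type=$(stat -f -c %T "$HOME" 2>/dev/null || echo unknown)

if is_network_fs "$home_fs_type"; then
    echo "$HOME is on $home_fs_type (network filesystem)"
else
    echo "$HOME is on $home_fs_type (probably local to this node)"
fi
```

Run it on every node listed in the hostfile. Seeing the same network filesystem type everywhere does not by itself prove that all nodes mount the same export, so treat the result as a hint rather than a guarantee.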

Regards,