Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
From: Fernando Lemos (fernandotcl_at_[hidden])
Date: 2010-03-23 13:00:02


On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian <fernyabc_at_[hidden]> wrote:
>
> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
> --hostfile .mpihostfile xxxx
> to store the global checkpoint snapshot in the shared directory /mirror,
> but the problems are still there. When I run ompi-checkpoint, mpirun is
> still not killed; it just hangs. When I run ompi-restart, it shows:
>
> mpiu_at_nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
> --------------------------------------------------------------------------
> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because
> either you have not provided a filename
>        or provided an invalid filename.
>        Please see --help for usage.
>
> --------------------------------------------------------------------------

Have you tried Open MPI 1.5? I got checkpoint/restart to work with 1.5,
but not with 1.4 (though I didn't try 1.4 with a shared filesystem).