Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
From: fengguang tian (fernyabc_at_[hidden])
Date: 2010-03-23 11:55:25

I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
--hostfile .mpihostfile xxxx
to store the global checkpoint snapshot in the shared directory /mirror,
but the problems are still there. When I run ompi-checkpoint, mpirun is
still not killed; it just hangs. When I run ompi-restart, it shows:

mpiu_at_nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because
either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.



On Tue, Mar 23, 2010 at 10:34 AM, Fernando Lemos <fernandotcl_at_[hidden]> wrote:

> On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian <fernyabc_at_[hidden]>
> wrote:
> > I have created the shared file system, but I created /mirror in the root
> > directory, not in the $HOME directory. Is that the
> > problem? Thank you.
> Others might be able to give you a more accurate explanation. The way
> I understood it, in OpenMPI 1.4, you need to write all checkpoints to
> a single, shared location. That's why you generally want a shared file
> system.
> Now I'm pretty sure you can change the directory to which the
> checkpoints are written. If your $HOME isn't a shared directory, you
> can point OpenMPI to write the checkpoints to the shared directory
> instead.
> In OpenMPI 1.5 (unstable), some magic allows you to create the
> checkpoints and restore them without a shared directory.
> Regards,