I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile xxxx
to store the global checkpoint snapshot in the shared directory /mirror, but the problems are still there.
When I run ompi-checkpoint, mpirun is still not killed; it just hangs there. When I run ompi-restart, it shows:

mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
--------------------------------------------------------------------------
Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------

cheers
fengguang

On Tue, Mar 23, 2010 at 10:34 AM, Fernando Lemos <fernandotcl@gmail.com> wrote:
On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian <fernyabc@gmail.com> wrote:
> I have created the shared file system, but I created /mirror in the root
> directory, not in the $HOME directory. Is that the
> problem? Thank you

Others might be able to give you a more accurate explanation. The way
I understood it, in OpenMPI 1.4, you need to write all checkpoints to
a single, shared location. That's why you generally want a shared file
system.

Now I'm pretty sure you can change the directory to which the
checkpoints are written. If your $HOME isn't a shared directory, you
can point OpenMPI to write the checkpoints to the shared directory
instead.
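
As a rough sketch of what that could look like: the snapc_base_global_snapshot_dir
MCA parameter takes the target directory as its value, given right after the
parameter name. The directory /mirror, the process count, the hostfile name and the
placeholder application name xxxx are taken from this thread; the assumption here
is that /mirror is mounted at the same path on every node. The snippet only
builds and prints the command so it can be checked before it is run.

```shell
# Assumption: /mirror is the shared directory, visible on all nodes.
SNAPSHOT_DIR=/mirror

# Build the mpirun command line. The MCA parameter name must be
# followed by its value (the snapshot directory).
# "xxxx" stands in for the application, as elsewhere in this thread.
CMD="mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir $SNAPSHOT_DIR --hostfile .mpihostfile xxxx"

# Print the command for inspection; run it afterwards with: eval "$CMD"
echo "$CMD"
```

The same setting can probably also be kept out of the command line by putting
a line such as snapc_base_global_snapshot_dir=/mirror into
$HOME/.openmpi/mca-params.conf, if that file is used on your installation.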

In OpenMPI 1.5 (unstable), some magic allows you to create the
checkpoints and restore them without a shared directory.

Regards,