Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-03-29 12:05:24


On Mar 29, 2010, at 11:53 AM, fengguang tian wrote:

> Hi,
>
> I have used the --term option, but mpirun still hangs. The result is
> the same whether or not I include the trailing '/'. I am installing
> v1.4 to see whether the problems are still there. I tried, but some
> problems are still there.

What configure options did you use when building Open MPI?

>
> BTW, my MPI program reads some input files and generates some output
> files after some computation. It can be checkpointed, but when I
> restart it, an error occurs. Have you met this kind of problem?

Try putting the 'snapc_base_global_snapshot_dir' in the
$HOME/.openmpi/mca-params.conf file instead of just on the command
line. Like:

  snapc_base_global_snapshot_dir=/shared-dir/

I suspect that ompi-restart is looking in the wrong place for your
checkpoint. By default it will search $HOME (since that is the default
for snapc_base_global_snapshot_dir). If you put this parameter in the
mca-params.conf file, then it is always set in any tool
(mpirun/ompi-checkpoint/ompi-restart) to the specified value. So
ompi-restart will search the correct location for the checkpoint files.

-- Josh

>
> cheers
> fengguang
>
> On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey
> <jjhursey_at_open-mpi.org> wrote:
>
> On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
>
> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian
> <fernyabc_at_[hidden]> wrote:
>
> I use mpirun -np 50 -am ft-enable-cr --mca
> snapc_base_global_snapshot_dir
> --hostfile .mpihostfile xxxx
> to store the global checkpoint snapshot in the shared directory
> /mirror, but the problems are still there. When I run
> ompi-checkpoint, mpirun is still not killed; it just hangs there.
>
> So the 'ompi-checkpoint' command does not finish? By default
> 'ompi-checkpoint' does not terminate the MPI job. If you pass the
> '--term' option to it, then it will.
>
>
>
> When I run ompi-restart, it shows:
>
> mpiu_at_nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
> --------------------------------------------------------------------------
> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid
> because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
>
>
> Try removing the trailing '/' in the command. The current
> ompi-restart is not good about differentiating between:
>
> ompi_global_snapshot_333.ckpt
> and
>
> ompi_global_snapshot_333.ckpt/
>
>
> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with
> 1.4 (but then I didn't try 1.4 with a shared filesystem).
>
> I would also suggest trying v1.4 or 1.5 to see if your problems
> persist with these versions.
>
> -- Josh
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users