Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-03-29 11:42:14


On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:

> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian
> <fernyabc_at_[hidden]> wrote:
>>
>> I use mpirun -np 50 -am ft-enable-cr --mca
>> snapc_base_global_snapshot_dir
>> --hostfile .mpihostfile xxxx
>> to store the global checkpoint snapshot in the shared
>> directory /mirror, but the problems are still there.
>> When I run ompi-checkpoint, mpirun is still not killed; it just
>> hangs there.

So the 'ompi-checkpoint' command does not finish? By default,
'ompi-checkpoint' does not terminate the MPI job. If you pass it the
'--term' option, then it will.
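
As a minimal sketch of the '--term' usage described above: ompi-checkpoint
is given the PID of the running mpirun process. The function name
checkpoint_and_terminate and the DRY_RUN variable are hypothetical, added
only so the command can be inspected without a running job.

```shell
# Sketch: checkpoint a running MPI job and terminate it afterwards.
# Assumption: the argument is the PID of the mpirun process.
# checkpoint_and_terminate and DRY_RUN are illustrative names, not part of Open MPI.
checkpoint_and_terminate() {
    mpirun_pid="$1"
    if [ -n "${DRY_RUN:-}" ]; then
        # Dry run: print the command instead of executing it.
        echo "ompi-checkpoint --term $mpirun_pid"
    else
        # Take the checkpoint, then terminate the job.
        ompi-checkpoint --term "$mpirun_pid"
    fi
}

# Example (dry run):
#   DRY_RUN=1 checkpoint_and_terminate 12345
```

Without '--term', the job keeps running after the checkpoint is taken, so
an mpirun that is still alive is expected and is not by itself a hang.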

>> when doing ompi-restart, it shows:
>>
>> mpiu_at_nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
>> --------------------------------------------------------------------------
>> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid
>> because
>> either you have not provided a filename
>> or provided an invalid filename.
>> Please see --help for usage.
>>
>> --------------------------------------------------------------------------
>

Try removing the trailing '/' from the command. The current ompi-restart
does not reliably treat these two forms as the same snapshot reference:
   ompi_global_snapshot_333.ckpt
and
   ompi_global_snapshot_333.ckpt/
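
If the trailing '/' comes from shell tab completion, one possible
workaround is to strip it before calling ompi-restart. This is only a
sketch; the helper name normalize_snapshot_ref is hypothetical.

```shell
# Sketch: remove a single trailing '/' from a snapshot reference.
# normalize_snapshot_ref is an illustrative helper, not part of Open MPI.
normalize_snapshot_ref() {
    # ${1%/} drops one trailing '/' if present and leaves the name otherwise unchanged.
    printf '%s\n' "${1%/}"
}

# Example usage (run from the directory that holds the snapshot):
#   ompi-restart "$(normalize_snapshot_ref ompi_global_snapshot_333.ckpt/)"
```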

> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with
> 1.4 (but then I didn't try 1.4 with a shared filesystem).

I would also suggest trying v1.4 or 1.5 to see if your problems
persist with these versions.

-- Josh
