Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
From: fengguang tian (fernyabc_at_[hidden])
Date: 2010-03-29 11:53:55


Hi,

I have used the --term option, but mpirun still hangs. The result is the
same whether or not I include the trailing '/'. I installed v1.4 to see
whether the problems persist, and some of them are still there.

BTW, my MPI program reads some input files and writes some output files
after some computation. It can be checkpointed, but an error occurs when I
restart it. Have you run into this kind of problem?

cheers
fengguang

On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey <jjhursey_at_[hidden]> wrote:

>
> On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
>
>> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian <fernyabc_at_[hidden]>
>> wrote:
>>
>>>
>>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
>>> --hostfile .mpihostfile xxxx
>>> to store the global checkpoint snapshot into the shared
>>> directory:/mirror,but the problems are still there,
>>> when ompi-checkpoint, the mpirun is still not killed,it is hanging
>>> there.
>>>
>>
> So the 'ompi-checkpoint' command does not finish? By default
> 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term'
> option to it, then it will.
>
>
>
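
For anyone following along, the checkpoint-and-terminate step described above would look roughly like the sketch below. The helper name `checkpoint_and_term_cmd` is made up for illustration, it only prints the command instead of running it, and the PID in the usage line is a placeholder for the PID of your own mpirun process.

```shell
# Hypothetical helper (not part of Open MPI): print the command that
# checkpoints the job and then terminates it. Nothing is executed here.
checkpoint_and_term_cmd() {
    mpirun_pid=$1   # assumption: PID of the running mpirun process
    printf 'ompi-checkpoint --term %s\n' "$mpirun_pid"
}

# Placeholder PID; substitute the real PID of mpirun.
checkpoint_and_term_cmd 1234
```

Without `--term`, ompi-checkpoint leaves the job running after the snapshot is taken, so mpirun staying alive in that case is expected.
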
>>> when doing ompi-restart, it shows:
>>>
>>> mpiu_at_nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
>>>
>>> --------------------------------------------------------------------------
>>> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because
>>> either you have not provided a filename
>>> or provided an invalid filename.
>>> Please see --help for usage.
>>>
>>>
>>> --------------------------------------------------------------------------
>>>
>>
>>
> Try removing the trailing '/' in the command. The current ompi-restart is
> not good about differentiating between :
>
> ompi_global_snapshot_333.ckpt
> and
>
> ompi_global_snapshot_333.ckpt/
>
>
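
Since shell tab completion tends to append the '/' automatically, one way to guard against it is to strip the slash before the name reaches ompi-restart. This is only a sketch: `strip_trailing_slash` is a made-up helper name, not an Open MPI tool, and the restart line is left commented out because it needs a working checkpoint/restart build.

```shell
# Hypothetical helper: remove any trailing '/' characters from a
# snapshot reference so the bare snapshot name is passed along.
strip_trailing_slash() {
    name=$1
    # Drop one trailing slash per iteration until none remain.
    while [ "${name%/}" != "$name" ]; do
        name=${name%/}
    done
    printf '%s\n' "$name"
}

# Example with the snapshot name from this thread:
strip_trailing_slash "ompi_global_snapshot_333.ckpt/"

# Intended use (commented out; requires Open MPI with C/R support):
# ompi-restart "$(strip_trailing_slash "ompi_global_snapshot_333.ckpt/")"
```
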
>> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with
>> 1.4 (but then I didn't try 1.4 with a shared filesystem).
>>
>
> I would also suggest trying v1.4 or 1.5 to see if your problems persist
> with these versions.
>
> -- Josh
>
>
>
>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>