Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
From: fengguang tian (fernyabc_at_[hidden])
Date: 2010-03-29 11:53:55


Hi,

I have used the --term option, but mpirun is still hanging; it is the same
whether I include the '/' or not. I am installing v1.4 to see whether
the problems are still there. I tried, but some problems are still there.

BTW, my MPI program reads some input files and generates some output
files after some computation. It can be checkpointed, but when I restart it,
some errors occur. Have you met this kind of problem?

cheers
fengguang

On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey <jjhursey_at_[hidden]> wrote:

>
> On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
>
>> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian <fernyabc_at_[hidden]>
>> wrote:
>>
>>>
>>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
>>> --hostfile .mpihostfile xxxx
>>> to store the global checkpoint snapshot into the shared
>>> directory: /mirror, but the problems are still there.
>>> When I run ompi-checkpoint, mpirun is still not killed; it is hanging
>>> there.
>>>
>>
> So the 'ompi-checkpoint' command does not finish? By default
> 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term'
> option to it, then it will.
>
>
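The '--term' behaviour described above can be sketched as a small shell helper. This is only an illustration and not something from the thread: the function name `checkpoint_and_stop` and the `OMPI_CHECKPOINT` override variable are hypothetical, and it assumes `ompi-checkpoint` takes the PID of the running `mpirun` process as its argument.

```shell
# Checkpoint a running MPI job and ask Open MPI to terminate it afterwards.
# Hypothetical helper; assumes ompi-checkpoint accepts the mpirun PID.
checkpoint_and_stop() {
    pid="$1"
    # Reject empty or non-numeric input before calling the tool.
    case "$pid" in
        ''|*[!0-9]*)
            echo "usage: checkpoint_and_stop <PID of mpirun>" >&2
            return 2
            ;;
    esac
    # OMPI_CHECKPOINT is a hypothetical override for the binary name;
    # by default the ompi-checkpoint found on PATH is used.
    "${OMPI_CHECKPOINT:-ompi-checkpoint}" --term "$pid"
}
```

Without '--term', the same call would take the checkpoint and leave the job running, which matches the default behaviour Josh describes.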
>
>>> when doing ompi-restart, it shows:
>>>
>>> mpiu_at_nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
>>>
>>> --------------------------------------------------------------------------
>>> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because
>>> either you have not provided a filename
>>> or provided an invalid filename.
>>> Please see --help for usage.
>>>
>>>
>>> --------------------------------------------------------------------------
>>>
>>
>>
> Try removing the trailing '/' in the command. The current ompi-restart is
> not good about differentiating between:
>
> ompi_global_snapshot_333.ckpt
> and
>
> ompi_global_snapshot_333.ckpt/
>
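One way to sidestep the trailing-slash issue (which shell tab completion tends to add for directories) is to normalize the snapshot reference before handing it to `ompi-restart`. A minimal sketch, assuming a POSIX shell; the function name `normalize_snapshot_ref` is hypothetical and not part of Open MPI:

```shell
# Strip any trailing slashes from a snapshot reference so that
# "ompi_global_snapshot_333.ckpt/" becomes "ompi_global_snapshot_333.ckpt".
normalize_snapshot_ref() {
    ref="$1"
    # Remove one trailing slash per iteration until none is left.
    while [ "${ref%/}" != "$ref" ]; do
        ref="${ref%/}"
    done
    printf '%s\n' "$ref"
}

# Intended usage (assumes ompi-restart is on PATH):
#   ompi-restart "$(normalize_snapshot_ref ompi_global_snapshot_333.ckpt/)"
```

Note that this only cleans up the argument; it does not address the hang reported elsewhere in the thread.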
>
>> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with
>> 1.4 (but then I didn't try 1.4 with a shared filesystem).
>>
>
> I would also suggest trying v1.4 or 1.5 to see if your problems persist
> with these versions.
>
> -- Josh
>
>
>
>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>