Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] Changing location where checkpoints are saved
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-12-09 13:54:22


I took a look at the checkpoint staging and preload functionality. It
seems that the combination of the two is broken on the v1.3 and v1.4
branches. I filed a bug about it so that it would not get lost:
   https://svn.open-mpi.org/trac/ompi/ticket/2139

I also attached a patch to partially fix the problem, but the actual
fix is much more involved. I don't know when I'll get around to
finishing this bug fix for that branch. :(

However, the current development trunk and v1.5 are known to have a
working version of this feature. Can you try the trunk or v1.5 and see
if this fixes the problem?

-- Josh

P.S. If you are interested, we have a slightly better version of the
documentation, hosted at the link below:
   http://osl.iu.edu/research/ft/ompi-cr/
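
For readers following the thread, the staging setup under discussion can be
sketched as below. This is only a sketch: it uses the three parameters from
the mca-params.conf excerpt quoted further down, and the helper name
write_cr_params is made up for illustration (it is not part of Open MPI).
It prints the settings rather than touching any real configuration file.

```shell
#!/bin/sh
# Sketch only: print the checkpoint-staging parameters quoted in this thread.
# write_cr_params is a hypothetical helper, not an Open MPI tool.
#   $1: node-local directory where each process writes its own checkpoint
#       (crs_base_snapshot_dir)
#   $2: directory where the global snapshot is gathered
#       (snapc_base_global_snapshot_dir)
write_cr_params() {
    local_dir=$1
    global_dir=$2
    # 0 = do not store checkpoints in place; stage them to the global directory
    printf 'snapc_base_store_in_place=0\n'
    printf 'snapc_base_global_snapshot_dir=%s\n' "$global_dir"
    printf 'crs_base_snapshot_dir=%s\n' "$local_dir"
}

# Example: checkpoint locally in /tmp, gather the global snapshot in $HOME.
write_cr_params /tmp "$HOME"
```

The printed lines could then be appended to the mca-params.conf file in use.
Note that, as described above, this combination is reported as broken on the
v1.3 and v1.4 branches, so it is only expected to work on the trunk or v1.5.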

On Nov 18, 2009, at 1:27 PM, Constantinos Makassikis wrote:

> Josh Hursey wrote:
>> (Sorry for the excessive delay in replying)
>>
>> On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:
>>
>>> Thanks for the reply!
>>>
>>> Concerning the mca options for checkpointing:
>>> - are verbosity options (e.g.: crs_base_verbose) limited to the
>>> values 0 and 1?
>>> - in priority options (e.g.: crs_blcr_priority), do lower numbers
>>> indicate higher priority?
>>>
>>> By searching in the archives of the mailing list I found two
>>> interesting/useful posts:
>>> - [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php
>>> (for different checkpointing schemes)
>>> - [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php
>>> (for restarting)
>>>
>>> Following the indications given in [1], I tried to make each process
>>> checkpoint itself in its local /tmp and centralize the resulting
>>> checkpoints in /tmp or $HOME:
>>>
>>> Excerpt from mca-params.conf:
>>> -----------------------------
>>> snapc_base_store_in_place=0
>>> snapc_base_global_snapshot_dir=/tmp or $HOME
>>> crs_base_snapshot_dir=/tmp
>>>
>>> COMMANDS used:
>>> --------------
>>> mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
>>> ompi-checkpoint mpirun_pid
>>>
>>>
>>>
>>> OUTPUT of ompi-checkpoint -v 16753
>>> --------------------------------------
>>> [ic85:17044] orte_checkpoint: Checkpointing...
>>> [ic85:17044] PID 17036
>>> [ic85:17044] Connected to Mpirun [[42098,0],0]
>>> [ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node
>>> Process PID 17036
>>> [ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint
>>> of jobid [INVALID]
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command
>>> message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] Requested - Global Snapshot
>>> Reference: (null)
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command
>>> message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] Pending - Global Snapshot
>>> Reference: (null)
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command
>>> message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] Running - Global Snapshot
>>> Reference: (null)
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command
>>> message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] File Transfer - Global Snapshot
>>> Reference: (null)
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command
>>> message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] Error - Global Snapshot
>>> Reference: ompi_global_snapshot_17036.ckpt
>>>
>>>
>>>
>>> OUTPUT of MPIRUN
>>> ----------------
>>> ----------------------------
>>> [ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with
>>> status 3
>>> [ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with
>>> status 3
>>> --------------------------------------------------------------------------
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
>>> Host: ic85
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>> [ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
>>> [ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in
>>> file ../../../../../orte/mca/snapc/full/snapc_full_global.c at
>>> line 1054
>>
>> This is a warning about creating the global snapshot directory
>> (ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0).
>> It seems to indicate that the directory existed when the file
>> gather started.
>>
>> A couple things to check:
>> - Did you remove any files starting with "opal" or "ompi" from /tmp
>> on all of the nodes?
>> - Does the error go away when you set
>> (snapc_base_global_snapshot_dir=$HOME)?
>> - Could you try running against a v1.3 release? (I wonder if this
>> feature has been broken on the trunk)
>>
>> Let me know what you find. In the next couple days, I'll try to
>> test the trunk again with this feature to make sure that it is
>> still working on my test machines.
>>
>> -- Josh
> Hello Josh,
>
> I have switched to v1.3 and re-run with
> snapc_base_global_snapshot_dir=/tmp or $HOME
> with a clean /tmp.
>
> In both cases I get the same error as before :-(
>
> I don't know if the following can be of any help, but after
> ompi-checkpoint returns, there is only a copy of the checkpoint of the
> rank 0 process in the global snapshot directory:
>
> $(snapc_base_global_snapshot_dir)/ompi_global_snapshot_XXXX.ckpt/0
>
> So I guess the error occurs during the remote copy phase.
>
> --
> Constantinos
>>
>>
>>>
>>>
>>>
>>> Does anyone have an idea about what is wrong?
>>>
>>>
>>> Best regards,
>>>
>>> --
>>> Constantinos
>>>
>>>
>>>
>>> Josh Hursey wrote:
>>>> This is described in the C/R User's Guide attached to the webpage
>>>> below:
>>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>>>
>>>> Additionally this has been addressed on the users mailing list in
>>>> the past, so searching around will likely turn up some examples.
>>>>
>>>> -- Josh
>>>>
>>>> On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account.
>>>>> By default, it seems that checkpoints are saved in $HOME. However, I
>>>>> would prefer them to be saved on a local disk (e.g.: /tmp).
>>>>>
>>>>> Does anyone know how I can change the location where Open MPI
>>>>> saves checkpoints?
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> --
>>>>> Constantinos
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users