Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Changing location where checkpoints are saved
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-12-09 13:54:22


I took a look at the checkpoint staging and preload functionality. It
seems that the combination of the two is broken on the v1.3 and v1.4
branches. I filed a bug about it so that it would not get lost:
   https://svn.open-mpi.org/trac/ompi/ticket/2139

I also attached a patch to partially fix the problem, but the actual
fix is much more involved. I don't know when I'll get around to
finishing this bug fix for that branch. :(

However, the current development trunk and v1.5 are known to have a
working version of this feature. Can you try the trunk or v1.5 and see
if this fixes the problem?

-- Josh

P.S. If you are interested, we have a slightly better version of the
documentation, hosted at the link below:
   http://osl.iu.edu/research/ft/ompi-cr/
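
As a rough aid for the stale-file check suggested in the quoted message
below (files in /tmp starting with "opal" or "ompi"), here is a minimal
Python sketch. The function name find_stale_entries is hypothetical and
not part of Open MPI. It only lists matching entries in one directory on
the local node and deletes nothing, so it would have to be run on every
node listed in the machinefile.

```python
import os


def find_stale_entries(directory, prefixes=("opal", "ompi")):
    """Return the sorted names in `directory` that start with any prefix.

    Assumption: leftover checkpoint files and directories can be
    recognised by the "opal"/"ompi" name prefixes mentioned in the thread.
    """
    prefixes = tuple(prefixes)
    return sorted(
        name for name in os.listdir(directory) if name.startswith(prefixes)
    )
```

Calling find_stale_entries("/tmp") on a node returns the candidate names
to inspect before the next checkpoint attempt; whether they are safe to
delete still has to be decided by hand.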

On Nov 18, 2009, at 1:27 PM, Constantinos Makassikis wrote:

> Josh Hursey wrote:
>> (Sorry for the excessive delay in replying)
>>
>> On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:
>>
>>> Thanks for the reply!
>>>
>>> Concerning the mca options for checkpointing:
>>> - are verbosity options (e.g.: crs_base_verbose) limited to the
>>> values 0 and 1?
>>> - in priority options (e.g.: crs_blcr_priority), do lower numbers
>>> indicate higher priority?
>>>
>>> By searching in the archives of the mailing list I found two
>>> interesting/useful posts:
>>> - [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php
>>> (for different checkpointing schemes)
>>> - [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php
>>> (for restarting)
>>>
>>> Following the indications given in [1], I tried to make each process
>>> checkpoint itself in its local /tmp and centralize the resulting
>>> checkpoints in /tmp or $HOME:
>>>
>>> Excerpt from mca-params.conf:
>>> -----------------------------
>>> snapc_base_store_in_place=0
>>> snapc_base_global_snapshot_dir=/tmp or $HOME
>>> crs_base_snapshot_dir=/tmp
>>>
>>> COMMANDS used:
>>> --------------
>>> mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
>>> ompi-checkpoint mpirun_pid
>>>
>>>
>>>
>>> OUTPUT of ompi-checkpoint -v 16753
>>> --------------------------------------
>>> [ic85:17044] orte_checkpoint: Checkpointing...
>>> [ic85:17044] PID 17036
>>> [ic85:17044] Connected to Mpirun [[42098,0],0]
>>> [ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
>>> [ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] Requested - Global Snapshot Reference: (null)
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] Pending - Global Snapshot Reference: (null)
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] Running - Global Snapshot Reference: (null)
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] File Transfer - Global Snapshot Reference: (null)
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>>> [ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt
>>>
>>>
>>>
>>> OUTPUT of MPIRUN
>>> ----------------
>>> ----------------------------
>>> [ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
>>> [ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
>>> --------------------------------------------------------------------------
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
>>> Host: ic85
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>> [ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
>>> [ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054
>>
>> This is a warning about creating the global snapshot directory
>> (ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0).
>> It seems to indicate that the directory existed when the file
>> gather started.
>>
>> A couple things to check:
>> - Did you clean out the /tmp on all of the nodes with any files
>> starting with "opal" or "ompi"?
>> - Does the error go away when you set
>> (snapc_base_global_snapshot_dir=$HOME)?
>> - Could you try running against a v1.3 release? (I wonder if this
>> feature has been broken on the trunk)
>>
>> Let me know what you find. In the next couple days, I'll try to
>> test the trunk again with this feature to make sure that it is
>> still working on my test machines.
>>
>> -- Josh
> Hello Josh,
>
> I have switched to v1.3 and re-run with
> snapc_base_global_snapshot_dir=/tmp or $HOME
> with a clean /tmp.
>
> In both cases I get the same error as before :-(
>
> I don't know if the following can be of any help, but after
> ompi-checkpoint returns there is only a copy of the checkpoint of the
> rank 0 process in the global snapshot directory:
>
> $(snapc_base_global_snapshot_dir)/ompi_global_snapshot_XXXX.ckpt/0
>
> So I guess the error occurs during the remote copy phase.
>
> --
> Constantinos
>>
>>
>>>
>>>
>>>
>>> Does anyone have an idea about what is wrong?
>>>
>>>
>>> Best regards,
>>>
>>> --
>>> Constantinos
>>>
>>>
>>>
>>> Josh Hursey wrote:
>>>> This is described in the C/R User's Guide attached to the webpage
>>>> below:
>>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>>>
>>>> Additionally this has been addressed on the users mailing list in
>>>> the past, so searching around will likely turn up some examples.
>>>>
>>>> -- Josh
>>>>
>>>> On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS
>>>>> account. By default,
>>>>> it seems that checkpoints are saved in $HOME. However, I would
>>>>> prefer them
>>>>> to be saved on a local disk (e.g.: /tmp).
>>>>>
>>>>> Does anyone know how I can change the location where Open MPI
>>>>> saves checkpoints?
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> --
>>>>> Constantinos
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users