Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Changing location where checkpoints are saved
From: Constantinos Makassikis (cmakassikis_at_[hidden])
Date: 2009-11-18 13:27:50


Josh Hursey wrote:
> (Sorry for the excessive delay in replying)
>
> On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:
>
>> Thanks for the reply!
>>
>> Concerning the mca options for checkpointing:
>> - are verbosity options (e.g.: crs_base_verbose) limited to the
>> values 0 and 1?
>> - in priority options (e.g.: crs_blcr_priority), do lower numbers
>> indicate higher priority?
>>
>> By searching in the archives of the mailing list I found two
>> interesting/useful posts:
>> - [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php
>> (for different checkpointing schemes)
>> - [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php
>> (for restarting)
>>
>> Following indications given in [1], I tried to make each process
>> checkpoint itself in its local /tmp and centralize the resulting
>> checkpoints in /tmp or $HOME:
>>
>> Excerpt from mca-params.conf:
>> -----------------------------
>> snapc_base_store_in_place=0
>> snapc_base_global_snapshot_dir=/tmp or $HOME
>> crs_base_snapshot_dir=/tmp
>>
>> COMMANDS used:
>> --------------
>> mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
>> ompi-checkpoint mpirun_pid
>>
>>
>>
>> OUTPUT of ompi-checkpoint -v 16753
>> --------------------------------------
>> [ic85:17044] orte_checkpoint: Checkpointing...
>> [ic85:17044] PID 17036
>> [ic85:17044] Connected to Mpirun [[42098,0],0]
>> [ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process
>> PID 17036
>> [ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of
>> jobid [INVALID]
>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>> [ic85:17044] Requested - Global Snapshot Reference:
>> (null)
>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>> [ic85:17044] Pending - Global Snapshot Reference:
>> (null)
>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>> [ic85:17044] Running - Global Snapshot Reference:
>> (null)
>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>> [ic85:17044] File Transfer - Global Snapshot Reference:
>> (null)
>> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
>> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
>> [ic85:17044] Error - Global Snapshot Reference:
>> ompi_global_snapshot_17036.ckpt
>>
>>
>>
>> OUTPUT of MPIRUN
>> ----------------
>> ----------------------------
>> [ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with
>> status 3
>> [ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with
>> status 3
>> --------------------------------------------------------------------------
>>
>> WARNING: Could not preload specified file: File already exists.
>>
>> Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
>> Host: ic85
>>
>> Will continue attempting to launch the process.
>>
>> --------------------------------------------------------------------------
>>
>> [ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
>> [ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054
>
> This is a warning about creating the global snapshot directory
> (ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It
> seems to indicate that the directory existed when the file gather
> started.
>
> A couple things to check:
> - Did you clean out the /tmp on all of the nodes with any files
> starting with "opal" or "ompi"?
> - Does the error go away when you set
> (snapc_base_global_snapshot_dir=$HOME)?
> - Could you try running against a v1.3 release? (I wonder if this
> feature has been broken on the trunk)
>
> Let me know what you find. In the next couple days, I'll try to test
> the trunk again with this feature to make sure that it is still
> working on my test machines.
>
> -- Josh
Hello Josh,

I have switched to v1.3 and re-run with a clean /tmp, setting
snapc_base_global_snapshot_dir first to /tmp and then to $HOME.

In both cases I get the same error as before :-(

I don't know whether this helps, but after ompi-checkpoint returns,
the global snapshot directory contains only the checkpoint of the
rank 0 process:

$(snapc_base_global_snapshot_dir)/ompi_global_snapshot_XXXX.ckpt/0

So I guess the error occurs during the remote copy phase.
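
The observation above can be checked per sequence directory with a small
script. This is only a minimal sketch: check_gather is a hypothetical
helper, not an Open MPI tool, and it assumes that each gathered local
snapshot shows up as one entry whose name starts with "opal" inside
<global snapshot>/<sequence number>.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): count the local snapshots
# gathered into one sequence directory of a global snapshot and compare
# the count with the number of MPI processes.
# Assumption: each gathered local snapshot is one entry named "opal*".
check_gather() {
    seq_dir=$1      # e.g. $HOME/ompi_global_snapshot_XXXX.ckpt/0
    expected=$2     # number of MPI processes (2 in the run above)

    if [ ! -d "$seq_dir" ]; then
        echo "missing: $seq_dir"
        return 2
    fi

    found=0
    for entry in "$seq_dir"/opal*; do
        # An unmatched glob stays literal, so test that the entry exists.
        if [ -e "$entry" ]; then
            found=$((found + 1))
        fi
    done

    echo "found $found of $expected local snapshots in $seq_dir"
    [ "$found" -eq "$expected" ]
}
```

For example, `check_gather "$HOME/ompi_global_snapshot_XXXX.ckpt/0" 2`
returns non-zero when fewer than two entries were gathered, which would
be consistent with a failure in the remote copy phase.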

--
Constantinos
>
>
>>
>>
>>
>> Does anyone have an idea about what is wrong?
>>
>>
>> Best regards,
>>
>> -- 
>> Constantinos
>>
>>
>>
>> Josh Hursey wrote:
>>> This is described in the C/R User's Guide attached to the webpage 
>>> below:
>>>  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>>
>>> Additionally this has been addressed on the users mailing list in 
>>> the past, so searching around will likely turn up some examples.
>>>
>>> -- Josh
>>>
>>> On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:
>>>
>>>> Dear all,
>>>>
>>>> I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS 
>>>> account. By default,
>>>> it seems that checkpoints are saved in $HOME. However, I would 
>>>> prefer them
>>>> to be saved on a local disk (e.g.: /tmp).
>>>>
>>>> Does anyone know how I can change the location where Open MPI saves 
>>>> checkpoints?
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> -- 
>>>> Constantinos
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>