Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Changing location where checkpoints are saved
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-11-06 08:28:11


(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:

> Thanks for the reply!
>
> Concerning the mca options for checkpointing:
> - are verbosity options (e.g., crs_base_verbose) limited to the values
> 0 and 1?
> - in priority options (e.g., crs_blcr_priority), do lower numbers
> indicate higher priority?
>
> By searching in the archives of the mailing list I found two
> interesting/useful posts:
> - [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php
> (for different checkpointing schemes)
> - [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php
> (for restarting)
>
> Following the indications given in [1], I tried to make each process
> checkpoint itself in its local /tmp and centralize the resulting
> checkpoints in /tmp or $HOME:
>
> Excerpt from mca-params.conf:
> -----------------------------
> snapc_base_store_in_place=0
> snapc_base_global_snapshot_dir=/tmp or $HOME
> crs_base_snapshot_dir=/tmp
>
> COMMANDS used:
> --------------
> mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
> ompi-checkpoint mpirun_pid
>
>
>
> OUTPUT of ompi-checkpoint -v 16753
> --------------------------------------
> [ic85:17044] orte_checkpoint: Checkpointing...
> [ic85:17044] PID 17036
> [ic85:17044] Connected to Mpirun [[42098,0],0]
> [ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process
> PID 17036
> [ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of
> jobid [INVALID]
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044] Requested - Global Snapshot Reference:
> (null)
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044] Pending - Global Snapshot Reference:
> (null)
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044] Running - Global Snapshot Reference:
> (null)
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044] File Transfer - Global Snapshot Reference:
> (null)
> [ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
> [ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
> [ic85:17044] Error - Global Snapshot Reference:
> ompi_global_snapshot_17036.ckpt
>
>
>
> OUTPUT of MPIRUN
> ----------------
> ----------------------------
> [ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with
> status 3
> [ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with
> status 3
> --------------------------------------------------------------------------
> WARNING: Could not preload specified file: File already exists.
>
> Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
> Host: ic85
>
> Will continue attempting to launch the process.
>
> --------------------------------------------------------------------------
> [ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
> [ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in
> file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line
> 1054

This warning comes from creating the global snapshot directory
(ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It
seems to indicate that the directory already existed when the file
gather started.

A couple of things to check:
  - Did you clean out /tmp on all of the nodes, removing any files
starting with "opal" or "ompi"?
  - Does the error go away when you set
(snapc_base_global_snapshot_dir=$HOME)?
  - Could you try running against a v1.3 release? (I wonder if this
feature has been broken on the trunk)
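
For the first check, a minimal sketch of how you might list leftover
entries before deleting anything. The function name find_stale_entries
is hypothetical, and the "opal*" / "ompi*" patterns are only taken from
the prefixes mentioned above. The script only inspects the node it runs
on, so it would need to be run on each node (ic85 and ic86 here), and it
lists entries without removing them:

```python
import fnmatch
import os


def find_stale_entries(tmp_dir="/tmp", patterns=("opal*", "ompi*")):
    """Return sorted paths in tmp_dir whose names match any pattern.

    Hypothetical helper: it only lists candidates, it does not delete.
    """
    try:
        names = os.listdir(tmp_dir)
    except OSError:
        # Directory missing or unreadable: report nothing.
        return []
    return sorted(
        os.path.join(tmp_dir, name)
        for name in names
        if any(fnmatch.fnmatchcase(name, p) for p in patterns)
    )


if __name__ == "__main__":
    for path in find_stale_entries():
        print(path)
```

If the list is non-empty on any node, remove those entries by hand once
you are sure no other job is still using them, then retry the
checkpoint.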

Let me know what you find. In the next couple of days, I'll try to test
the trunk again with this feature to make sure that it is still
working on my test machines.

-- Josh

>
>
>
> Does anyone have an idea about what is wrong?
>
>
> Best regards,
>
> --
> Constantinos
>
>
>
> Josh Hursey wrote:
>> This is described in the C/R User's Guide attached to the webpage
>> below:
>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>
>> Additionally this has been addressed on the users mailing list in
>> the past, so searching around will likely turn up some examples.
>>
>> -- Josh
>>
>> On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:
>>
>>> Dear all,
>>>
>>> I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS
>>> account. By default,
>>> it seems that checkpoints are saved in $HOME. However, I would
>>> prefer them
>>> to be saved on a local disk (e.g.: /tmp).
>>>
>>> Does anyone know how I can change the location where Open MPI
>>> saves checkpoints?
>>>
>>>
>>> Best regards,
>>>
>>> --
>>> Constantinos
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users