Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Changing location where checkpoints are saved
From: Constantinos Makassikis (cmakassikis_at_[hidden])
Date: 2009-09-30 13:02:26


Thanks for the reply!

Concerning the MCA options for checkpointing:
- Are verbosity options (e.g. crs_base_verbose) limited to the values 0 and 1?
- In priority options (e.g. crs_blcr_priority), do lower numbers
indicate higher priority?

Searching the archives of the mailing list, I found two useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php
(for restarting)

Following the indications given in [1], I tried to make each process
write its checkpoint to its local /tmp and then centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-----------------------------
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp
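
The "/tmp or $HOME" value above reads as shorthand for two separate runs rather than literal file syntax. As a minimal sketch of the two variants, assuming that reading, the following Python renders both parameter sets. The helper name `render_params` and the example home directory `/home/user` are hypothetical; only the three parameter names come from the excerpt.

```python
def render_params(global_dir, local_dir="/tmp"):
    """Return mca-params.conf lines for one checkpoint-location variant."""
    params = {
        # Taken from the excerpt: checkpoint locally first, gather afterwards.
        "snapc_base_store_in_place": "0",
        # Where the gathered (global) snapshot should end up.
        "snapc_base_global_snapshot_dir": global_dir,
        # Where each process writes its own (local) checkpoint.
        "crs_base_snapshot_dir": local_dir,
    }
    return "\n".join(f"{key}={value}" for key, value in params.items())


# One variant per global directory tried; "/home/user" stands in for $HOME.
variants = [render_params(d) for d in ("/tmp", "/home/user")]
for text in variants:
    print(text)
    print()
```

Writing the variants out separately makes it easier to say which one produced a given error message.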

COMMANDS used:
--------------
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid
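
For scripting the two steps above, a minimal Python sketch that only builds the argument lists may help. The function names `mpirun_cmd` and `checkpoint_cmd` are hypothetical; the flags are exactly those used in this message, and the PID in the example is the one that appears in the log output below.

```python
def mpirun_cmd(nprocs, machinefile, program):
    """Argument list for launching a job with checkpoint/restart enabled."""
    return ["mpirun", "-n", str(nprocs), "-machinefile", machinefile,
            "-am", "ft-enable-cr", program]


def checkpoint_cmd(mpirun_pid, verbose=False):
    """Argument list for checkpointing the job owned by the given mpirun PID."""
    cmd = ["ompi-checkpoint"]
    if verbose:
        cmd.append("-v")
    cmd.append(str(mpirun_pid))
    return cmd


launch = mpirun_cmd(2, "machines", "a.out")
print(" ".join(launch))
print(" ".join(checkpoint_cmd(17036, verbose=True)))
```

On a machine with Open MPI installed, `launch` could be handed to `subprocess.Popen`, whose `pid` attribute would then serve as the `mpirun_pid` argument.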

OUTPUT of ompi-checkpoint -v 17036
--------------------------------------
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt
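
The sequence of states the checkpoint request went through can be pulled out of this verbose output mechanically. Below is a minimal Python sketch, assuming the status lines keep the format shown above. The names `STATUS_RE` and `checkpoint_states` are hypothetical and not part of Open MPI.

```python
import re

# Matches lines such as "[ic85:17044] Running - Global Snapshot Reference: (null)".
STATUS_RE = re.compile(
    r"^\[[^\]]+\]\s+(?P<state>[A-Za-z ]+?)\s+-\s+"
    r"Global Snapshot Reference:\s*(?P<ref>\S*)"
)


def checkpoint_states(log_text):
    """Return (state, reference) pairs in the order they were reported."""
    states = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        match = STATUS_RE.match(line.strip())
        if not match:
            continue
        ref = match.group("ref")
        # A mail client may have wrapped the reference onto the next line.
        if not ref and i + 1 < len(lines):
            following = lines[i + 1].strip()
            if not following.startswith("["):
                ref = following
        states.append((match.group("state"), None if ref in ("", "(null)") else ref))
    return states


sample = """\
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Pending - Global Snapshot Reference: (null)
[ic85:17044] Running - Global Snapshot Reference: (null)
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] Error - Global Snapshot Reference:
ompi_global_snapshot_17036.ckpt
"""

for state, ref in checkpoint_states(sample):
    print(state, ref)
```

For the output above, the last pair is the `Error` state, reported right after `File Transfer`. That is consistent with the failure happening while the local checkpoints were being gathered.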

OUTPUT of MPIRUN
----------------
----------------------------
[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

--------------------------------------------------------------------------
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054

Does anyone have an idea of what is wrong?

Best regards,

--
Constantinos

Josh Hursey wrote:
> This is described in the C/R User's Guide attached to the webpage below:
>   https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>
> Additionally this has been addressed on the users mailing list in the 
> past, so searching around will likely turn up some examples.
>
> -- Josh
>
> On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:
>
>> Dear all,
>>
>> I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account.
>> By default, it seems that checkpoints are saved in $HOME. However, I
>> would prefer them to be saved on a local disk (e.g.: /tmp).
>>
>> Does anyone know how I can change the location where Open MPI saves
>> checkpoints?
>>
>>
>> Best regards,
>>
>> -- 
>> Constantinos
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users