Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OPAL_CRS_* meaning
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2014-02-17 16:43:08


These values indicate the current state of the checkpointing lifecycle. In
particular, CONTINUE and RESTART are set by the checkpointer in the CRS; all
the others are used by the INC mechanism. In the opal_crs.checkpoint() call
the checkpointer captures the program state, and we can emerge from this
function in one of two scenarios: either we are continuing execution in the
original process (Continue state), or we are resuming execution from a
checkpointed state (Restart state).

So if the checkpoint was successful and you are not restarting the process,
then you want OPAL_CRS_CONTINUE.

If the process is being restarted from a checkpoint file, then we should
emerge from this function setting the state to OPAL_CRS_RESTART.
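
As a rough sketch of that rule (not the actual Open MPI code): the enum and
function names below are hypothetical stand-ins prefixed with demo_, and the
"outcome" type is an assumption about what a low-level checkpointer might
report. DEMO_CRS_CONTINUE, DEMO_CRS_RESTART and DEMO_CRS_ERROR stand in for
OPAL_CRS_CONTINUE, OPAL_CRS_RESTART and OPAL_CRS_ERROR.

```c
#include <assert.h>

/* Hypothetical stand-ins for the OPAL_CRS_* values named in this thread.
 * The numeric values are arbitrary and not taken from the real header. */
typedef enum {
    DEMO_CRS_CONTINUE,  /* checkpoint taken, still the original process */
    DEMO_CRS_RESTART,   /* resumed from a checkpoint file */
    DEMO_CRS_ERROR      /* something went wrong */
} demo_crs_state_t;

/* Assumed outcome reported by the underlying checkpointer. */
typedef enum {
    DEMO_CKPT_FAILED,
    DEMO_CKPT_TAKEN,    /* image written; execution continues in place */
    DEMO_CKPT_RESUMED   /* execution resumed from the saved image */
} demo_ckpt_outcome_t;

/* Pick the state a CRS checkpoint routine would report on the way out. */
static demo_crs_state_t demo_state_after_checkpoint(demo_ckpt_outcome_t outcome)
{
    switch (outcome) {
    case DEMO_CKPT_TAKEN:
        return DEMO_CRS_CONTINUE;
    case DEMO_CKPT_RESUMED:
        return DEMO_CRS_RESTART;
    case DEMO_CKPT_FAILED:
    default:
        return DEMO_CRS_ERROR;
    }
}

/* Usage example: the two ways of emerging from the call, plus failure. */
static void demo_self_check(void)
{
    assert(demo_state_after_checkpoint(DEMO_CKPT_TAKEN) == DEMO_CRS_CONTINUE);
    assert(demo_state_after_checkpoint(DEMO_CKPT_RESUMED) == DEMO_CRS_RESTART);
    assert(demo_state_after_checkpoint(DEMO_CKPT_FAILED) == DEMO_CRS_ERROR);
}
```

How a real component tells "taken" from "resumed" depends on the
checkpointer it wraps, so that part is deliberately left out here.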

The OPAL_CRS_CHECKPOINT state is used in the INC mechanism to notify all of
the components to prepare for checkpoint (we probably should have called it
OPAL_CR_PREPARE_FOR_CKPT). So it is not really used by the CRS mechanisms at
all. You can see it used in the opal_cr_inc_core_prep() function in
opal/runtime/opal_cr.c
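
To illustrate the "notify all of the components" idea, here is a minimal
sketch. It is not the code in opal_cr.c: every name prefixed with inc_demo_
is hypothetical, and what a component does on continue or restart is my
assumption for the sake of the example. Only the "prepare for checkpoint"
step comes from the description above.

```c
#include <stddef.h>

/* Hypothetical lifecycle states for this sketch only. */
typedef enum {
    INC_DEMO_CHECKPOINT,  /* "prepare for checkpoint" notification */
    INC_DEMO_CONTINUE,    /* assumed: resume in the original process */
    INC_DEMO_RESTART      /* assumed: resume from a checkpoint file */
} inc_demo_state_t;

/* Hypothetical component with just enough state to observe the calls. */
typedef struct {
    const char *name;
    int quiesced;      /* 1 while prepared for a checkpoint */
    int reinit_count;  /* times the component rebuilt itself on restart */
} inc_demo_component_t;

/* React to one lifecycle notification. */
static void inc_demo_component_event(inc_demo_component_t *c,
                                     inc_demo_state_t state)
{
    switch (state) {
    case INC_DEMO_CHECKPOINT:
        c->quiesced = 1;       /* stop activity so state can be captured */
        break;
    case INC_DEMO_CONTINUE:
        c->quiesced = 0;       /* pick up where we left off */
        break;
    case INC_DEMO_RESTART:
        c->quiesced = 0;
        c->reinit_count++;     /* assumed: rebuild stale resources */
        break;
    }
}

/* Deliver the same notification to every registered component. */
static void inc_demo_notify_all(inc_demo_component_t *comps, size_t n,
                                inc_demo_state_t state)
{
    for (size_t i = 0; i < n; i++) {
        inc_demo_component_event(&comps[i], state);
    }
}
```

In this sketch a caller would send INC_DEMO_CHECKPOINT before the state is
captured, then send either INC_DEMO_CONTINUE or INC_DEMO_RESTART depending
on how execution emerged from the checkpoint call.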

-- Josh

On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber <adrian_at_[hidden]> wrote:

> This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
>
> They are probably used to communicate the state of the CRS modules.
> OPAL_CRS_ERROR seems to be used in case an error happened. What is the
> CRS module supposed to set this to if the checkpoint was successful?
>
> OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
>
> Adrian

-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey