Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] OPAL_CRS_* meaning
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2014-02-18 11:22:28


Just replied to your other email before seeing this. Take a look at those
comments and let me know if that helps differentiate those interfaces.

On Tue, Feb 18, 2014 at 5:28 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:

> opal_crs.checkpoint() is not used to restart the process, but it does
> return in two different cases:
>
> - in the "continue" case, opal_crs.checkpoint() returns in the original
> process and keeps executing the same process and then, IIRC, invokes
> opal_crs.continue().
>
> - in the "restart" case, opal_crs.checkpoint() returns into a new process
> and then, IIRC, invokes opal_crs.restart().
>
>
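[Editorial sketch, not part of the original mail.] The two return paths described above can be illustrated with a small, self-contained C toy. Everything in it is a hypothetical stand-in: the `demo_*` names, the enum, and the `woke_from_image` flag are assumptions made for illustration, not the actual OPAL CRS API or its signatures. A real checkpointer would capture the process image; here the caller simply states which side of the checkpoint it is on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OPAL_CRS_* states discussed in the thread.
 * Names and values are illustrative only, not copied from crs.h. */
typedef enum {
    DEMO_CRS_ERROR,
    DEMO_CRS_CONTINUE,
    DEMO_CRS_RESTART
} demo_crs_state_t;

/* Counts how often each (hypothetical) follow-up hook was invoked. */
typedef struct {
    int continues;
    int restarts;
} demo_hooks_t;

/* Toy "checkpoint" call: it reports through *state which of the two
 * scenarios the caller emerged in. */
static int demo_checkpoint(bool woke_from_image, demo_crs_state_t *state)
{
    assert(state != NULL);
    *state = woke_from_image ? DEMO_CRS_RESTART : DEMO_CRS_CONTINUE;
    return 0;
}

/* Caller side: take the checkpoint, then dispatch on the reported state. */
static int demo_after_checkpoint(bool woke_from_image, demo_hooks_t *hooks)
{
    demo_crs_state_t state = DEMO_CRS_ERROR;

    assert(hooks != NULL);
    if (demo_checkpoint(woke_from_image, &state) != 0) {
        return -1;
    }

    switch (state) {
    case DEMO_CRS_CONTINUE:   /* same process keeps running */
        hooks->continues++;
        return 0;
    case DEMO_CRS_RESTART:    /* new process resumed from the image */
        hooks->restarts++;
        return 0;
    default:
        return -1;
    }
}
```

Calling `demo_after_checkpoint(false, &hooks)` increments `hooks.continues`, and passing `true` increments `hooks.restarts`. The point is only that a single call site has to handle both outcomes.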
> On Feb 18, 2014, at 5:29 AM, Adrian Reber <adrian_at_[hidden]> wrote:
>
> > I should have read this email before answering the other.
> >
> > So opal_crs.checkpoint() is used to checkpoint the process as well as
> > restart the process? I would have expected opal_crs.restart() is used
> > for restart. I am confused. Looking at CRS/BLCR checkpoint() seems to
> > only checkpoint and restart() seems to only restart. The comment in
> > opal/mca/crs/crs.h says the same as you say.
> >
> >
> > On Mon, Feb 17, 2014 at 03:43:08PM -0600, Josh Hursey wrote:
> >> These values indicate the current state of the checkpointing lifecycle. In
> >> particular CONTINUE/RESTART are set by the checkpointer in the CRS (all
> >> others are used by the INC mechanism). In the opal_crs.checkpoint() call
> >> the checkpointer will capture the program state and it is possible to
> >> emerge from this function in one of two scenarios. Either we are continuing
> >> execution in the original process (Continue state), or we are resuming
> >> execution from a checkpointed state (Restart state).
> >>
> >> So if the checkpoint was successful, and you are not restarting the process
> >> then you want OPAL_CRS_CONTINUE.
> >>
> >> If the process is being restarted from a checkpoint file, then we should
> >> emerge from this function setting the state to OPAL_CRS_RESTART.
> >>
> >> The OPAL_CR_CHECKPOINT state is used in the INC mechanism to notify all of
> >> the components to prepare for checkpoint (we probably should have called it
> >> OPAL_CR_PREPARE_FOR_CKPT). So not really used by the CRS mechanisms at all.
> >> You can see it used in the opal_cr_inc_core_prep() function in
> >> opal/runtime/opal_cr.c
> >>
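[Editorial sketch, not part of the original mail.] The split described above, where one state tells the components to prepare and the checkpointer itself reports Continue or Restart, might be modelled as below. All `sketch_*` and `SKETCH_*` names are hypothetical. Notifying the components of the outcome after the checkpointer returns is an assumption of this toy, not something the mail states, and none of it is taken from opal/runtime/opal_cr.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lifecycle states; SKETCH_PREPARE plays the role the mail
 * assigns to the "prepare for checkpoint" notification. */
typedef enum {
    SKETCH_PREPARE,
    SKETCH_CONTINUE,
    SKETCH_RESTART
} sketch_state_t;

#define SKETCH_MAX_EVENTS 8

/* A toy component that simply records every state it is notified of. */
typedef struct {
    sketch_state_t seen[SKETCH_MAX_EVENTS];
    size_t count;
} sketch_component_t;

/* Stand-in for the notification pass: tell every component the state. */
static void sketch_notify(sketch_component_t *comps, size_t n, sketch_state_t s)
{
    for (size_t i = 0; i < n; i++) {
        assert(comps[i].count < SKETCH_MAX_EVENTS);
        comps[i].seen[comps[i].count++] = s;
    }
}

/* Stand-in for the checkpointer: only this step decides between
 * Continue and Restart. */
static sketch_state_t sketch_run_checkpointer(int woke_from_image)
{
    return woke_from_image ? SKETCH_RESTART : SKETCH_CONTINUE;
}

/* One pass through the lifecycle: prepare, checkpoint, report the outcome. */
static sketch_state_t sketch_lifecycle(sketch_component_t *comps, size_t n,
                                       int woke_from_image)
{
    sketch_notify(comps, n, SKETCH_PREPARE);
    sketch_state_t outcome = sketch_run_checkpointer(woke_from_image);
    sketch_notify(comps, n, outcome);
    return outcome;
}
```

In this toy, each component sees the prepare state first and then whichever outcome the checkpointer reported. The checkpointer never emits the prepare state itself.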
> >> -- Josh
> >>
> >>
> >>
> >> On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber <adrian_at_[hidden]> wrote:
> >>
> >>> This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
> >>>
> >>> They are probably used to communicate the state of the CRS modules.
> >>> OPAL_CRS_ERROR seems to be used in case an error happened. What is the
> >>> CRS module supposed to set this to if the checkpoint was successful?
> >>>
> >>> OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
> >>>
> >>> Adrian
> >>> _______________________________________________
> >>> devel mailing list
> >>> devel_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>
> >>
> >>
> >>
> >> --
> >> Joshua Hursey
> >> Assistant Professor of Computer Science
> >> University of Wisconsin-La Crosse
> >> http://cs.uwlax.edu/~jjhursey
> >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>

-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey