Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] C/R and orte_oob
From: Adrian Reber (adrian_at_[hidden])
Date: 2014-03-07 06:07:08


On Thu, Mar 06, 2014 at 07:47:22PM -0800, Ralph Castain wrote:
> >>>>> Sorry for delay - yes, that looks like the right direction. I would suggest doing it via the current state machine, though, by simply defining another job or proc state in orte/mca/plm/plm_types.h, and then registering a callback function using the orte_state.add_job[proc]_state(state, function to be called, ORTE_ERR_PRI). Then you can activate it by calling ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the proper order.
> >>>>
> >>>> What is a job/proc in the Open MPI context?
> >>>
> >>> A "job" is the entire application, while a "proc" is just one process in that application. In this case you could use either one as you are checkpointing the entire job, but all this activity is occurring inside each proc. So I'd suggest defining it as a proc state since it only really involves local actions.
> >>>
> >>> If you like, I can define the required code in the trunk and let you fill in the event functionality.
> >>
> >> That would be great.
> >
> > Thanks for your changes. When using --with-ft there are a few compiler
> > errors, which I tried to fix with the following patch:
> >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c
>
> That looks okay, with the only caveat being that you wouldn't ordinarily pass the state_caddy_t into a function. It's just there to pass along the job etc in case the callback function needs to reference something. In this case, I can't think of anything the FT event function would need to know - you just want it to quiet all messaging.
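
To make the register-then-activate pattern described above concrete, here is a minimal toy sketch. Everything in it is hypothetical (the demo_* names, the state value 100, the fixed-size table); it is not the ORTE state machine API. As far as I understand, the real state machine hands activation to the event loop with a priority, whereas this toy simply calls the callback synchronously.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not ORTE declarations. */
#define DEMO_MAX_STATES 8
#define DEMO_PROC_STATE_FT_EVENT 100   /* made-up state value */

typedef void (*demo_state_cbfunc_t)(int state, void *cbdata);

typedef struct {
    int state;
    demo_state_cbfunc_t cbfunc;
} demo_state_entry_t;

static demo_state_entry_t demo_table[DEMO_MAX_STATES];
static size_t demo_num_states = 0;
static int demo_messaging_quiet = 0;

/* Register a callback for a state.
 * Returns 0 on success, -1 if the table is full or the state is taken. */
int demo_add_proc_state(int state, demo_state_cbfunc_t cbfunc)
{
    size_t i;
    assert(NULL != cbfunc);
    for (i = 0; i < demo_num_states; i++) {
        if (demo_table[i].state == state) {
            return -1;
        }
    }
    if (demo_num_states == DEMO_MAX_STATES) {
        return -1;
    }
    demo_table[demo_num_states].state = state;
    demo_table[demo_num_states].cbfunc = cbfunc;
    demo_num_states++;
    return 0;
}

/* Activate a state: look up its callback and run it right away.
 * Returns 0 if a callback ran, -1 if none is registered. */
int demo_activate_proc_state(int state, void *cbdata)
{
    size_t i;
    for (i = 0; i < demo_num_states; i++) {
        if (demo_table[i].state == state) {
            demo_table[i].cbfunc(state, cbdata);
            return 0;
        }
    }
    return -1;
}

/* The FT event callback only needs to quiet messaging, so it ignores
 * both arguments and sets a flag. */
void demo_ft_event_cb(int state, void *cbdata)
{
    (void)state;
    (void)cbdata;
    demo_messaging_quiet = 1;
}
```

Registering demo_ft_event_cb for DEMO_PROC_STATE_FT_EVENT once at startup and later calling demo_activate_proc_state(DEMO_PROC_STATE_FT_EVENT, NULL) mirrors the add-state / activate-state split, with the caveat that ordering by priority is not modelled here.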

I need to pass the type of state to the ft_event() functions:

enum opal_crs_state_type_t {
    OPAL_CRS_NONE = 0,
    OPAL_CRS_CHECKPOINT = 1,
    OPAL_CRS_RESTART_PRE = 2,
    OPAL_CRS_RESTART = 3, /* RESTART_POST */

so an int is all I need, and I probably have to encode it into *cbdata. Should I
just pass the int directly in *cbdata, or should it be part of a struct?
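
To illustrate the two options, here is a rough sketch. The callback type and all demo_* names are hypothetical placeholders, not the actual OPAL/ORTE declarations; the only assumption is a callback that receives a void *cbdata.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callback signature; the real one may differ. */
typedef void (*demo_cb_fn_t)(void *cbdata);

/* Records the state value the most recent callback decoded. */
static int demo_last_state = -1;

/* Option 1: the int travels inside the pointer value itself.
 * Nothing is allocated, so nothing has to be freed. */
void demo_cb_int_in_pointer(void *cbdata)
{
    demo_last_state = (int)(intptr_t)cbdata;
}

/* Option 2: the int lives in a small heap-allocated struct.
 * The callback takes ownership and frees it. */
typedef struct {
    int state;
} demo_ft_event_t;

void demo_cb_struct(void *cbdata)
{
    demo_ft_event_t *ev = (demo_ft_event_t *)cbdata;
    assert(NULL != ev);
    demo_last_state = ev->state;
    free(ev);
}

/* Caller side for option 1: cast through intptr_t to avoid
 * int/pointer size-mismatch warnings. */
int demo_fire_int(demo_cb_fn_t cb, int state)
{
    cb((void *)(intptr_t)state);
    return demo_last_state;
}

/* Caller side for option 2: allocate, fill in, hand over.
 * Returns -1 if the allocation fails. */
int demo_fire_struct(demo_cb_fn_t cb, int state)
{
    demo_ft_event_t *ev = malloc(sizeof(*ev));
    if (NULL == ev) {
        return -1;
    }
    ev->state = state;
    cb(ev);
    return demo_last_state;
}
```

The int-in-pointer variant is shorter and has no ownership question, but it relies on the intptr_t round trip and cannot grow. The struct variant costs an allocation and needs a clear rule for who frees it, but it leaves room for extra fields later.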

                Adrian