On Feb 13, 2014, at 11:26 AM, Adrian Reber <adrian_at_[hidden]> wrote:
> On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote:
>> On Feb 6, 2014, at 2:16 PM, Adrian Reber <adrian_at_[hidden]> wrote:
>>> Josh explained to me a few days ago that after a checkpoint has been
>>> received, TCP should no longer be used, so that no messages are lost. The
>>> communication happens over named pipes and therefore (I think) OOB
>>> ft_event() is used to quiet anything besides the pipes. This all seems
>>> to work but I was just confused as the functions for ft_event()
>>> in oob/tcp and oob/ud do not seem to contain any functionality.
>>> So should I fix the ft_event() function in oob/base/ to call the
>>> registered ft_event() functions, which do nothing, or should I just remove
>>> the call to orte oob ft_event()?
>> Sounds like you'll need to tell the OOB components to stop processing messages, so that will require that you insert an event into the system. You have to account for two things:
>> (a) the OOB base and OOB components are operating on the orte_event_base, but
>> (b) each OOB component can have multiple active modules (one per NIC) that are operating on their own event base/thread.
>> So you have to start by pushing an event that calls the OOB base, which then loops across the components calling their ft_event interface. Each component would then have to create an event for each active module, inserting that event into the module's event base/thread. When activated, each module would have to shut down its message engine, and activate another event to notify its component that all is quiet.
>> Once a component finds out that all its modules are quiet, it would then have to activate an event to the OOB base. Once the OOB base sees all components report quiet, then it would have to activate an event to take you to the next step in your process.
>> In other words, you need to turn the quieting process into its own set of states and run it through the state machine. This is the only way to guarantee that you'll keep things orderly, and is the major change needed in the C/R procedure as it flows thru ORTE. You can't just progress thru a set of function calls as you'll inevitably run into a roadblock requiring that you wait for an event-driven process to complete.
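The cascade described above (base, then components, then modules, then back up) can be sketched in plain C. This is only an illustration under stated assumptions, not ORTE code: every name in it (`push_event`, `handle_event`, `run_quiesce_demo`, the `EV_*` values, the component and module counts) is hypothetical, and a single FIFO queue stands in for the orte_event_base and the per-module event bases/threads.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_EVENTS            64
#define NUM_COMPONENTS        2   /* assumed: e.g. oob/tcp and oob/ud */
#define MODULES_PER_COMPONENT 2   /* assumed: one module per NIC */

/* Hypothetical event types, one per step of the quieting process. */
typedef enum {
    EV_BASE_QUIET_REQUEST,   /* ask the OOB base to quiet everything */
    EV_COMPONENT_FT_EVENT,   /* base -> component */
    EV_MODULE_QUIET,         /* component -> module */
    EV_MODULE_DONE,          /* module -> component: "all is quiet" */
    EV_COMPONENT_DONE,       /* component -> base: all modules quiet */
    EV_NEXT_STEP             /* base -> next step of the C/R procedure */
} event_type_t;

typedef struct { event_type_t type; int component; int module; } event_t;

/* One FIFO queue stands in for all event bases in this sketch. */
static event_t queue[MAX_EVENTS];
static int head, tail;

static int modules_active[NUM_COMPONENTS];
static int components_active;
static int next_step_reached;
int events_processed;

static void push_event(event_type_t type, int component, int module)
{
    assert(tail < MAX_EVENTS);
    queue[tail].type = type;
    queue[tail].component = component;
    queue[tail].module = module;
    tail++;
}

static void handle_event(const event_t *ev)
{
    int i;
    switch (ev->type) {
    case EV_BASE_QUIET_REQUEST:
        /* The base loops across the components. */
        components_active = NUM_COMPONENTS;
        for (i = 0; i < NUM_COMPONENTS; i++)
            push_event(EV_COMPONENT_FT_EVENT, i, -1);
        break;
    case EV_COMPONENT_FT_EVENT:
        /* Each component creates one event per active module. */
        modules_active[ev->component] = MODULES_PER_COMPONENT;
        for (i = 0; i < MODULES_PER_COMPONENT; i++)
            push_event(EV_MODULE_QUIET, ev->component, i);
        break;
    case EV_MODULE_QUIET:
        /* A real module would shut down its message engine here. */
        printf("component %d module %d quiet\n", ev->component, ev->module);
        push_event(EV_MODULE_DONE, ev->component, ev->module);
        break;
    case EV_MODULE_DONE:
        /* The component reports to the base once its last module is quiet. */
        if (--modules_active[ev->component] == 0)
            push_event(EV_COMPONENT_DONE, ev->component, -1);
        break;
    case EV_COMPONENT_DONE:
        /* The base moves on once the last component is quiet. */
        if (--components_active == 0)
            push_event(EV_NEXT_STEP, -1, -1);
        break;
    case EV_NEXT_STEP:
        next_step_reached = 1;
        break;
    }
}

/* Drives the whole cascade; returns 1 if the next step was reached. */
int run_quiesce_demo(void)
{
    head = tail = 0;
    next_step_reached = 0;
    events_processed = 0;
    push_event(EV_BASE_QUIET_REQUEST, -1, -1);
    while (head < tail) {
        event_t ev = queue[head++];
        handle_event(&ev);
        events_processed++;
    }
    return next_step_reached;
}
```

Calling `run_quiesce_demo()` from a `main()` walks the full chain. The point of the sketch is that no step waits inside a function call: each handler only pushes the next event and returns, so completion is detected by counting down the "done" events.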
> I tried to implement something like you described. It is not yet
> event-driven, but before continuing I wanted to get some feedback on
> whether it is at least the right start:
> I looked at the other ORTE_OOB_* macros and tried to model my
> functionality a bit after what I have seen there. Right now it is still
> a simple function which just tries to call ft_event() on all oob
> components. Does this look right so far?
Sorry for the delay - yes, that looks like the right direction. I would suggest doing it via the current state machine, though, by simply defining another job or proc state in orte/mca/plm/plm_types.h, and then registering a callback function using the orte_state.add_job[proc]_state(state, function to be called, ORTE_ERR_PRI). Then you can activate it by calling ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the proper order.
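To make the register-then-activate pattern concrete, here is a minimal stand-alone sketch in C. It does not use the ORTE headers, and all names (`demo_add_job_state`, `demo_activate_job_state`, `demo_progress`, `demo_record_cb`, the `DEMO_STATE_*` values) are hypothetical stand-ins that only mimic the shape of the calls mentioned above. As a simplification, the priority is stored but not used for ordering; pending activations run in FIFO order.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_STATES  8
#define DEMO_MAX_PENDING 16

/* Hypothetical state values; real ones would be defined in
 * orte/mca/plm/plm_types.h as suggested above. */
enum { DEMO_STATE_QUIET_OOB = 100, DEMO_STATE_OOB_QUIET_DONE = 101 };

typedef void (*demo_cbfunc_t)(int state, void *jdata);

typedef struct { int state; demo_cbfunc_t cbfunc; int priority; } demo_entry_t;
typedef struct { const demo_entry_t *entry; void *jdata; } demo_pending_t;

static demo_entry_t   demo_table[DEMO_MAX_STATES];
static size_t         demo_num_states;
static demo_pending_t demo_pending[DEMO_MAX_PENDING];
static size_t         demo_head, demo_tail;

/* Log of the states whose callbacks have run, in execution order. */
int    demo_log[DEMO_MAX_PENDING];
size_t demo_log_len;

/* Register a callback for a state; returns 0 on success, -1 if the
 * state is already registered or the table is full. */
int demo_add_job_state(int state, demo_cbfunc_t cbfunc, int priority)
{
    size_t i;
    for (i = 0; i < demo_num_states; i++)
        if (demo_table[i].state == state)
            return -1;
    if (demo_num_states == DEMO_MAX_STATES)
        return -1;
    demo_table[demo_num_states].state = state;
    demo_table[demo_num_states].cbfunc = cbfunc;
    demo_table[demo_num_states].priority = priority; /* stored, unused here */
    demo_num_states++;
    return 0;
}

/* Activating a state only queues it; the callback is NOT called here. */
int demo_activate_job_state(void *jdata, int state)
{
    size_t i;
    for (i = 0; i < demo_num_states; i++) {
        if (demo_table[i].state == state) {
            assert(demo_tail < DEMO_MAX_PENDING);
            demo_pending[demo_tail].entry = &demo_table[i];
            demo_pending[demo_tail].jdata = jdata;
            demo_tail++;
            return 0;
        }
    }
    return -1; /* no callback registered for this state */
}

/* Stand-in for the event loop: run queued callbacks in activation order.
 * Returns the number of callbacks that ran. */
size_t demo_progress(void)
{
    size_t ran = 0;
    while (demo_head < demo_tail) {
        demo_pending_t p = demo_pending[demo_head++];
        p.entry->cbfunc(p.entry->state, p.jdata);
        ran++;
    }
    return ran;
}

/* Example callback: records which state was handled. */
void demo_record_cb(int state, void *jdata)
{
    (void)jdata;
    assert(demo_log_len < DEMO_MAX_PENDING);
    demo_log[demo_log_len++] = state;
}
```

In this sketch, activating a state never runs the callback inline; it only queues it, and `demo_progress()` later runs the callbacks in the order they were activated. A callback may itself activate the next state, which is how a multi-step quieting sequence would chain.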