Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] problem in the ORTE notifier framework
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-05-27 14:25:57

Excellent points; Ralph and I chatted about this on the phone today --
we concur with George.

Bull -- would PERUSE work for you? I think you mentioned before that
it didn't seem attractive to you. I think George's point is that we
already have lots of hooks in place in the PML -- and they're called
PERUSE. So if we could use those hooks, then a) they're run-time
selectable already, and b) there's no additional cost in either
performance-critical or non-critical code paths (for the case where
these stats are not being collected) because PERUSE has been in the
code base for a long time.

I think the idea is that your callbacks could be invoked by the PERUSE
hooks and then they can do whatever they want -- increment counters,
conditionally invoke the ORTE notifier system, etc.
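
To make that concrete, here is a rough sketch of what such a callback
could look like. It is not the real PERUSE or ORTE notifier API: the
event names, the queue_stats_t struct, notify_stub() and on_event() are
all hypothetical stand-ins, and the timestamp is passed in by the caller
so the logic can be exercised without MPI. It combines the counter and
threshold idea with the time-based measurement George describes below.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical stand-ins for two request events (created, first
 * fragment sent). These are NOT the real PERUSE constants. */
typedef enum { EV_REQ_CREATED, EV_FIRST_FRAG_SENT } event_kind_t;

/* Hypothetical per-tool state handed to the callback. */
typedef struct {
    atomic_int delayed_count; /* requests that waited longer than max_wait */
    int        threshold;     /* notify once delayed_count exceeds this */
    double     max_wait;      /* seconds a request may sit in the queues */
    double     created_at;    /* creation time of the one tracked request */
    int        notifications; /* how often the notifier stand-in fired */
} queue_stats_t;

static void queue_stats_init(queue_stats_t *s, int threshold, double max_wait)
{
    assert(s != NULL);
    atomic_init(&s->delayed_count, 0);
    s->threshold = threshold;
    s->max_wait = max_wait;
    s->created_at = 0.0;
    s->notifications = 0;
}

/* Stand-in for a call into the ORTE notifier (assumption: the real
 * call would be made here instead of printing). */
static void notify_stub(queue_stats_t *s, int count, double waited)
{
    s->notifications++;
    fprintf(stderr, "%d requests delayed (last waited %.3f s)\n",
            count, waited);
}

/* Callback a hook would invoke; `now` is a timestamp in seconds.
 * For brevity it tracks a single outstanding request. */
static void on_event(event_kind_t kind, double now, void *param)
{
    queue_stats_t *s = (queue_stats_t *)param;
    assert(s != NULL);

    if (kind == EV_REQ_CREATED) {
        s->created_at = now;
        return;
    }

    /* EV_FIRST_FRAG_SENT: time spent waiting in internal queues. */
    double waited = now - s->created_at;
    if (waited > s->max_wait) {
        int count = atomic_fetch_add(&s->delayed_count, 1) + 1;
        if (count > s->threshold) {
            notify_stub(s, count, waited);
        }
    }
}
```

All of the counting and threshold logic lives in the tool's callback, so
nothing extra is added to the PML itself. A real tool would register
on_event() for the relevant events and key created_at by request rather
than keeping a single value.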

On May 27, 2009, at 11:34 AM, George Bosilca wrote:

> What is a generic threshold? And what is a counter? We have a policy
> against such coding standards, and to be honest I would like to stick
> to it. The reason is that the PML is a very complex piece of code, and
> I would like to keep it as easy to understand as possible. If people
> start adding #if/#endif all over the code, we are diverging from this
> goal.
> The only way to make this work is to call the notifier or some other
> framework in this "slow path" and let this other framework do its own
> logic to determine what and when to print. Of course the cost of this
> is a function call plus an atomic operation (which is already not
> cheap). It's starting to get expensive, even for a "slow path", which
> in this particular context is just one insertion in an atomic FIFO.
> If instead of counting the number of times we try to send the fragment,
> we switch to a time-based approach, this can be solved with the PERUSE
> calls. There is a callback when the request is created, and another
> callback when the first fragment is pushed successfully into the
> network. Computing the time between these two allows a tool to figure
> out how much time the request was waiting in some internal queues, and
> therefore how much delay this added to the execution time.
> george.
> On May 27, 2009, at 06:59 , Ralph Castain wrote:
> > ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)
> >
> >     opal_atomic_increment(counter);
> >     if (counter > threshold) {
> >         orte_notifier.api(.....)
> >     }
> > #endif

Jeff Squyres
Cisco Systems