Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] problem in the ORTE notifier framework
From: Nadia Derbey (Nadia.Derbey_at_[hidden])
Date: 2009-05-28 08:12:03


On Thu, 2009-05-28 at 05:57 -0600, Ralph Castain wrote:
> I agree with Terry here about being careful in pursuing this path.
> What I wouldn't want to have happen is to force anyone wanting to be
> notified of error events to have to also turn on peruse, which impacts
> the non-error code path.

Agreed, I missed that part!

Regards,
Nadia
>
> Again, I'm not entirely sure what you are trying to do here. As I
> understood the original RFC, it sounded like you wanted to track
> errors but only report them when they occurred a controlled number of
> times (as opposed to every time). I think this would better be done
> outside of peruse.
>
> If you are trying to track normal performance (e.g., trying to alert
> sys admins when networks aren't running as fast as they should), then
> that probably should be done inside of peruse. However, that
> definitely will impact the critical code path, so Terry's caution is
> definitely a concern.
>
>
> On Thu, May 28, 2009 at 12:55 AM, Nadia Derbey <Nadia.Derbey_at_[hidden]> wrote:
> On Wed, 2009-05-27 at 14:25 -0400, Jeff Squyres wrote:
> > Excellent points; Ralph and I chatted about this on the phone today --
> > we concur with George.
> >
> > Bull -- would peruse work for you? I think you mentioned before that
> > it didn't seem attractive to you.
>
>
> Well, it didn't because, from what I understood, the MPI program needs
> to be changed (register a callback routine for the event, activate the
> event, etc.), and this is something we wanted to avoid.
>
> Now, if we are allowed to
> 1. define new "internal" PERUSE events,
> 2. internally set the associated callback routines,
> why not use peruse? This, combined with the orte notifier framework,
> could do the job, I think.
>
> Regards,
> Nadia
>
>
> > I think George's point is that we
> > already have lots of hooks in place in the PML -- and they're called
> > peruse. So if we could use those hooks, then a) they're run-time
> > selectable already, and b) there's no additional cost in performance
> > critical/not-critical code paths (for the case where these stats are
> > not being collected) because PERUSE has been in the code base for a
> > long time.
> >
> > I think the idea is that your callbacks could be invoked by the peruse
> > hooks and then they can do whatever they want -- increment counters,
> > conditionally invoke the ORTE notifier system, etc.
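
The hook idea described above can be sketched as follows. This is a minimal, standalone illustration and not the PERUSE API: `hook_slot`, `hook_register`, `hook_activate`, `hook_fire`, `count_cb`, and the `EV_*` event names are all hypothetical. It is meant to show why an inactive hook costs only a flag test at the call site, and how a callback is free to do its own bookkeeping (here, incrementing a counter).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event identifiers; these are NOT real PERUSE event names. */
enum { EV_REQ_CREATED, EV_FIRST_FRAG_SENT, EV_COUNT };

/* Callback signature assumed for this sketch. */
typedef void (*hook_cb)(int event, void *param);

/* One slot per event: the callback, its user data, and a run-time switch. */
typedef struct {
    hook_cb cb;
    void   *param;
    int     active;
} hook_slot;

static hook_slot hooks[EV_COUNT];

/* Attach a callback to an event; the hook stays inactive until activated. */
static void hook_register(int event, hook_cb cb, void *param)
{
    assert(event >= 0 && event < EV_COUNT);
    hooks[event].cb = cb;
    hooks[event].param = param;
    hooks[event].active = 0;
}

/* Turn an event on or off at run time. */
static void hook_activate(int event, int on)
{
    assert(event >= 0 && event < EV_COUNT);
    hooks[event].active = on;
}

/* Call site: when the hook is inactive, the only cost is this test. */
static inline void hook_fire(int event)
{
    if (hooks[event].active && hooks[event].cb != NULL) {
        hooks[event].cb(event, hooks[event].param);
    }
}

/* Example callback: count occurrences in a caller-supplied counter. */
static void count_cb(int event, void *param)
{
    (void)event;
    (*(unsigned *)param)++;
}
```

A caller would register `count_cb` with a pointer to its own `unsigned` counter, activate the event, and place `hook_fire(EV_FIRST_FRAG_SENT)` at the point of interest. The callback could just as well compare the counter to a limit and call a notifier.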
> >
> >
> >
> > On May 27, 2009, at 11:34 AM, George Bosilca wrote:
> >
> > > > What is a generic threshold? And what is a counter? We have a policy
> > > > against such coding standards, and to be honest I would like to stick
> > > > to it. The reason is that the PML is a very complex piece of code, and
> > > > I would like to keep it as easy to understand as possible. If people
> > > > start adding #if/#endif all over the code, we are diverging from this
> > > > goal.
> > >
> > > > The only way to make this work is to call the notifier or some other
> > > > framework in this "slow path" and let this other framework do its own
> > > > logic to determine what and when to print. Of course the cost of this
> > > > is a function call plus an atomic operation (which is already not
> > > > cheap). It's starting to get expensive, even for a "slow path", which
> > > > in this particular context is just one insertion in an atomic FIFO.
> > >
> > > > If, instead of counting the number of times we try to send the
> > > > fragment, we switch to a time-based approach, this can be solved with
> > > > the PERUSE calls. There is a callback when the request is created, and
> > > > another callback when the first fragment is pushed successfully into
> > > > the network. Computing the time between these two allows a tool to
> > > > figure out how much time the request was waiting in some internal
> > > > queues, and therefore how much delay this added to the execution time.
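
The time-based approach described above can be sketched as two callbacks that share a small per-request record. This is a standalone illustration under stated assumptions: `request_timing`, `on_request_created`, `on_first_fragment_sent`, and `delay_exceeds` are hypothetical names, not PERUSE or Open MPI functions, and the timestamps are passed in explicitly (in a real tool they might come from a clock such as `MPI_Wtime()`).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-request record kept by the monitoring tool. */
typedef struct {
    double created_at;   /* time the request was created, in seconds */
    double queue_delay;  /* time until first fragment left; -1 until known */
} request_timing;

/* Would be called from the "request created" callback. */
static void on_request_created(request_timing *t, double now)
{
    assert(t != NULL);
    t->created_at = now;
    t->queue_delay = -1.0;  /* not yet known */
}

/* Would be called from the "first fragment pushed to the network" callback.
 * Returns the time the request spent waiting in internal queues. */
static double on_first_fragment_sent(request_timing *t, double now)
{
    assert(t != NULL && now >= t->created_at);
    t->queue_delay = now - t->created_at;
    return t->queue_delay;
}

/* Policy check a tool might apply: is the measured delay above a limit? */
static int delay_exceeds(const request_timing *t, double limit)
{
    return t->queue_delay >= 0.0 && t->queue_delay > limit;
}
```

With this shape, the decision of what to report and when stays entirely in the tool: the send path only triggers the two callbacks, and `delay_exceeds` (or any other policy) runs outside the PML.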
> > >
> > > george.
> > >
> > > On May 27, 2009, at 06:59 , Ralph Castain wrote:
> > >
> > > > ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)
> > > >
> > > > > #if WANT_NOTIFIER_VERBOSE
> > > > >     opal_atomic_increment(counter);
> > > > >     if (counter > threshold) {
> > > > >         orte_notifier.api(.....)
> > > > >     }
> > > > > #endif
> > >
> > > _______________________________________________
> > > devel mailing list
> > > devel_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >
> >
> >
> --
>
> Nadia Derbey <Nadia.Derbey_at_[hidden]>
>
>

-- 
Nadia Derbey <Nadia.Derbey_at_[hidden]>