Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] problem in the ORTE notifier framework
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-05-26 19:47:24


Sure, I can set up a WebEx (with international dial-ins) if it would be
useful.

On May 26, 2009, at 7:24 PM, Ralph Castain wrote:

> First, to answer Nadia's question: you will find that the init
> function for the module is already called when it is selected - see
> the code in orte/mca/base/notifier_base_select.c, lines 72-76 (in
> the trunk).
>
> It would be a good idea to tie into the sos work to avoid conflicts
> when it all gets merged back together, assuming that isn't a big
> problem for you.
>
> As for Jeff's suggestion: dealing with the performance hit problem
> is why I suggested ORTE_NOTIFIER_VERBOSE, modeled after the
> OPAL_OUTPUT_VERBOSE model. The idea was to compile it in -only- when
> the system is built for it - maybe using a --with-notifier-verbose
> configuration option. Frankly, some organizations would happily pay
> a small performance penalty for the benefits.
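
For illustration, here is a minimal sketch of what a compile-time-gated ORTE_NOTIFIER_VERBOSE could look like. The ORTE_NOTIFIER_WANT_VERBOSE flag, notifier_log() and notifier_calls are hypothetical names assumed for this example, not existing Open MPI symbols; the assumption is that a --with-notifier-verbose configure option would define the flag to 1.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical build flag: assumed to be set to 1 by a
 * --with-notifier-verbose configure option. Defaults to off. */
#ifndef ORTE_NOTIFIER_WANT_VERBOSE
#define ORTE_NOTIFIER_WANT_VERBOSE 0
#endif

/* Hypothetical counter, only here so the effect can be observed. */
int notifier_calls = 0;

/* Hypothetical stand-in for whatever the selected notifier module does. */
void notifier_log(int severity, const char *msg)
{
    assert(msg != NULL);
    ++notifier_calls;
    fprintf(stderr, "[notifier sev=%d] %s\n", severity, msg);
}

#if ORTE_NOTIFIER_WANT_VERBOSE
/* Built with notifier support: forward to the notifier. */
#define ORTE_NOTIFIER_VERBOSE(sev, msg) \
    do { notifier_log((sev), (msg)); } while (0)
#else
/* Built without it: expands to an empty statement, so the call site
 * costs nothing and its arguments are never evaluated. */
#define ORTE_NOTIFIER_VERBOSE(sev, msg) \
    do { } while (0)
#endif
```

A call site would then write something like ORTE_NOTIFIER_VERBOSE(1, "retry limit reached"); and in a default build the preprocessor drops it entirely, so no branch is added to the code path.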
>
> I would personally recommend that the notifier framework keep the
> stats so things can be compact and self-contained. We still get
> atomicity by allowing each framework/component/whatever to specify the
> threshold. Creating yet another system to do nothing more than track
> error/warning frequencies to decide whether or not to notify seems
> wasteful.
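
As a rough sketch of the idea of keeping the stats inside the notifier while letting each caller choose its own threshold: the notifier_site_t type and notifier_record() function below are hypothetical names invented for this example and are not part of the ORTE notifier framework. The reset-after-notify policy is also just one assumed choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-caller record: each framework/component owns one
 * and picks its own threshold. */
typedef struct {
    const char *name;      /* who is reporting */
    unsigned threshold;    /* notify once this many events accumulate */
    unsigned count;        /* events seen since the last notification */
} notifier_site_t;

/* Record one error/warning event. Returns true (and emits a message)
 * only when the caller's threshold is reached, then resets the count. */
bool notifier_record(notifier_site_t *site)
{
    assert(site != NULL && site->threshold > 0);
    if (++site->count < site->threshold) {
        return false;      /* below threshold: stay quiet */
    }
    site->count = 0;
    fprintf(stderr, "[notifier] %s: %u events\n",
            site->name, site->threshold);
    return true;
}
```

A component would declare something like notifier_site_t retries = { "btl_retry", 3, 0 }; and call notifier_record(&retries) on each event; only every third call would produce a notification.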
>
> Perhaps worth a phone call to decide on a path forward?
>
>
> On Tue, May 26, 2009 at 1:06 PM, Jeff Squyres <jsquyres_at_[hidden]>
> wrote:
> Nadia --
>
> Sorry I didn't get to jump in on the other thread earlier.
>
> We have made considerable changes to the notifier framework in a
> branch to better support "SOS" functionality:
>
> https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos
>
> Cisco and Indiana U. have been working on this branch for a while.
> A description of the SOS stuff is here:
>
> https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
>
> As for setting up an external web server with hg, don't bother --
> just get an account at bitbucket.org. They're free and allow you to
> host hg repositories there. I've used bitbucket to collaborate on
> code before it hits OMPI's SVN trunk with both internal and external
> OMPI developers.
>
> We can certainly move the opal-sos repo to bitbucket (or branch
> again off opal-sos to bitbucket -- whatever makes more sense) to
> facilitate collaborating with you.
>
> Back on topic...
>
> I'd actually suggest a combination of what has been discussed in the
> other thread. The notifier can be the mechanism that actually sends
> the output message, but it doesn't have to be the mechanism that
> tracks the stats and decides when to output a message. That can be
> separate logic, and therefore be more fine-grained (and potentially
> even specific to the MPI layer).
>
> The Big Question will be how to do this with zero performance impact
> when it is not being used. This has always been the difficult issue
> when trying to implement any kind of monitoring inside the core OMPI
> performance-sensitive paths. Even adding individual branches has
> met with resistance (in performance-critical code paths)...
>
>
>
>
> On May 26, 2009, at 10:59 AM, Nadia Derbey wrote:
>
> Hi,
>
> While having a look at the notifier framework under orte, I noticed
> that, the way it is written, the init routine for the selected module
> cannot be called.
>
> Attached is a small patch that fixes this issue.
>
> Regards,
> Nadia
>
> <orte_notifier_fix_select.patch><ATT14046023.txt>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems