Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] problem in the ORTE notifier framework
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-05-26 15:06:45


Nadia --

Sorry I didn't get to jump in on the other thread earlier.

We have made considerable changes to the notifier framework in a
branch to better support "SOS" functionality:

     https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos

Cisco and Indiana U. have been working on this branch for a while. A
description of the SOS stuff is here:

     https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

As for setting up an external web server with hg, don't bother -- just
get an account at bitbucket.org. They're free and allow you to host
hg repositories there. I've used bitbucket to collaborate on code
before it hits OMPI's SVN trunk with both internal and external OMPI
developers.

We can certainly move the opal-sos repo to bitbucket (or branch again
off opal-sos to bitbucket -- whatever makes more sense) to facilitate
collaborating with you.

Back on topic...

I'd actually suggest a combination of what has been discussed in the
other thread. The notifier can be the mechanism that actually sends
the output message, but it doesn't have to be the mechanism that
tracks the stats and decides when to output a message. That can be
separate logic, and therefore be more fine-grained (and potentially
even specific to the MPI layer).
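
As a rough sketch of that split (every name below is hypothetical and not an actual ORTE or OPAL symbol): the tracker owns the counters and the decision of when to report, while the notifier is just a pluggable function that emits whatever string it is handed.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names throughout; none of these are real ORTE symbols. */

/* Notifier side: only knows how to emit a message. */
typedef void (*sketch_notify_fn)(const char *msg);

/* Counts how many messages the example notifier has emitted. */
static int sketch_messages_sent = 0;

/* Example output mechanism: write to stderr. */
static void sketch_stderr_notify(const char *msg)
{
    sketch_messages_sent++;
    fprintf(stderr, "[notifier] %s\n", msg);
}

/* Tracking side: keeps the stats and decides when to notify. */
typedef struct {
    unsigned long count;      /* events seen so far */
    unsigned long threshold;  /* report every `threshold` events; 0 = never */
    sketch_notify_fn notify;  /* output mechanism, chosen elsewhere */
} sketch_tracker_t;

/* Record one event; hand a message to the notifier only at the threshold. */
static void sketch_tracker_event(sketch_tracker_t *t, const char *what)
{
    char buf[128];

    assert(t != NULL && t->notify != NULL);
    t->count++;
    if (t->threshold != 0 && t->count % t->threshold == 0) {
        snprintf(buf, sizeof(buf), "%s: %lu events", what, t->count);
        t->notify(buf);
    }
}
```

A caller in a higher layer would initialize a tracker with `{0, 3, sketch_stderr_notify}` and call `sketch_tracker_event(&t, "retransmit")` at each event of interest. Swapping the output mechanism then means changing one function pointer, and the tracking logic stays untouched.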

The Big Question will be how to do this with zero performance impact when
it is not being used. This has always been the difficult issue when
trying to implement any kind of monitoring inside the core OMPI
performance-sensitive paths. Even adding individual branches has met
with resistance (in performance-critical code paths)...
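
One common way to get literally zero cost when monitoring is off is to decide at build time, so the disabled case contains no branch at all. This is a minimal sketch of that general technique, not something proposed in this thread; the `SKETCH_ENABLE_MONITORING` switch and all other names are made up for illustration and are not Open MPI configure options.

```c
/* Hypothetical build-time switch; not a real Open MPI option. */
#ifndef SKETCH_ENABLE_MONITORING
#define SKETCH_ENABLE_MONITORING 0
#endif

/* Counter that the monitoring hook bumps when enabled. */
unsigned long sketch_event_count = 0;

#if SKETCH_ENABLE_MONITORING
/* Enabled: one increment on the fast path. */
#define SKETCH_COUNT_EVENT() ((void)(++sketch_event_count))
#else
/* Disabled: expands to nothing, so no load, store, or branch is added. */
#define SKETCH_COUNT_EVENT() ((void)0)
#endif

/* Stand-in for a performance-critical routine. */
int sketch_fast_path(int x)
{
    SKETCH_COUNT_EVENT();
    return x + 1;
}
```

The trade-off is that turning monitoring on requires a rebuild (for example with `-DSKETCH_ENABLE_MONITORING=1`). A runtime flag avoids the rebuild but reintroduces exactly the kind of extra branch in the critical path that tends to draw objections.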

On May 26, 2009, at 10:59 AM, Nadia Derbey wrote:

> Hi,
>
> While having a look at the notifier framework under orte, I noticed
> that the way it is written, the init routine for the selected module
> cannot be called.
>
> Attached is a small patch that fixes this issue.
>
> Regards,
> Nadia
>
> <orte_notifier_fix_select.patch><ATT14046023.txt>

-- 
Jeff Squyres
Cisco Systems