Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] problem in the ORTE notifier framework
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2009-05-28 03:45:55


To be more complete: we pull Hg from
http://www.open-mpi.org/hg/hgwebdir.cgi/ompi-svn-mirror/ ; are we
mistaken?

If not, the code in v1.3 seems to be different from the code in the trunk
...

Sylvain

On Thu, 28 May 2009, Nadia Derbey wrote:

> On Tue, 2009-05-26 at 17:24 -0600, Ralph Castain wrote:
>> First, to answer Nadia's question: you will find that the init
>> function for the module is already called when it is selected - see
>> the code in orte/mca/base/notifier_base_select.c, lines 72-76 (in the
>> trunk).
>
> Strange? Our repository is a clone of the trunk?
>>
> It's true that if I "hg update" to v1.3 I see that the fix is there.
>
> Regards,
> Nadia
>
>> It would be a good idea to tie into the sos work to avoid conflicts
>> when it all gets merged back together, assuming that isn't a big
>> problem for you.
>>
>> As for Jeff's suggestion: dealing with the performance hit problem is
>> why I suggested ORTE_NOTIFIER_VERBOSE, modeled after the
>> OPAL_OUTPUT_VERBOSE model. The idea was to compile it in -only- when
>> the system is built for it - maybe using a --with-notifier-verbose
>> configuration option. Frankly, some organizations would happily pay a
>> small performance penalty for the benefits.
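
A minimal sketch of the compile-in-only idea described above, assuming a hypothetical build flag WANT_NOTIFIER_VERBOSE that a configure option such as --with-notifier-verbose would define to 1. The names notifier_log and NOTIFIER_VERBOSE are illustrative only and are not the actual ORTE API. The double-parenthesis call style mirrors how OPAL_OUTPUT_VERBOSE is commonly invoked, so the whole argument list can vanish when the feature is not built in.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical build flag: assumed to be set to 1 by a configure option
 * such as --with-notifier-verbose. Not an actual Open MPI symbol. */
#ifndef WANT_NOTIFIER_VERBOSE
#define WANT_NOTIFIER_VERBOSE 0
#endif

/* Counts how many messages actually reached the notifier. */
int notifier_calls = 0;

/* Illustrative stand-in for a real notifier back end. */
void notifier_log(int severity, const char *msg)
{
    assert(msg != NULL);
    ++notifier_calls;
    printf("[notifier sev=%d] %s\n", severity, msg);
}

#if WANT_NOTIFIER_VERBOSE
/* Built in: the parenthesized argument list becomes a real call. */
#define NOTIFIER_VERBOSE(args) notifier_log args
#else
/* Not built in: the statement expands to nothing, so the arguments are
 * never evaluated and no branch is left in the calling code path. */
#define NOTIFIER_VERBOSE(args) do { } while (0)
#endif
```

A call site would look like `NOTIFIER_VERBOSE((1, "link retry limit reached"));`. With the flag left at 0, that line costs nothing at run time, which is the property the discussion is after.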
>>
>> I would personally recommend that the notifier framework keep the
>> stats so things can be compact and self-contained. We still get
>> atomicity by allowing each framework/component/whatever to specify the
>> threshold. Creating yet another system to do nothing more than track
>> error/warning frequencies to decide whether or not to notify seems
>> wasteful.
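
The self-contained variant argued for above could be sketched roughly as follows: the notifier owns the counters, and each caller only registers a threshold for its own event. Everything here (MAX_EVENTS, event_stat_t, notifier_set_threshold, notifier_record) is a hypothetical illustration under those assumptions, not code from the notifier framework, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_EVENTS 16   /* hypothetical fixed table size for this sketch */

typedef struct {
    unsigned count;      /* occurrences since the last notification */
    unsigned threshold;  /* occurrences required before notifying; 0 = unset */
} event_stat_t;

/* Statistics live inside the notifier, one slot per event id. */
static event_stat_t stats[MAX_EVENTS];

/* Called by a framework or component to choose its own threshold. */
void notifier_set_threshold(int event_id, unsigned threshold)
{
    assert(event_id >= 0 && event_id < MAX_EVENTS);
    stats[event_id].threshold = threshold;
    stats[event_id].count = 0;
}

/* Record one occurrence. Returns true only when the threshold was
 * reached and a message was actually emitted. */
bool notifier_record(int event_id, const char *msg)
{
    assert(event_id >= 0 && event_id < MAX_EVENTS);
    event_stat_t *s = &stats[event_id];

    if (s->threshold == 0) {
        return false;               /* no threshold registered: stay silent */
    }
    if (++s->count < s->threshold) {
        return false;               /* below threshold: just keep counting */
    }
    printf("[notifier] event %d seen %u times: %s\n",
           event_id, s->count, msg);
    s->count = 0;                   /* start a new counting window */
    return true;
}
```

For example, after `notifier_set_threshold(3, 2)`, the first `notifier_record(3, ...)` only counts and the second one emits the message and resets the counter.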
>>
>> Perhaps worth a phone call to decide path forward?
>>
>>
>> On Tue, May 26, 2009 at 1:06 PM, Jeff Squyres <jsquyres_at_[hidden]>
>> wrote:
>> Nadia --
>>
>> Sorry I didn't get to jump in on the other thread earlier.
>>
>> We have made considerable changes to the notifier framework in
>> a branch to better support "SOS" functionality:
>>
>>
>> https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos
>>
>> Cisco and Indiana U. have been working on this branch for a
>> while. A description of the SOS stuff is here:
>>
>> https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
>>
>> As for setting up an external web server with hg, don't bother
>> -- just get an account at bitbucket.org. They're free and
>> allow you to host hg repositories there. I've used bitbucket
>> to collaborate on code before it hits OMPI's SVN trunk with
>> both internal and external OMPI developers.
>>
>> We can certainly move the opal-sos repo to bitbucket (or
>> branch again off opal-sos to bitbucket -- whatever makes more
>> sense) to facilitate collaborating with you.
>>
>> Back on topic...
>>
>> I'd actually suggest a combination of what has been discussed
>> in the other thread. The notifier can be the mechanism that
>> actually sends the output message, but it doesn't have to be
>> the mechanism that tracks the stats and decides when to output
>> a message. That can be separate logic, and therefore be more
>> fine-grained (and potentially even specific to the MPI layer).
>>
>> The Big Question will be how to do this with zero performance
>> impact when it is not being used. This has always been the
>> difficult issue when trying to implement any kind of
>> monitoring inside the core OMPI performance-sensitive paths.
>> Even adding individual branches has met with resistance (in
>> performance-critical code paths)...
>>
>>
>>
>>
>>
>> On May 26, 2009, at 10:59 AM, Nadia Derbey wrote:
>>
>>
>>
>> Hi,
>>
>> While having a look at the notifier framework under
>> orte, I noticed that
>> the way it is written, the init routine for the
>> selected module cannot
>> be called.
>>
>> Attached is a small patch that fixes this issue.
>>
>> Regards,
>> Nadia
>>
>>
>> <orte_notifier_fix_select.patch><ATT14046023.txt>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> --
> Nadia Derbey <Nadia.Derbey_at_[hidden]>