Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Diagnostic framework for MPI
From: Nadia Derbey (Nadia.Derbey_at_[hidden])
Date: 2009-05-26 08:11:33


On Tue, 2009-05-26 at 05:35 -0600, Ralph Castain wrote:
> Hi Nadia
>
> We actually have a framework in the system for this purpose, though it
> might require some minor modifications to do precisely what you
> describe. It is the ORTE "notifier" framework - you will find it at
> orte/mca/notifier. There are several components, each of which
> supports a different notification mechanism (e.g., message into the
> sys log, smtp, and even "twitter").

Ralph,

Thanks a lot for your detailed answer. I'll have a look at the notifier
framework to see if it could serve our purpose. Actually, from what you
describe, it looks like it does.

Regards,
Nadia
>
> The system works by adding orte_notifier calls to the OMPI code
> wherever we deem it advisable to alert someone. For example, if we
> think a sys admin might want to be alerted when the number of IB send
> retries exceeds some limit, we add a call to orte_notifier to the IB
> code with:
>
> if (num_retries > threshold) {
>     orte_notifier.xxx(....);  /* "xxx" is a placeholder for the notifier call */
> }
>
> I believe we could easily extend this to support your proposed
> functionality. A couple of possibilities that immediately spring to
> mind would be:
>
> 1. you could create a new component (or we could modify the existing
> ones) that tracks how many times it is called for a given error, and
> only actually issues a notification for that specific error when the
> count exceeds a threshold. The downside of this approach is that the
> threshold would be uniform across all errors.
>
> 2. we could extend the current notifier APIs to add a threshold count
> upon which the notification is to be sent, perhaps creating a new
> macro ORTE_NOTIFIER_VERBOSE that takes the threshold as one of its
> arguments. We could then let each OMPI framework have a new
> "threshold" MCA param, thus allowing the sys admins to "tune" the
> frequency of error reporting by framework. Of course, we could let
> them get as detailed here as they want - they could even have
> "threshold" params for each component, function, or whatever. This
> would be combined with #1 above to alert only when the count exceeded
> the threshold for that specific error message.
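
A minimal sketch of the counting idea behind options 1 and 2, in plain
C. It does not use the real orte_notifier API: error_counter_t,
emit_notification() and notify_if_over_threshold() are hypothetical
names, and the fixed-size table is an assumption made to keep the
example short.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_TRACKED_ERRORS 64   /* hypothetical table size */

/* One counter per distinct error code (hypothetical structure). */
typedef struct {
    int           error_code;
    unsigned long count;
    int           in_use;
} error_counter_t;

static error_counter_t counters[MAX_TRACKED_ERRORS];

/* Stand-in for a real notification call; here it only prints. */
static void emit_notification(int error_code, unsigned long count,
                              const char *msg)
{
    fprintf(stderr, "error %d seen %lu times: %s\n", error_code, count, msg);
}

/*
 * Count one occurrence of error_code and notify only once the count
 * exceeds the caller-supplied threshold.
 * Returns 1 if a notification was issued, 0 otherwise.
 */
int notify_if_over_threshold(int error_code, unsigned long threshold,
                             const char *msg)
{
    error_counter_t *slot = NULL;

    for (int i = 0; i < MAX_TRACKED_ERRORS; i++) {
        if (counters[i].in_use && counters[i].error_code == error_code) {
            slot = &counters[i];          /* existing counter found */
            break;
        }
        if (!counters[i].in_use && slot == NULL) {
            slot = &counters[i];          /* remember first free slot */
        }
    }
    if (slot == NULL) {
        return 0;                         /* table full: dropped in this sketch */
    }
    if (!slot->in_use) {
        slot->in_use = 1;
        slot->error_code = error_code;
        slot->count = 0;
    }

    slot->count++;
    if (slot->count > threshold) {
        emit_notification(error_code, slot->count, msg);
        return 1;
    }
    return 0;
}
```

Passing the threshold at each call site corresponds to option 2; using a
single constant for every call would give the uniform threshold of
option 1.
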
>
> I'm sure you and others will come up with additional (probably better)
> ways of implementing this extension. My point here was simply to
> ensure you knew that the basic mechanism already exists, and to
> stimulate some thought as to how to use it for your proposed purpose.
>
> I would be happy to help you do so as this is something we (LANL) have
> put at a high priority - our sys admins on the large clusters really
> need the help.
>
> HTH
> Ralph
>
>
> On Mon, May 25, 2009 at 11:33 PM, Nadia Derbey <Nadia.Derbey_at_[hidden]>
> wrote:
> What: Warn the administrator when unusual events are occurring too
> frequently.
>
> Why: Such unusual events might be the symptom of some problem that can
> easily be fixed (by better tuning, for example).
>
> Where: Adds a new ompi framework
>
> -------------------------------------------------------------------
>
> Description:
>
> The objective of the Open MPI library is to make applications run to
> completion, given that no fatal error is encountered.
> In some situations, unusual events may occur. Since these events are
> not considered to be fatal enough, the library arbitrarily chooses to
> bypass them using a software mechanism, instead of actually stopping
> the application. But even though this choice helps in completing the
> application, it may frequently result in significant performance
> degradation. This is not an issue if such “unusual events” don't occur
> too frequently. But if they actually do, that might be representative
> of a real problem that could sometimes be easily avoided.
>
> For example, when mca_pml_ob1_send_request_start() starts a send
> request and faces a resource shortage, it silently calls
> add_request_to_send_pending() to queue that send request into the list
> of pending send requests in order to process it later on. If an
> adaptive mechanism is not provided at runtime to increase the receive
> queue length, at least a message can be sent to the administrator to
> let him do the tuning by hand before the next run.
>
> We had a look at other tracing utilities (like PMPI, PERUSE, VT), but
> found them either too high level or too intrusive at the application
> level.
>
> The “diagnostic framework” we'd like to propose would help capture
> such “unusual events” and trace them, while having a very low impact
> on performance. This is obtained by defining tracing routines that
> can be called from the ompi code. The collected events are aggregated
> per MPI process and only traced if a threshold has been reached.
> Another threshold (time threshold) can be used to condition the
> generation of subsequent traces for an already traced event.
>
> This is obtained by defining 2 MCA parameters and a rule:
> . the count threshold C
> . the time delay T
> The rule is: an event will only be traced if it happened C times, and
> it won't be traced more than once every T seconds.
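
The rule above can be sketched in C as follows. This is only an
illustration under stated assumptions, not the code from the patches:
diag_event_t and diag_event_record() are hypothetical names, the
counter is assumed to restart after each trace, and the current time is
passed in by the caller so the logic can be exercised without waiting.

```c
#include <time.h>

/* Per-event state (hypothetical layout). */
typedef struct {
    unsigned long count_threshold;   /* C: occurrences needed before tracing */
    double        min_interval_sec;  /* T: minimum delay between two traces */
    unsigned long pending;           /* occurrences since the last trace */
    time_t        last_trace;        /* time of the last emitted trace */
    int           traced_once;       /* 0 until the first trace is emitted */
} diag_event_t;

/*
 * Record one occurrence of the event at time 'now'.
 * Returns the number of occurrences to report if a trace should be
 * emitted now, or 0 if the event must stay silent.
 */
unsigned long diag_event_record(diag_event_t *ev, time_t now)
{
    unsigned long reported;

    ev->pending++;

    /* Count rule: stay silent until C occurrences have accumulated. */
    if (ev->pending < ev->count_threshold) {
        return 0;
    }

    /* Time rule: never trace more than once every T seconds. */
    if (ev->traced_once &&
        difftime(now, ev->last_trace) < ev->min_interval_sec) {
        return 0;
    }

    reported = ev->pending;
    ev->pending = 0;
    ev->last_trace = now;
    ev->traced_once = 1;
    return reported;
}
```

Whatever is left in 'pending' when the process ends is what a summary
at MPI_Finalize could report for events that never reached the
threshold.
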
>
> Thus, events happening at a very low rate will never generate a trace
> except one at MPI_Finalize summarizing:
>     [time] At finalize : 23 times : pre-allocated buffers all full, calling malloc
>
> Those happening "a little too much" will sometimes generate a trace
> saying something like:
>     [time] 1000 warnings : could not send in openib now, delaying
>     [time+12345 sec] 1000 warnings : could not send in openib now, delaying
>
> And events occurring at a high frequency will only generate a message
> every T seconds saying:
>     [time] 1000 warnings : adding buffers in the SRQ
>     [time+T] 1,234,567 warnings (in T seconds) : adding buffers in the SRQ
>     [time+2*T] 2,345,678 warnings (in T seconds) : adding buffers in the SRQ
>
> The count threshold and time delay are defined per event.
> They can also be defined as MCA parameters. In that case, the MCA
> parameter value overrides the per-event values.
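
One way to picture the override is the small C sketch below. The names
(diag_limits_t, diag_effective_limits(), DIAG_PARAM_UNSET) are
hypothetical, and the use of -1 as an "MCA parameter not set" marker is
an assumption of this example rather than something the RFC specifies.

```c
/* Hypothetical marker meaning "no MCA parameter value was given". */
#define DIAG_PARAM_UNSET (-1L)

/* Count threshold C and time delay T for one event. */
typedef struct {
    long count_threshold;
    long time_delay_sec;
} diag_limits_t;

/*
 * Start from the per-event defaults and let any MCA parameter value
 * that was actually set take precedence.
 */
diag_limits_t diag_effective_limits(diag_limits_t per_event,
                                    long mca_count_threshold,
                                    long mca_time_delay_sec)
{
    diag_limits_t out = per_event;

    if (mca_count_threshold != DIAG_PARAM_UNSET) {
        out.count_threshold = mca_count_threshold;
    }
    if (mca_time_delay_sec != DIAG_PARAM_UNSET) {
        out.time_delay_sec = mca_time_delay_sec;
    }
    return out;
}
```
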
>
> The following information is traced too:
> . job family
> . the local job id
> . the job vpid
>
> Another aspect of performance savings is that a mechanism à la
> show_help() can be used in order to let the HNP actually do the job.
>
> We started the implementation of this feature, so patches are
> available if needed. We are currently trying to set up hgweb on an
> external server.
>
> Since I'm an Open MPI newbie, I'm submitting this RFC to have your
> opinion about its usefulness, or even to know if there's an already
> existing mechanism to do this job.
>
> Regards,
> Nadia
>
> --
> Nadia Derbey <Nadia.Derbey_at_[hidden]>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Nadia Derbey <Nadia.Derbey_at_[hidden]>