Hi Nadia

We actually have a framework in the system for this purpose, though it might require some minor modifications to do precisely what you describe. It is the ORTE "notifier" framework, which you will find at orte/mca/notifier. There are several components, each of which supports a different notification mechanism (e.g., a message to syslog, SMTP, and even "twitter").

The system works by adding orte_notifier calls to the OMPI code wherever we deem it advisable to alert someone. For example, if we think a sys admin might want to be alerted when the number of IB send retries exceeds some limit, we add a call to orte_notifier to the IB code with:

/* pseudocode: num_retries and threshold stand for whatever the IB code tracks */
if (num_retries > threshold) {
    orte_notifier.xxx(....);
}

I believe we could easily extend this to support your proposed functionality. A couple of possibilities that immediately spring to mind would be:

1. You could create a new component (or we could modify the existing ones) that tracks how many times it is called for a given error, and only issues a notification for that specific error when the count exceeds a threshold. The drawback of this approach is that the threshold would be uniform across all errors.

2. We could extend the current notifier APIs to add a threshold count at which the notification is to be sent, perhaps creating a new macro ORTE_NOTIFIER_VERBOSE that takes the threshold as one of its arguments. We could then let each OMPI framework have a new "threshold" MCA param, thus allowing the sys admins to "tune" the frequency of error reporting by framework. Of course, we could let them get as detailed here as they want - they could even have "threshold" params for each component, function, or whatever. This would be combined with #1 above to alert only when the count exceeded the threshold for that specific error message.
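
To make the per-error counting idea a bit more concrete, here is a minimal sketch in C. Everything in it is hypothetical: error_counter_t, fake_notify() and count_and_maybe_notify() are names invented for illustration and are not part of the existing ORTE notifier API. It also assumes the counter is reset after each notification, which is only one of several reasonable policies.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-error counter; NOT part of the real ORTE notifier API. */
typedef struct {
    const char *msg;          /* text describing the error */
    unsigned long count;      /* occurrences since the last notification */
    unsigned long threshold;  /* notify only once count exceeds this value */
} error_counter_t;

/* Number of notifications issued so far (for illustration only). */
static unsigned long notifications_sent = 0;

/* Stand-in for a real notifier call (assumption: the real one would
 * go through whichever notifier component is selected). */
static void fake_notify(const char *msg, unsigned long count)
{
    notifications_sent++;
    fprintf(stderr, "notify: %s (seen %lu times)\n", msg, count);
}

/* Count one occurrence of the error; notify only when the count
 * exceeds the threshold. Returns 1 if a notification was issued. */
static int count_and_maybe_notify(error_counter_t *ec)
{
    assert(ec != NULL);
    ec->count++;
    if (ec->count > ec->threshold) {
        fake_notify(ec->msg, ec->count);
        ec->count = 0;  /* assumption: start counting again afterwards */
        return 1;
    }
    return 0;
}
```

With a threshold of 3, the first three calls stay silent and the fourth issues a notification. Because the threshold lives in the counter rather than in the component, each error can carry its own value, which could in turn be filled in from an MCA param.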

I'm sure you and others will come up with additional (probably better) ways of implementing this extension. My point here was simply to ensure you knew that the basic mechanism already exists, and to stimulate some thought as to how to use it for your proposed purpose.

I would be happy to help you do so as this is something we (LANL) have put at a high priority - our sys admins on the large clusters really need the help.

HTH
Ralph


On Mon, May 25, 2009 at 11:33 PM, Nadia Derbey <Nadia.Derbey@bull.net> wrote:
What: Warn the administrator when unusual events are occurring too
frequently.

Why: Such unusual events might be the symptom of some problem that can
easily be fixed (by better tuning, for example)

Where: Adds a new ompi framework

-------------------------------------------------------------------

Description:

The objective of the Open MPI library is to make applications run to
completion, given that no fatal error is encountered.
In some situations, unusual events may occur. Since these events are not
considered to be fatal enough, the library arbitrarily chooses to bypass
them using a software mechanism, instead of actually stopping the
application. But even though this choice helps in completing the
application, it may frequently result in significant performance
degradation. This is not an issue if such “unusual events” don't occur
too frequently. But if they actually do, that might be representative of
a real problem that could sometimes be easily avoided.

For example, when mca_pml_ob1_send_request_start() starts a send request
and faces a resource shortage, it silently calls
add_request_to_send_pending() to queue that send request into the list
of pending send requests in order to process it later on. If an adaptive
mechanism is not provided at runtime to increase the receive queue
length, at least a message can be sent to the administrator to let him
do the tuning by hand before the next run.

We had a look at other tracing utilities (like PMPI, PERUSE, VT), but
found them either too high level or too intrusive at the application
level.

The “diagnostic framework” we'd like to propose would help capture
such “unusual events” and trace them, while having a very low impact
on performance. This is obtained by defining tracing routines that
can be called from the ompi code. The collected events are aggregated
per MPI process and only traced if a threshold has been reached. Another
threshold (time threshold) can be used to limit the generation of
subsequent traces for an already traced event.

This is obtained by defining 2 mca parameters and a rule:
. the count threshold C
. the time delay T
The rule is: an event will only be traced if it happened C times, and it
won't be traced more than once every T seconds.
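
As an illustration only, the rule could be sketched in C roughly as follows. The names diag_event_t and diag_event_record() are hypothetical and are not taken from the patches. The sketch assumes that occurrences keep accumulating while a trace is suppressed by the time delay, and that the first trace is emitted as soon as the count threshold is reached.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical per-event state; names are illustrative only. */
typedef struct {
    const char *msg;               /* event description */
    unsigned long pending;         /* occurrences since the last trace */
    unsigned long count_threshold; /* count threshold */
    double delay_sec;              /* time delay T, in seconds */
    time_t last_trace;             /* time of the last emitted trace */
    int traced_once;               /* 0 until the first trace is emitted */
} diag_event_t;

/* Record one occurrence at time 'now'. Returns the number of occurrences
 * reported in the trace, or 0 if the trace was suppressed. */
static unsigned long diag_event_record(diag_event_t *ev, time_t now)
{
    unsigned long reported;

    assert(ev != NULL);
    ev->pending++;

    /* Not enough occurrences yet: stay silent. */
    if (ev->pending < ev->count_threshold) {
        return 0;
    }
    /* Already traced less than T seconds ago: keep counting, stay silent. */
    if (ev->traced_once && difftime(now, ev->last_trace) < ev->delay_sec) {
        return 0;
    }

    reported = ev->pending;
    printf("[%ld] %lu warnings : %s\n", (long)now, reported, ev->msg);

    ev->pending = 0;
    ev->last_trace = now;
    ev->traced_once = 1;
    return reported;
}
```

Passing the current time in as an argument keeps the sketch easy to exercise; a caller would normally pass time(NULL).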

Thus, events happening at a very low rate will never generate a trace
except one at MPI_Finalize summarizing:
[time] At finalize : 23 times : pre-allocated buffers all full, calling
malloc

Those happening "a little too much" will sometimes generate a trace
saying something like:
[time] 1000 warnings : could not send in openib now, delaying
[time+12345 sec] 1000 warnings : could not send in openib now, delaying

And events occurring at a high frequency will only generate a message
every T seconds saying:
[time]     1000 warnings : adding buffers in the SRQ
[time+T]   1,234,567 warnings (in T seconds) : adding buffers in the SRQ
[time+2*T] 2,345,678 warnings (in T seconds) : adding buffers in the SRQ

The count threshold and time delay are defined per event.
They can also be defined as MCA parameters. In that case, the mca
parameter value overrides the per event values.

The following information is traced too:
 . job family
 . the local job id
 . the job vpid

Another aspect of performance savings is that a mechanism à la
show_help() can be used in order to let the HNP actually do the job.

We have started implementing this feature, so patches are available if
needed. We are currently trying to set up hgweb on an external server.

Since I'm an Open MPI newbie, I'm submitting this RFC to have your
opinion about its usefulness, or even to know if there's an already
existing mechanism to do this job.

Regards,
Nadia

--
Nadia Derbey <Nadia.Derbey@bull.net>
