Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Diagnostic framework for MPI
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-05-26 19:16:15


Nadia Derbey wrote:

>What: Warn the administrator when unusual events are occurring too
>frequently.
>
>Why: Such unusual events might be the symptom of some problem that can
>easily be fixed (by a better tuning, for example)
>
>
Before Sun HPC ClusterTools adopted the Open MPI code base (that is, in
CT6 and earlier), it included some performance analysis support called
MPProf. See
http://docs.sun.com/source/819-4134-10/profile.html#pgfId-999249 . The
key characteristic was supposed to be that it was very easy to use:
set an environment variable before running, run a report generator
afterwards, and read a self-explanatory report. Data volumes were
relatively small and so easy to manage.

One part in particular seemed germane to your RFC: advice on
implementation-specific environment variables. See
http://docs.sun.com/source/819-4134-10/profile.html#pgfId-1000209 . Sun
MPI had instrumentation embedded in it that looked for various
"performance conditions". Then, in post processing, the report
generator would translate that information into user-actionable
feedback. At least, that was the concept. The idea would be that all
user feedback should include:

*) a brief explanation of what happened ("you ran out of postboxes...
see Appendix A.1.b.23 of the user guide if you really dare to
understand what this means")
*) an estimate of how important this is ("we think this cost you 10%
performance")
*) a concise description of what to do to improve performance and a
discussion of the ramifications ("set the environment variable
MPI_NUMPOSTBOX to 256 and rerun; this will cost about 50 Mbytes more
memory per process")
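
To make the three-part feedback concrete, here is a rough sketch in C of
what one advice record and its formatter might look like. This is not
how MPProf was implemented; the struct, its field names, and
format_advice() are all hypothetical and only illustrate the shape of
the output.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one piece of tuning advice (illustration only). */
struct advice {
    const char *what;        /* brief explanation of what happened       */
    double      cost_pct;    /* estimated performance cost, in percent   */
    const char *action;      /* what the user should change              */
    const char *tradeoff;    /* ramifications of making that change      */
};

/*
 * Format one advice record into buf.
 * Returns the value of snprintf(): the number of characters the full
 * message needs, so a result >= len means the output was truncated.
 */
int format_advice(const struct advice *a, char *buf, size_t len)
{
    assert(a != NULL && buf != NULL);
    return snprintf(buf, len,
                    "Condition: %s\n"
                    "Estimated cost: %.0f%% of run time\n"
                    "Suggestion: %s\n"
                    "Trade-off: %s\n",
                    a->what, a->cost_pct, a->action, a->tradeoff);
}
```

A report generator could fill one such record per detected condition
(for example, what = "ran out of postboxes", cost_pct = 10, action =
"set MPI_NUMPOSTBOX to 256 and rerun") and print the formatted text,
sorted by estimated cost so the most important advice comes first.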

The feedback need not be limited to environment variables or
implementation-specific conditions. E.g., perhaps one could detect when
MPI_Ssend is used where MPI_Send would do, and estimate how much
performance that cost in unneeded synchronization.
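
One way to gather the raw data for the MPI_Ssend case is the standard
PMPI profiling interface. The sketch below is only an illustration: the
counters, record_ssend(), ssend_summary(), and the
SSEND_PROFILE_WITH_MPI macro are hypothetical names, and the MPI
wrappers are compiled only when that macro is defined so the accounting
part builds without MPI. Note that it measures total time spent inside
MPI_Ssend, which is an upper bound on the synchronization cost, not the
cost relative to MPI_Send.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-process counters (not part of any MPI library). */
static long   ssend_calls   = 0;
static double ssend_seconds = 0.0;

/* Account for one MPI_Ssend call that took `elapsed` seconds. */
void record_ssend(double elapsed)
{
    assert(elapsed >= 0.0);
    ssend_calls++;
    ssend_seconds += elapsed;
}

/* Write a one-line summary into buf; returns snprintf()'s result. */
int ssend_summary(char *buf, size_t len)
{
    return snprintf(buf, len, "MPI_Ssend: %ld calls, %.3f s total",
                    ssend_calls, ssend_seconds);
}

#ifdef SSEND_PROFILE_WITH_MPI
#include <mpi.h>

/* Intercept MPI_Ssend and forward to the real one via PMPI_Ssend.
 * Older (pre-MPI-3) mpi.h headers declare buf as plain `void *`. */
int MPI_Ssend(const void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Ssend(buf, count, type, dest, tag, comm);
    record_ssend(MPI_Wtime() - t0);
    return rc;
}

/* Emit the summary once, just before MPI shuts down. */
int MPI_Finalize(void)
{
    char line[128];
    ssend_summary(line, sizeof line);
    fprintf(stderr, "%s\n", line);
    return PMPI_Finalize();
}
#endif /* SSEND_PROFILE_WITH_MPI */
```

Linked into an application ahead of the MPI library, the wrappers would
print one line per process at MPI_Finalize. Turning that into "this cost
you N%" would still need a post-processing step that compares the total
against overall run time.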