
Open MPI Development Mailing List Archives


Subject: [OMPI devel] OpenMPI Performance Problem with Open|SpeedShop
From: William Hachfeld (wdh_at_[hidden])
Date: 2009-01-09 00:39:23

OpenMPI Developers,

I am one of the developers on the Open|SpeedShop project. Lately I've
been working on developing a new MRNet/Dyninst-based daemon for
Open|SS. The daemon is generally working, but I'm running into an
issue when using it with OpenMPI applications. Since I have zero
knowledge of the internal workings of OpenMPI, I'm hoping one of you
can provide me with some ideas...

My test application is SMG2000, built and run with OpenMPI v1.2.4 on
the Yellowrail system at LANL. Yellowrail is a small-scale version of
the Roadrunner platform.

If I run SMG2000 by itself on a single node ("mpirun -np 8 smg2000 -n
96 96 96"), I find the job completes within one minute. If, however, I
run SMG2000 in Open|SpeedShop with PC sampling enabled, the job runs
and runs and runs. If I disable the PC sampling during the run, I find
that the job quickly completes normally. Requesting the top ten
functions from Open|SS, I find there is an inordinately large amount
of time spent in the OpenMPI implementation:

        Exclusive CPU time
        in seconds. Function (defining location)

        110.16 mca_btl_sm_component_progress (
        9.94 opal_progress (
        5.37 mca_bml_r2_progress (
        0.78 hypre_SMGResidual (smg2000)
        0.72 ompi_request_wait_all (
        0.49 hypre_CyclicReduction (smg2000)
        0.31 hypre_StructVectorSetConstantValues (smg2000)
        0.31 hypre_StructMatrixSetBoxValues (smg2000)
        0.30 main (smg2000)
        0.20 hypre_SMG2BuildRAPSym (smg2000)

The longer I let SMG2000 run under PC sampling, the more samples I see
piled up inside OpenMPI functions. Clearly the Open|SS instrumentation
is interfering with the proper execution of the application -
specifically with the OpenMPI library. But I have been unable to
determine the mechanism by which this happens.

To provide you with a little background on how Open|SS collects the PC
sampling data... As I mentioned above, there is a daemon built on top
of MRNet and Dyninst which runs on each node. This daemon uses Dyninst
to attach to the processes in the MPI job and insert instrumentation
into them. This is accomplished via the ptrace() interface on Linux.
The Open|SS daemon uses Dyninst to load a data collection DSO into
each process. The PC sampling data collection DSO, when initialized,
registers a SIGPROF signal handler within the process and then sets up
a sampling timer via setitimer(). The timer is typically set up to
trigger 100 or 1000 times per second. The SIGPROF handler is a highly
optimized, short bit of code that just squirrels the PC value away
for later transport back to Open|SS.

Can any of the OpenMPI developers speculate as to possible mechanisms
by which the ptrace() attachment, the signal handler or timer
registration, or the corresponding signal delivery could cause large
amounts of time to be spent within the "progress" functions of the
OpenMPI library with an apparent lack of any real progress? Any ideas/
information would be greatly appreciated.

-- Bill Hachfeld, The Open|SpeedShop Project