I am one of the developers on the Open|SpeedShop (http://www.openspeedshop.org/) project. Lately I've been working on developing a new MRNet/Dyninst based daemon for Open|SS. The daemon is generally working, but I'm running into an issue when using it with OpenMPI applications. Since I have zero knowledge of the internal workings of OpenMPI, I'm hoping one of you can provide me some ideas...
My test application is SMG2000:
built and run with OpenMPI v1.2.4 on the Yellowrail system at LANL. Yellowrail is a small-scale version of the Roadrunner (http://www.lanl.gov/roadrunner/) platform.
If I run SMG2000 by itself on a single node ("mpirun -np 8 smg2000 -n 96 96 96"), I find the job completes within one minute. If, however, I run SMG2000 in Open|SpeedShop with PC sampling enabled, the job runs and runs and runs. If I disable the PC sampling during the run, I find that the job quickly completes normally. Requesting the top ten functions from Open|SS, I find there is an inordinately large amount of time spent in the OpenMPI implementation:
Exclusive CPU time
in seconds. Function (defining location)
110.16 mca_btl_sm_component_progress (libmpi.so.0)
9.94 opal_progress (libopen-pal.so.0)
5.37 mca_bml_r2_progress (libmpi.so.0)
0.78 hypre_SMGResidual (smg2000)
0.72 ompi_request_wait_all (libmpi.so.0)
0.49 hypre_CyclicReduction (smg2000)
0.31 hypre_StructVectorSetConstantValues (smg2000)
0.31 hypre_StructMatrixSetBoxValues (smg2000)
0.30 main (smg2000)
0.20 hypre_SMG2BuildRAPSym (smg2000)
The longer I let SMG2000 run under PC sampling, the more samples I see piled up inside OpenMPI functions. Clearly the Open|SS instrumentation is interfering with the proper execution of the application - the OpenMPI library specifically. But I have been unable to determine the mechanism by which this happens.
To provide you with a little background on how Open|SS collects the PC sampling data... As I mentioned above, there is a daemon built on top of MRNet and Dyninst which runs on each node. This daemon uses Dyninst to attach to the processes in the MPI job and insert instrumentation into them. This is accomplished via the ptrace() interface on Linux. The Open|SS daemon uses Dyninst to load a data collection DSO into each process. The PC sampling data collection DSO, when initialized, registers a SIGPROF signal handler within the process and then sets up a sampling timer via setitimer(). The timer is typically set up to trigger 100 or 1000 times per second. The SIGPROF handler is a highly optimized, short bit of code that just squirrels the PC value away for later transport back to Open|SS.
Can any of the OpenMPI developers speculate as to possible mechanisms by which the ptrace() attachment, signal handler, or timer registration and corresponding signal delivery could cause large amounts of time to be spent within the "progress" functions of the OpenMPI library with an apparent lack of any real progress? Any ideas/information would be greatly appreciated.
-- Bill Hachfeld, The Open|SpeedShop Project