
Instrumentation and Performance Analysis for Finding Memory Bottlenecks

Jeff Hollingsworth, Luiz Derose, K. Ekanadham. University of Maryland; Thomas J. Watson Research Center.




  1. Instrumentation and Performance Analysis for Finding Memory Bottlenecks
     Jeff Hollingsworth, Luiz Derose, K. Ekanadham
     University of Maryland; Thomas J. Watson Research Center

  2. Using Data Cache Sampling
     • Hardware requirements:
       • Periodic interrupt on cache miss
       • Ability to determine miss address
     • Associate count with each object
       • Variable or dynamically allocated memory
     • Interrupt after every n cache misses
       • Obtain address of miss
       • Find object containing it and increment count
     • Advantage: simplicity
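The sampling loop on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the miss stream is a plain Python list standing in for hardware interrupts, and the names `attribute_misses`, `objects`, and the address ranges are all hypothetical.

```python
import bisect

def attribute_misses(miss_addresses, objects, n):
    """Charge every n-th cache miss to the object whose address range contains it.

    objects: list of (name, start, size) tuples with non-overlapping ranges.
    Returns a dict mapping object name to its sampled miss count.
    """
    ranges = sorted((start, start + size, name) for name, start, size in objects)
    starts = [r[0] for r in ranges]
    counts = {name: 0 for name, _, _ in objects}
    for i, addr in enumerate(miss_addresses, start=1):
        if i % n != 0:
            continue  # the (simulated) hardware interrupts only on every n-th miss
        # Binary search for the last range starting at or below the miss address.
        k = bisect.bisect_right(starts, addr) - 1
        if k >= 0 and addr < ranges[k][1]:
            counts[ranges[k][2]] += 1
    return counts

# Hypothetical layout: two 256-byte arrays, 600 misses in A followed by 400 in B.
objects = [("A", 0x1000, 0x100), ("B", 0x2000, 0x100)]
misses = [0x1000 + (i % 0x100) for i in range(600)] + \
         [0x2000 + (i % 0x100) for i in range(400)]
counts = attribute_misses(misses, objects, n=10)
print(counts)
```

Sampled addresses that fall outside every known object are simply dropped here; a real tool would presumably need a policy for them.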

  3. Experimental Evaluation
     • Implemented in simulation
       • Simulator uses ATOM binary rewriting tool
       • Instrument loads/stores for cache simulation
       • Instrument basic blocks for virtual cycle count
       • Simulates necessary hardware support
       • Sampling and n-way search run under simulation
     • Tested using SPEC 95 applications
       • tomcatv, swim, su2cor, mgrid, applu, compress, ijpeg
       • Sampled 1 in 50,000 misses

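  4. Quality of Results

     | Application | Variable | Actual rank | Actual % | Sample rank | Sample % |
     |-------------|----------|-------------|----------|-------------|----------|
     | tomcatv     | RY       | 1           | 22.5     | 2           | 17.6     |
     |             | RX       | 2           | 22.5     | 1           | 37.1     |
     |             | AA       | 3           | 15.0     | 5           | 10.1     |
     |             | DD       | 4           | 10.0     | 3           | 15.0     |
     |             | X        | 5           | 10.0     | 6           | 9.8      |
     |             | Y        | 6           | 10.0     | 7           | 0.2      |
     |             | D        | 7           | 10.0     | 4           | 10.2     |
     | applu       | A        | 1           | 22.9     | 2           | 23.0     |
     |             | B        | 2           | 22.9     | 3           | 19.9     |
     |             | C        | 3           | 22.6     | 1           | 25.8     |
     |             | D        | 4           | 17.4     | 4           | 16.7     |
     |             | rsd      | 5           | 6.9      | 5           | 7.7      |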

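  5. Varying Sampling Interval

     | Application | Variable | Actual rank | Actual % | Sample rank | Sample % |
     |-------------|----------|-------------|----------|-------------|----------|
     | tomcatv     | RY       | 1           | 22.5     | 1           | 22.6     |
     |             | RX       | 2           | 22.5     | 2           | 22.5     |
     |             | AA       | 3           | 15.0     | 3           | 15.6     |
     |             | DD       | 4           | 10.0     | 7           | 9.4      |
     |             | X        | 5           | 10.0     | 6           | 9.7      |
     |             | Y        | 6           | 10.0     | 4           | 10.5     |
     |             | D        | 7           | 10.0     | 5           | 9.8      |

     Lesson re-learned: randomly vary sampling interval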
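Why a fixed interval can mislead is easy to show on a synthetic stream. The sketch below is an assumption-laden toy, not the experiment from the slides: every 4th miss goes to a hypothetical object B (true share 25%), and the randomized interval is drawn uniformly from [n/2, 3n/2], which is only one possible choice of distribution.

```python
import random

def sample_points(total_misses, n, randomize, rng):
    """Yield the 0-based indices of the misses that trigger a sampling interrupt."""
    pos = -1
    while True:
        # Fixed interval: exactly n. Randomized: uniform in [n/2, 3n/2], mean n.
        pos += rng.randint(n // 2, n + n // 2) if randomize else n
        if pos >= total_misses:
            return
        yield pos

def sampled_share_of_b(total_misses, n, randomize, seed=0):
    """Estimate B's share of misses in a stream where every 4th miss hits B."""
    rng = random.Random(seed)
    samples = list(sample_points(total_misses, n, randomize, rng))
    hits_b = sum(1 for i in samples if i % 4 == 0)
    return hits_b / len(samples)

fixed = sampled_share_of_b(100_000, 10, randomize=False)
varied = sampled_share_of_b(100_000, 10, randomize=True)
print(f"true share 0.25, fixed interval {fixed:.3f}, randomized interval {varied:.3f}")
```

With a fixed interval of 10 the sampled indices are 9, 19, 29, ..., all odd, so B is never sampled even though it takes a quarter of the misses. Randomizing the interval breaks this lock-step with the program's periodic access pattern.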

  6. Cache Misses Due to Instrumentation

  7. Instrumentation Overhead

  8. Misses Over Time in Applu

  9. Simulation Overhead

  10. Sigma Goals
      • A research project
        • Less of a production tool than others from ACTC
      • Family of tools to understand caches
        • Focus on detailed statistics
        • Complement existing hardware counters
      • Ability to handle real applications
        • MPI and OpenMP programs
        • Fortran and C
      • Provide hints about restructuring
        • Padding (both inter and intra data structures)
        • Blocking

  11. Approach
      • Run instrumented program
        • Capture full information about memory use
        • Produce compact trace
        • Extract loops and memory strides
      • Post-execution tools
        • Memory profiler
          • Share of accesses due to each data structure
        • Cache Prediction Tool
          • Predict cache misses using symbolic equations
        • Detailed simulator
          • Full discrete event simulator

  12. Cache Prediction Tool
      • Predict cache misses
        • Operate on compact traces
        • Only expand to full trace if needed
      • Use algorithms developed for compilers
        • Re-use vectors
        • Cache miss equations
      • Capacity, cold, and conflict misses are identified
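To make the three miss categories concrete, here is a small classifier. It swaps in a different technique from the slide: the prediction tool works symbolically on compact traces, whereas this sketch brute-force simulates a direct-mapped cache next to a fully associative LRU cache of the same capacity and applies the usual textbook definitions. The function name and the cache geometry are assumptions made for illustration.

```python
from collections import OrderedDict

def classify_misses(addresses, num_lines, line_size):
    """Classify direct-mapped cache misses as cold, capacity, or conflict.

    cold:     first reference ever to the block
    capacity: would also miss in a fully associative LRU cache of equal size
    conflict: misses only because of the direct-mapped placement
    """
    seen = set()                  # blocks referenced at least once
    direct = [None] * num_lines   # direct-mapped cache: one block per line
    lru = OrderedDict()           # fully associative LRU cache, same capacity
    counts = {"cold": 0, "capacity": 0, "conflict": 0}
    for addr in addresses:
        block = addr // line_size
        # Fully associative lookup and update.
        fa_hit = block in lru
        if fa_hit:
            lru.move_to_end(block)
        else:
            lru[block] = True
            if len(lru) > num_lines:
                lru.popitem(last=False)  # evict the least recently used block
        # Direct-mapped lookup and update.
        idx = block % num_lines
        dm_hit = direct[idx] == block
        direct[idx] = block
        if not dm_hit:
            if block not in seen:
                counts["cold"] += 1
            elif not fa_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
        seen.add(block)
    return counts

# Two blocks that map to the same line of a 4-line cache with 16-byte lines.
ping_pong = classify_misses([0, 64, 0, 64, 0, 64], num_lines=4, line_size=16)
print(ping_pong)
```

In the ping-pong example both blocks would fit comfortably in the cache, so every miss after the first two is a conflict miss, the kind of problem that padding is meant to remove.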

  13. Iteration Space
      • Re-use vectors
        • Define points in the iteration space that access the same data
      • Miss equations
        • Describe points in the iteration space that cause misses on conflicts
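A re-use vector can be found by brute force for a small loop nest, which may help fix the idea. The sketch assumes a hypothetical two-deep loop over `(i, j)` containing the references `A[i][j]` and `A[i-1][j]`; compilers derive the same vector symbolically rather than by enumeration.

```python
from itertools import product

def reuse_vectors(n, m, ref_a, ref_b):
    """Return offsets r such that ref_b at iteration p + r touches the
    element that ref_a touched at the earlier iteration p.

    ref_a, ref_b: functions mapping an iteration (i, j) to an array element.
    """
    points = list(product(range(n), range(m)))  # iteration space, in execution order
    touched = {}
    for p in points:
        touched.setdefault(ref_a(*p), []).append(p)
    vectors = set()
    for q in points:
        for p in touched.get(ref_b(*q), []):
            if p < q:  # p executes before q (lexicographic order)
                vectors.add((q[0] - p[0], q[1] - p[1]))
    return vectors

# A[i][j] is read again as A[i-1][j] exactly one outer iteration later.
vectors = reuse_vectors(4, 4, lambda i, j: (i, j), lambda i, j: (i - 1, j))
print(vectors)
```

The single vector `(1, 0)` says the data is re-used one iteration of the outer loop later, so whether that re-use hits depends on how much other data the inner loop touches in between.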

  14. Structure of SIGMA Data Collection
      [Diagram; labels: source files, Sigma Compile/Link, .lst files, instrumented binary, program execution, trace files, dumpMap, .addr, Cache Simulator, Prediction Tool, MemoryRef Tool]

  15. Representing Program Execution
      • Capture full execution behavior
        • Record all basic blocks and memory addresses
        • Produces large traces (due to looping)
      • Trace compression
        • Maintain pattern buffer
        • Scan for repeating patterns
        • Extract memory strides
        • Repeat algorithms for nested loops
      [Figure: example compressed trace; labels: RPT, BLK1, BLK2, BLK3, ADR; record fields: Base, Count, Length, Stride]
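The stride-extraction step can be illustrated with a greedy run-length scheme. This is a simplified sketch and not the SIGMA trace format: it handles a single flat address stream, ignores basic blocks and nested loops, and keeps only base, count, and stride (the Length field named in the figure is omitted). The function names are hypothetical.

```python
def compress_strides(addresses):
    """Greedily fold constant-stride runs into (base, count, stride) records."""
    records = []
    i = 0
    while i < len(addresses):
        base = addresses[i]
        if i + 1 == len(addresses):
            records.append((base, 1, 0))  # a lone trailing address
            break
        stride = addresses[i + 1] - base
        count = 2
        # Extend the run while consecutive addresses keep the same stride.
        while (i + count < len(addresses)
               and addresses[i + count] - addresses[i + count - 1] == stride):
            count += 1
        records.append((base, count, stride))
        i += count
    return records

def expand(records):
    """Rebuild the full address stream from the compressed records."""
    return [base + k * stride for base, count, stride in records for k in range(count)]

trace = [100, 104, 108, 112, 300, 308, 316, 500]
records = compress_strides(trace)
print(records)
```

The compression is lossless, so `expand(records)` returns the original stream. A regular loop over an array collapses to one record regardless of trip count, which fits the observation on the next slide that compression ratio depends on regularity.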

  16. Trace Information
      • Compression ratio is a function of regularity
      • Slowdown depends on the fraction of instructions that load/store memory

  17. Cache Prediction Tool
      • Use compressed traces
      • Convert memory refs back to array refs
      • Solve Cache Miss Equations
        • Compute re-use vectors
        • Define misses as a system of linear equations
        • Use Omega library to solve
      • Provides
        • Count of misses
        • Information about iterations that cause misses

  18. Using Dyninst to Gather Data
      • Extend Dyninst to support memory ops
        • Load/store/prefetch instrumentation points
        • Done and working on Power and SPARC
      • Extend Dyninst AST to include effective address
        • Allows code to use memory address
      • Dyninst for SIGMA instrumentation provides
        • Multi-platform support
        • Dynamic control of instrumentation
        • Selection of specific functions, loops, memory ops
        • Possible use of CFGs to optimize instrumentation
