Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization. Alex Shye M.S. Thesis Defense Committee: Daniel A. Connors, Andrew R. Pleszkun, and Manish Vachharajani University of Colorado at Boulder Department of Electrical and Computer Engineering
DRACO Architecture Research Group
Thesis Statement • Hardware Performance Monitoring (HPM) can be utilized to provide a low-overhead alternative to current techniques for profiling run-time code behavior.
Introduction • Profile information is critical to success of profile-based optimizations • Point Profile - BB count, edge profile, etc. • Path Profile - correlated points • Off-line Path Profiling Methods: • Use static/dynamic instrumentation to gather full path profile • On-line Path Profiling Method: • Interpretation and MRET • Both incur high overhead!! • Slowdown of 2-3x with Pin for BB counting [Figure: example CFG with blocks A-G and weighted edges. Edge Profile: ABDFG 70-50. Path Profile: ABDFG 60, ACDFG 10, …]
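A minimal sketch of why a path profile carries more information than an edge profile. The diamond-shaped CFG and the counts below are hypothetical and are not the numbers from the slide's figure; `edge_profile` is an illustrative helper name.

```python
from collections import Counter

def edge_profile(path_profile):
    """Collapse a path profile {path: count} into per-edge counts."""
    edges = Counter()
    for path, count in path_profile.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += count
    return edges

# Two hypothetical path profiles over the same CFG
# (A -> B|C -> D -> E|F -> G). They exercise different paths.
profile_a = {"ABDEG": 50, "ACDFG": 50}
profile_b = {"ABDFG": 50, "ACDEG": 50}

# Both collapse to the same edge counts, so an edge profile alone
# cannot tell which of the two behaviors actually occurred.
same_edges = edge_profile(profile_a) == edge_profile(profile_b)
```

Here `same_edges` is `True` even though the two runs share no path, which is the correlation a path profile keeps and a point profile loses.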
Performance Monitoring • HPM through on-chip Performance Monitoring Units (PMUs) • Itanium, Pentium 4, PowerPC • Coarse-grained, fine-grained features • Obstacles to PMU profiling • Non-deterministic (sampling) • Sample aliasing • Less information • Compiler analysis can extend PMU information!!! Goal: Use sampled branch vectors on PMU to derive a path profile comparable to software path profiling techniques. [Figure: Itanium-2 PMU Features]
Contributions • Characterize the information provided by PMU sampling of branch vectors • Characterize the effect of compiler analysis on PMU information • Demonstrate the construction of a PMU-based path profiler
PMU Profiling Framework • Terminology - Branch Vector: Series of addresses from BTB; Partial Path: Path of ops in compiler IR [Figure: framework diagram with Online and Offline stages. Labels: Annotated Binary, PMU, perfmon interface, Kernel Buffer, Interrupt on kernel buffer overflow, Branch Vectors, Branch Vector Hash Table, Intermediate File, Compiler Analysis, Address Map, Partial Path Extensions, Dominator Analysis, Partial Paths, Path Profile Generation, Profile Information]
PMU Configuration • Itanium-2 PMU BTB masks • Taken Mask (All, T, NT, None) • Predicted Target Address Mask (All, Correct, Incorrect, None) • Predicted Predicate Mask (All, Correct, Incorrect, None) • Branch Type Mask (All, Indirect, Return, IP-relative) • Configuration depends on goal • Branch prediction performance? Building call graph? • PMU configured to sample only taken branches for path information • Not-taken branches can be inferred from the control flow graph
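A minimal sketch of how not-taken branches might be inferred when only taken branches are sampled. It assumes a simplified model in which every block has a single fall-through successor in code layout order; the block names, the `fall_through` map, and the function name are hypothetical and do not reflect the actual tool.

```python
def blocks_from_taken_branches(branch_vector, fall_through):
    """Rebuild the executed block sequence from taken branches only.

    branch_vector: list of (source_block, target_block) pairs, in
                   execution order, one per sampled taken branch.
    fall_through:  dict mapping each block to the next block in layout.
    """
    path = [branch_vector[0][0]]
    for (_, target), (next_source, _) in zip(branch_vector, branch_vector[1:]):
        block = target
        path.append(block)
        # Until the next taken branch, every branch must have fallen
        # through, i.e. it was not taken.
        while block != next_source:
            block = fall_through[block]
            path.append(block)
    path.append(branch_vector[-1][1])
    return path

# Hypothetical layout A, B, C, D, E, F with taken branches A->C and D->F.
fall_through = {"A": "B", "B": "C", "C": "D", "D": "E", "E": "F"}
recovered = blocks_from_taken_branches([("A", "C"), ("D", "F")], fall_through)
```

In this example `recovered` is `["A", "C", "D", "F"]`: the step from C to D is a fall-through that was never sampled directly.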
Partial Path Extensions • Compiler view of CFG can be used to extend paths • Extend until point of uncertainty • Up until Join Point • Down until Branch Point [Figure: CFG with BTB branch vector 1-2-3-4, showing the partial path from the branch vector and the extended partial path. Labels: Join Point, Branch Point]
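A minimal sketch of the extension rule, assuming the CFG is given as a successor map. The graph, block names, and helper names are hypothetical. The path grows upward while its first block has exactly one predecessor and downward while its last block has exactly one successor, so it stops at the first join point above and the first branch point below.

```python
def invert(succs):
    """Build a predecessor map from a successor map."""
    preds = {}
    for block, targets in succs.items():
        for target in targets:
            preds.setdefault(target, []).append(block)
    return preds

def extend_partial_path(path, succs, preds):
    """Extend a partial path up to a join point and down to a branch point."""
    path = list(path)
    # Upward: a single predecessor must have executed just before.
    while len(preds.get(path[0], [])) == 1 and preds[path[0]][0] not in path:
        path.insert(0, preds[path[0]][0])
    # Downward: a single successor must execute next.
    while len(succs.get(path[-1], [])) == 1 and succs[path[-1]][0] not in path:
        path.append(succs[path[-1]][0])
    return path

# Hypothetical CFG: block 1 is a join point, block 4 is a branch point.
succs = {"E": ["1"], "L": ["1"], "1": ["2"], "2": ["3", "X"],
         "3": ["4"], "X": ["4"], "4": ["6", "7"]}
extended = extend_partial_path(["2", "3"], succs, invert(succs))
```

Here `extended` is `["1", "2", "3", "4"]`. The `not in path` checks are an added guard so that the sketch terminates on cyclic graphs.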
Dominator Analysis • Dominator Analysis • Finds all blocks guaranteed to execute • Partial Path Extensions • Subset of dominator analysis • Constrained to a path • Terminology - Dominator: u dominates v if all paths from Entry to v include u; Post-dominator: u post-dominates v if all paths from v to Exit include u [Figure: CFG with BTB branch vector 1-2-3-4, showing the partial path from the branch vector and the basic blocks added with dominator analysis. Labels: Join Point, Branch Point]
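A minimal sketch of the dominator definitions, using the standard iterative data-flow formulation on a hypothetical diamond CFG. Post-dominators are computed by running the same routine on the reversed graph from the exit block. The function names and the graph are assumptions for illustration, not the compiler's implementation.

```python
def dominators(succs, entry):
    """Return {block: set of blocks that dominate it}."""
    nodes = set(succs) | {t for targets in succs.values() for t in targets}
    preds = {n: set() for n in nodes}
    for n, targets in succs.items():
        for t in targets:
            preds[t].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if not preds[n]:
                continue  # unreachable block, left untouched
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def reverse(succs):
    """Reverse every edge of the CFG."""
    rev = {}
    for n, targets in succs.items():
        for t in targets:
            rev.setdefault(t, []).append(n)
    return rev

# Hypothetical diamond: A -> B | C -> D, entry A, exit D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
dom = dominators(cfg, "A")
postdom = dominators(reverse(cfg), "D")

# Blocks guaranteed to execute whenever B is observed in a sample.
guaranteed = dom["B"] | postdom["B"]
```

With this graph `guaranteed` is `{"A", "B", "D"}`: observing B implies A ran before it and D runs after it, while C is not implied.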
Path Profile Generation • Combine compiler analysis and PMU branch vectors to generate a path profile comparable to software path profiling techniques • Issues: • Path of a branch vector inherently different • Random start and end of path - path ambiguity • Spans boundaries compiler-based paths do not • Number of paths increases exponentially • Must map PMU paths to compiler paths • Region Formation • Split partial paths • Path Matching • Path Crediting [Figure labels: BTB Trace, Hot Path]
Region Formation • Use region-based paths • Makes total # paths more manageable • Functions can be large • Create loop-based regions • Programs spend most of time in loops • Rules for Region R: • R must be single entry • R may not cross function boundaries • R may not cross loop boundaries [Figure: CFG with blocks A-Y partitioned into Region 1, Region 2, and Region 3]
Path Matching and Crediting • Path Matching • Find list of all paths that contain partial path • Path Crediting • Distribute partial path weight equally among matched paths • Ex. ABDLMOP, ABDEFHIK, OPRSUVX [Figure: CFG with blocks A-Y partitioned into Region 1, Region 2, and Region 3]
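A minimal sketch of matching and crediting, assuming paths are strings of single-letter block names and that "contains" means the partial path appears as a contiguous run of blocks. The region paths and sample weights are hypothetical, and the function names are illustrative.

```python
from collections import defaultdict

def contains(path, partial):
    """True if `partial` occurs as a contiguous run of blocks in `path`."""
    n = len(partial)
    return any(path[i:i + n] == partial for i in range(len(path) - n + 1))

def credit_paths(region_paths, partial_paths):
    """Split each partial path's weight equally among the paths it matches."""
    profile = defaultdict(float)
    for partial, weight in partial_paths.items():
        matches = [p for p in region_paths if contains(p, partial)]
        for p in matches:
            profile[p] += weight / len(matches)
    return dict(profile)

# Hypothetical region with four full paths and three sampled partial paths.
region_paths = ["ABDEG", "ABDFG", "ACDEG", "ACDFG"]
partial_paths = {"ABD": 6, "DFG": 4, "ACDFG": 1}
profile = credit_paths(region_paths, partial_paths)
```

Here "ABD" matches two paths and gives each 3, "DFG" matches two paths and gives each 2, and "ACDFG" matches only itself, so `profile` is `{"ABDEG": 3.0, "ABDFG": 5.0, "ACDFG": 3.0}`. Partial paths that match nothing are dropped in this sketch.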
Methodology • Experiments run on Itanium-2 with Linux 2.6.10 kernel • Developed tool using perfmon kernel interface and libpfm-3.1 to interface with PMU • Benchmarks • Set of SPEC2000 benchmarks • Compiled with the OpenIMPACT Research Compiler • Compared to full path profile gathered with a Pin path profiling tool
Effect of Sampling Period • Sampling Overhead due to: • Periodic interrupt, copying between buffers, hash table insertion
PMU vs Actual Instruction Distribution • Kullback-Leibler Divergence (Relative Entropy) • $d = \sum_{k=0} p_k \log_2 (p_k / q_k)$ • Relative measure of distance between two distributions
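A minimal sketch of the divergence computation in base 2. Which distribution plays the role of p and which of q (actual versus PMU-sampled) is not stated on the slide, so the two example distributions below are purely hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits between two distributions.

    p and q are equal-length sequences of probabilities. Terms with
    p_k == 0 contribute nothing; q_k must be nonzero wherever p_k > 0.
    """
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Hypothetical instruction distributions over two code regions.
p = [0.5, 0.5]
q = [0.25, 0.75]

d_same = kl_divergence(p, p)   # identical distributions
d_diff = kl_divergence(p, q)   # differing distributions
```

Identical distributions give a divergence of 0, and the value grows as the distributions drift apart. The measure is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.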
Code Coverage • Explore how PMU branch vectors translate to code coverage information • Code Coverage Types • Single BB: Simulates PC-sampling • Branch Vectors • Branch Vectors w/ Dom. Analysis • Coverage percentage is percent of actually covered code discovered with compiler-aided analysis of branch vectors [Caption: Number of Instructions and Actual Code Covered]
Hot Instruction Thresholds • For top 10-30% of instructions, code coverage does well (80-100%) • Drops off at around 40-50% of hot instructions
Stability • Across 20 runs, PMU code coverage varies ~5-10%
Multiple Runs • Regular Sampling: 1) gzip, parser, twolf improve greatly • Randomized Sampling may discover code regular sampling cannot
Partial Path Characteristics • Partial Path extensions increase length ~20% • However, splitting drastically decreases lengths • ~30% on function boundaries, ~20% more on loop back edges
Accuracy Results • Accuracy measured similarly to Wall’s weight matching scheme [Wall91] • Threshold = 0.125%
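A minimal sketch of a weight-matching style accuracy metric. This is one common reading of the scheme and may differ in detail from what the thesis uses: paths whose share of the actual flow meets the threshold form the hot set, the estimated profile nominates an equally sized set of its own hottest paths, and accuracy is the fraction of actual hot flow that the nominated set covers. The path names and counts are hypothetical.

```python
def weight_matching_accuracy(actual, estimated, threshold=0.00125):
    """Fraction of actual hot-path flow covered by the estimated hot paths.

    actual, estimated: dicts mapping path -> weight.
    threshold: minimum share of total actual flow for a path to be hot
               (0.00125 corresponds to 0.125%).
    """
    total = sum(actual.values())
    hot_actual = {p for p, w in actual.items() if w / total >= threshold}
    if not hot_actual:
        return 0.0
    # The estimate selects as many paths as the actual hot set holds.
    ranked = sorted(estimated, key=estimated.get, reverse=True)
    hot_estimated = set(ranked[:len(hot_actual)])
    matched = sum(actual[p] for p in hot_actual & hot_estimated)
    return matched / sum(actual[p] for p in hot_actual)

# Hypothetical profiles: the estimate ranks P4 above P3 by mistake.
actual = {"P1": 600, "P2": 300, "P3": 99, "P4": 1}
estimated = {"P1": 50, "P2": 30, "P4": 15, "P3": 5}
accuracy = weight_matching_accuracy(actual, estimated)
```

In this example the actual hot set is P1, P2, and P3, the estimate nominates P1, P2, and P4, and `accuracy` is 900/999, roughly 0.90. Missing a cold-ish path costs little, which is the point of weighting by actual flow.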
Conclusion • Motivates and presents initial results and rationale for PMU-based profiling • Characterizes branch vector sampling • Improves code coverage > 50% over PC-sampling • Branch vector paths are inter-procedural • Characterizes effect of compiler analysis • Partial path extensions increase length by ~20% • Dominator analysis on branch vectors improves code coverage > 50% • Demonstrates construction of a PMU-based path profiler • ~85% accurate at 1% overhead (at sampling period of 5M) Questions?