
Improving the Speed and Quality of Architectural Performance Evaluation




  1. Vijay S. Pai with contributions from: Derek Schuff, Milind Kulkarni Improving the Speed and Quality of Architectural Performance Evaluation Electrical and Computer Engineering Purdue University

  2. Outline • Intro to Reuse Distance Analysis • Contributions • Multicore-Aware Reuse Distance Analysis • Design • Results • Sampled Parallel Reuse Distance Analysis • Design: Sampling, Parallelism • Results • Application: selection of low-locality code

  3. Reuse Distance Analysis • Reuse Distance Analysis (RDA): an architecture-neutral locality profile • Number of distinct data elements referenced between a use and the reuse of a data element • Elements can be memory pages, disk blocks, cache blocks, etc. • Machine-independent model of locality • Predicts the hit ratio in a fully-associative LRU cache of any size • Hit ratio in a cache with X blocks = % of references with RD < X
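The prediction rule on this slide can be sketched in a few lines (an illustrative helper, not from the slides): given measured reuse distances, the hit ratio of a fully-associative LRU cache with X blocks is the fraction of references with RD < X.

```python
from math import inf

def hit_ratio(distances, cache_blocks):
    """Fraction of references whose reuse distance is below the cache size."""
    hits = sum(1 for d in distances if d < cache_blocks)
    return hits / len(distances)

# Distances for a short trace A B C C B A
distances = [inf, inf, inf, 0, 1, 2]
print(hit_ratio(distances, 2))  # RD 0 and 1 hit in a 2-block cache
```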

  4. Reuse Distance Analysis • Applications in performance modeling, optimization • Multiprogramming/scheduling interaction, phase prediction • Cache hint generation, restructuring code, data layout

  5. Reuse Distance Profile Example

  6. Reuse Distance Measurement • Maintain stack of all previous data addresses • For each reference: • Search stack for referenced address • Depth in stack = reuse distance • If not found, distance = ∞ • Remove from stack, push on top
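The stack algorithm on this slide can be written down directly (an illustrative Python sketch, not the tool's implementation):

```python
from math import inf

def reuse_distances(trace):
    """List-based stack algorithm: O(N*M) for N references, M distinct blocks."""
    stack = []          # stack[0] is the top (most recently used)
    out = []
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)   # depth in stack = reuse distance
            out.append(depth)
            stack.pop(depth)            # remove from stack...
        else:
            out.append(inf)             # first use: infinite distance
        stack.insert(0, addr)           # ...and push on top
    return out

print(reuse_distances(list("ABCCBA")))  # [inf, inf, inf, 0, 1, 2]
```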

  7. Example
  Address:  A B C C B A
  Distance: ∞ ∞ ∞ 0 1 2
  [Stack-snapshot diagram omitted]

  8. RDA Applications • VM page locality [Mattson 1970] • Cache performance prediction [Beyls01, Zhong03] • Cache hinting [Beyls05] • Code restructuring [Beyls06], data layout [Zhong04] • Application performance modeling [Marin04] • Phase prediction [Shen04] • Visualization, manual optimization [Beyls04,05,Marin08] • Modeling cache contention (multiprogramming) [Chandra05,Suh01,Fedorova05,Kim04]

  9. Measurement Methods • List-based stack algorithm is O(NM) • Balanced binary trees or splay trees O(NlogM) • [Olken81, Sugumar93] • Approximate analysis (tree compression) O(NloglogM) time and O(logM) space [Ding03]
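One common way to get a tree-based bound (a sketch in the spirit of the cited O(NlogM) algorithms, not their implementation; this variant is O(NlogN) over timestamps) keeps each block's last-access time and counts "live" timestamps with a Fenwick tree:

```python
from math import inf

class BIT:
    """Fenwick tree over reference timestamps."""
    def __init__(self, n):
        self.n = n
        self.t = [0] * (n + 1)
    def add(self, i, v):
        i += 1
        while i <= self.n:
            self.t[i] += v
            i += i & -i
    def prefix(self, i):  # sum over positions [0, i]
        i += 1
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

def reuse_distances_tree(trace):
    bit = BIT(len(trace))
    last = {}            # address -> timestamp of its previous use
    out = []
    for t, addr in enumerate(trace):
        if addr in last:
            p = last[addr]
            # live bits strictly between p and t = distinct blocks since p
            out.append(bit.prefix(t) - bit.prefix(p))
            bit.add(p, -1)   # old position is no longer the last access
        else:
            out.append(inf)
        bit.add(t, 1)
        last[addr] = t
    return out

print(reuse_distances_tree(list("ABCCBA")))
```

Each distinct block keeps exactly one live bit at its last-access time, so counting live bits in the window gives the number of distinct blocks referenced between use and reuse.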

  10. Contributions • Multicore-Aware Reuse Distance Analysis • First RDA to include sharing and invalidation • Study different invalidation timing strategies • Acceleration of Multicore RDA • Sampling, Parallelization • Demonstration of application: selection of low-locality code • Validation against full analysis, hardware • Prefetching model in RDA • Hybrid analysis

  11. Outline • Intro to Reuse Distance Analysis • Contributions • Multicore-Aware Reuse Distance Analysis • Design • Results • Sampled Parallel Reuse Distance Analysis • Design: Sampling, Parallelism • Results • Application: selection of low-locality code

  12. Extending RDA to Multicore • RDA defined for single reference stream • No prior work accounts for multithreading • Multicore-aware RDA accounts for invalidations and data sharing • Models locality of multi-threaded programs • Targets multicore processors with private or shared caches

  13. Multicore Reuse Distance • Invalidations cause additional misses in private caches • 2nd order effect: holes can be filled without eviction • Sharing affects locality in shared caches • Inter-thread data reuse (reduces distance to shared data) • Capacity contention (increases distance to unshared data)

  14. Invalidations [Stack diagram: trace A B C C A B A with a remote write to one block; invalidation-unaware analysis counts the reuse normally, while invalidation-aware analysis gives the invalidated block's reuse a distance of ∞ and leaves a hole in its stack slot]
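A minimal sketch of eager invalidation with holes (my simplification for illustration; the `'inv'` event encoding and the hole-filling rule are assumptions, not the tool's exact mechanism):

```python
from math import inf

HOLE = None  # marker for an invalidated slot in the stack

def mc_reuse_distances(trace):
    """trace holds (op, addr) pairs: op is 'ref' for a local reference
    or 'inv' for a remote write invalidating addr."""
    stack = []   # stack[0] is most recently used
    out = []
    for op, addr in trace:
        if op == 'inv':
            if addr in stack:
                stack[stack.index(addr)] = HOLE  # leave a hole, no eviction
            continue
        if addr in stack:
            i = stack.index(addr)
            # depth counts only live (non-hole) entries above the block
            out.append(sum(1 for e in stack[:i] if e is not HOLE))
            del stack[i]
        else:
            out.append(inf)          # first use, or reuse after invalidation
            if HOLE in stack:
                stack.remove(HOLE)   # a new block can fill a hole
        stack.insert(0, addr)
    return out
```

Referencing an invalidated block misses (∞ distance), and an incoming block fills a hole instead of growing the stack, matching the "filled without eviction" bullet above.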

  15. Invalidation Timing • Multithreaded interleaving is nondeterministic • If no races, invalidations can be propagated between write and next synchronization • Eager invalidation – immediately at write • Lazy invalidation – at next synchronization • Could increase reuse distance • Oracular invalidation – at previous sync. • Data-race-free (DRF) → will not be referenced by invalidated thread • Could decrease reuse distance

  16. Sharing [Stack diagram: trace A B C C A B A interleaved across threads]
  Distance (unaware):       ∞ ∞ ∞ 0 1 2
  Distance (sharing-aware): ∞ ∞ ∞ 0 2 2 1
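The gap between sharing-unaware (per-thread stacks) and sharing-aware (one stack over the merged stream) analysis can be shown on a toy interleaving (a hypothetical thread assignment, not necessarily the slide's diagram):

```python
from collections import defaultdict
from math import inf

def stack_distances(trace):
    """Single LRU stack over one reference stream."""
    stack, out = [], []
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            out.append(i)
            stack.pop(i)
        else:
            out.append(inf)
        stack.insert(0, addr)
    return out

# A toy interleaved stream of (thread id, address) pairs
stream = [(0, 'A'), (0, 'B'), (0, 'C'), (1, 'C'), (1, 'A'), (0, 'B'), (1, 'A')]

# Sharing-aware (shared cache): one stack over the merged stream
aware = stack_distances([a for _, a in stream])

# Sharing-unaware: a private stack per thread, results merged back in order
per_thread = defaultdict(list)
for tid, a in stream:
    per_thread[tid].append(a)
iters = {tid: iter(stack_distances(tr)) for tid, tr in per_thread.items()}
unaware = [next(iters[tid]) for tid, _ in stream]

print(aware)    # inter-thread reuse and contention change the distances
print(unaware)
```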

  17. MCRD Results

  18. Impact of Inaccuracy

  19. Summary So Far • Compared unaware and multicore-aware RDA to simulated caches • Private caches: unaware 37% error, aware 2.5% • Invalidation timing had a minor effect on accuracy • Shared caches: unaware 76+% error, aware 4.6% • Made RDA viable for multithreaded workloads

  20. Problems with Multicore RDA • RDA is slow in general • Even efficient implementations require O(log M) time per reference • Multithreading makes it worse • Serialization • Synchronization (expensive bus-locked operations on every program reference) • Goal: fast enough for programmers to use in the development cycle

  21. Accelerating Multicore RDA • Sampling • Parallelization

  22. Reuse Distance Sampling • Randomly select individual references • Draw the number of references to skip before the next sample from a geometric distribution • Expect 1 in n references sampled; n = 1,000,000 • Run in fast mode until the target reference is reached

  23. Reuse Distance Sampling • Monitor all references until the sampled address is reused (analysis mode) • Track unique addresses in a distance set • RD of the reuse reference = size of the distance set • Return to fast mode until the next sample

  24. Reuse Distance Sampling • Analysis mode is faster than full RDA • Full stack tracking not needed • Distance set implemented as a hash table
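The fast-mode/analysis-mode loop of the last three slides can be sketched for a single thread (a simplification over a recorded trace; the real tool instruments the running program, and the rate and skip-drawing details here are illustrative):

```python
import math
import random
from math import inf

def sampled_rd(trace, rate=1e-6, seed=0):
    """Sampled reuse distance over a recorded trace (simplified sketch)."""
    rng = random.Random(seed)
    out = []
    i = 0
    while i < len(trace):
        # Fast mode: geometric skip with mean ~ 1/rate references
        i += int(math.log(1.0 - rng.random()) / math.log(1.0 - rate))
        if i >= len(trace):
            break
        target = trace[i]
        dset = set()                      # distance set: unique addresses seen
        j = i + 1
        while j < len(trace) and trace[j] != target:
            dset.add(trace[j])
            j += 1
        # Reuse found: RD = distance set size; never reused: infinite
        out.append(len(dset) if j < len(trace) else inf)
        i = j + 1                         # resume fast mode after the reuse
    return out

# With a very high rate, sampling starts immediately; trace A B C C B A
print(sampled_rd(list("ABCCBA"), rate=0.999999, seed=1))
```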

  25. RD Sampling of MT Programs • Data Sharing • Invalidation • Invalidation of tracked address • Invalidation of address in the distance set

  26. RD Sampling of MT programs • Data Sharing • Analysis mode sees references from all threads • Reuse reference can be on any thread

  27. RD Sampling of MT programs • Invalidation of tracked address • ∞ distance

  28. RD Sampling of MT programs • Invalidation of address in distance set • Remove from set, increment hole count • New addresses “fill” holes (decrement count) • At reuse, RD = set size + hole count
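The invalidation rules of the last two slides can be sketched as the analysis-mode state for one sampled address (my simplified encoding; class and method names are illustrative):

```python
from math import inf

class SampleTracker:
    """Analysis-mode state for one sampled address (sketch)."""
    def __init__(self, target):
        self.target = target
        self.dset = set()    # distance set: unique addresses since the sample
        self.holes = 0       # invalidated entries waiting to be "filled"
        self.dead = False    # the tracked address itself was invalidated

    def reference(self, addr):
        """Feed one reference; returns the reuse distance when addr
        reuses the sampled target, else None."""
        if addr == self.target:
            return inf if self.dead else len(self.dset) + self.holes
        if addr not in self.dset:
            if self.holes:
                self.holes -= 1          # a new address fills a hole
            self.dset.add(addr)
        return None

    def invalidate(self, addr):
        """Remote write to addr."""
        if addr == self.target:
            self.dead = True             # reuse will see infinite distance
        elif addr in self.dset:
            self.dset.discard(addr)      # remove from set, remember a hole
            self.holes += 1
```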

  29. Parallel Measurement • Goals: Get parallelism in analysis, eliminate per-ref synchronization • 2 properties facilitate • Sampled analysis only tracks distance set, not whole stack • Allows separation of state • Exact timing of invalidations not significant • Allows delayed synchronization

  30. Parallel Measurement • Data Sharing • Each thread has its own distance set • All sets merged on reuse • At reuse, RD = merged set size

  31. Parallel Measurement • Invalidations • Other threads record write sets • On synchronization, write set contents invalidated from distance set

  32. Pruning • Analysis mode stays active until reuse • What if an address is never reused? • Program locality determines time spent in analysis mode • Periodically prune (remove & record) the oldest sample • Prune only if its distance is large enough, e.g. in the top 1% of distances seen so far • A size-relative threshold adapts to different input sizes
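The size-relative threshold can be sketched as follows (an illustrative decision helper; the function name and percentile mechanics are assumptions):

```python
def should_prune(current_distance, recorded_distances, top_frac=0.01):
    """Prune the oldest active sample only if its distance so far already
    ranks in the top fraction (e.g. 1%) of distances recorded so far."""
    if not recorded_distances:
        return False
    n = len(recorded_distances)
    cutoff = sorted(recorded_distances)[min(int(n * (1.0 - top_frac)), n - 1)]
    return current_distance >= cutoff
```

Because the cutoff tracks the distribution observed so far rather than a fixed constant, the same policy works across different input sizes.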

  33. Results • Comparison with full analysis • Histograms • Accuracy metric • Performance • Slowdown from native

  34. Example RD Histograms [Histograms; x-axis: reuse distance (64-byte blocks)]

  35. Example RD Histograms • Slowdown of full analysis perturbs the execution of spin-locks, inflating the 0-distance bin in the histogram [x-axis: reuse distance (64-byte blocks)]

  36. Example RD Histograms [Histograms; x-axis: reuse distance (64-byte blocks)]

  37. Results: Private Stacks • Error metric used by previous work: • Normalize histogram bins • Error E = Σ_i |f_i − s_i| • Accuracy = 1 − E / 2 • 91%-99% accuracy (avg 95.6%) • 177x faster than full analysis • 7.1x-143x slowdown from native (avg 29.6x) • Fast mode alone: 5.3x • 80.4% of references handled in fast mode
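The accuracy metric above is simple enough to state in code (a direct transcription of the slide's formula; the function name is mine):

```python
def histogram_accuracy(full_bins, sampled_bins):
    """Accuracy = 1 - E/2, with E = sum_i |f_i - s_i| over bins
    normalized to sum to 1 (so 0 <= E <= 2)."""
    f_total = sum(full_bins)
    s_total = sum(sampled_bins)
    err = sum(abs(f / f_total - s / s_total)
              for f, s in zip(full_bins, sampled_bins))
    return 1.0 - err / 2.0

print(histogram_accuracy([10, 20, 70], [12, 18, 70]))
```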

  38. Results: Shared Stacks • Shared reuse distances depend on all references by other threads • Not just to shared data • Relative execution rate matters • More variation in measurements and in real execution • Compare fully-parallel sample analysis mode to serialized sample analysis mode • Round-robin ensures threads progress at same rate as in non-sampled analysis

  39. FT Histogram [x-axis: reuse distance (64-byte blocks)]

  40. Performance Comparison • Single-thread sampling [Zhong08] • Instrumentation 2x-4x (compiler), 4x-10x (Valgrind) • Additional 10x-90x with analysis • Approximate non-random sampling [Beyls04] • 15x-25x (single-thread, compiler) • Valgrind on our benchmarks • Instrumentation 4x-75x (avg 23x) • Memcheck avg 97x

  41. Low-locality PC Selection • Application: Find code with poor locality to assist programmer optimization • e.g. n PCs account for y% of misses at cache size C • Select C such that miss ratio is 10%, find enough PCs to cover 75/80/90/95% of misses • Use weight-matching to compare selection against full analysis • Selection accuracy 91% - 92% for private and shared caches • In spite of reduced accuracy in parallel-shared
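The coverage-style selection above can be sketched as a greedy prefix over PCs ranked by miss count (an illustrative helper; the weight-matching comparison against full analysis is not shown here, and all names are hypothetical):

```python
def select_pcs(miss_counts, coverage=0.9):
    """Pick the fewest PCs that together account for `coverage` of all
    misses: rank PCs by miss count and take a prefix."""
    total = sum(miss_counts.values())
    picked, covered = [], 0
    for pc, misses in sorted(miss_counts.items(), key=lambda kv: -kv[1]):
        if covered >= coverage * total:
            break
        picked.append(pc)
        covered += misses
    return picked

print(select_pcs({'pc1': 50, 'pc2': 30, 'pc3': 15, 'pc4': 5}, 0.9))
```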

  42. Smarter Multithreaded Replacement • Shared cache management is challenging • Benefits of demand multiplexing • Cost of performance interference • Most work addresses multi-programming • Destructive interference only • Per-benchmark performance targets • Multi-threading presents opportunities and challenges • Constructive interference, process performance target • Reuse distance profiles can help understand needs • Work in progress!

  43. Conclusion • Two techniques to accelerate multicore-aware reuse distance analysis • Sampled analysis • Parallel analysis • Private caches: 96% accuracy, 30x native • Shared caches: 74/89% accuracy, 80/265x native • Demonstrate effectiveness for selection of code with low locality • 91% weight-matched coverage of PCs • Other applications in progress • Validated against hardware caches • 7-16% average error in miss prediction

  44. Questions?
