
Improving the Speed and Quality of Architectural Performance Evaluation




  1. Vijay S. Pai with contributions from: Derek Schuff, Milind Kulkarni Improving the Speed and Quality of Architectural Performance Evaluation Electrical and Computer Engineering Purdue University

  2. Outline • Intro to Reuse Distance Analysis • Contributions • Multicore-Aware Reuse Distance Analysis • Design • Results • Sampled Parallel Reuse Distance Analysis • Design: Sampling, Parallelism • Results • Application: selection of low-locality code

  3. Reuse Distance Analysis • Reuse Distance Analysis (RDA): an architecture-neutral locality profile • Number of distinct data elements referenced between a use and the reuse of a data element • Elements can be memory pages, disk blocks, cache blocks, etc. • Machine-independent model of locality • Predicts the hit ratio in a fully-associative LRU cache of any size • Hit ratio in a cache with X blocks = % of references with RD < X
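The prediction rule on this slide can be sketched in a few lines (an illustrative helper, not from the slides): given measured reuse distances, the hit ratio of a fully-associative LRU cache with X blocks is the fraction of references with RD < X.

```python
from math import inf

def hit_ratio(distances, cache_blocks):
    """Fraction of references whose reuse distance is below the cache size."""
    hits = sum(1 for d in distances if d < cache_blocks)
    return hits / len(distances)

# Distances for a short trace A B C C B A
distances = [inf, inf, inf, 0, 1, 2]
print(hit_ratio(distances, 2))  # RD 0 and 1 hit in a 2-block cache
```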

  4. Reuse Distance Analysis • Applications in performance modeling, optimization • Multiprogramming/scheduling interaction, phase prediction • Cache hint generation, restructuring code, data layout

  5. Reuse Distance Profile Example

  6. Reuse Distance Measurement • Maintain stack of all previous data addresses • For each reference: • Search stack for referenced address • Depth in stack = reuse distance • If not found, distance = ∞ • Remove from stack, push on top
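The stack algorithm on this slide can be written down directly (an illustrative Python sketch, not the tool's implementation):

```python
from math import inf

def reuse_distances(trace):
    """List-based stack algorithm: O(N*M) for N references, M distinct blocks."""
    stack = []          # stack[0] is the top (most recently used)
    out = []
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)   # depth in stack = reuse distance
            out.append(depth)
            stack.pop(depth)            # remove from stack...
        else:
            out.append(inf)             # first use: infinite distance
        stack.insert(0, addr)           # ...and push on top
    return out

print(reuse_distances(list("ABCCBA")))  # [inf, inf, inf, 0, 1, 2]
```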

  7. Example
  Address:  A B C C B A
  Distance: ∞ ∞ ∞ 0 1 2
  [Stack-snapshot diagram omitted]

  8. RDA Applications • VM page locality [Mattson 1970] • Cache performance prediction [Beyls01, Zhong03] • Cache hinting [Beyls05] • Code restructuring [Beyls06], data layout [Zhong04] • Application performance modeling [Marin04] • Phase prediction [Shen04] • Visualization, manual optimization [Beyls04,05,Marin08] • Modeling cache contention (multiprogramming) [Chandra05,Suh01,Fedorova05,Kim04]

  9. Measurement Methods • List-based stack algorithm is O(NM) • Balanced binary trees or splay trees O(NlogM) • [Olken81, Sugumar93] • Approximate analysis (tree compression) O(NloglogM) time and O(logM) space [Ding03]
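One common way to get a tree-based bound (a sketch in the spirit of the cited O(NlogM) algorithms, not their implementation; this variant is O(NlogN) over timestamps) keeps each block's last-access time and counts "live" timestamps with a Fenwick tree:

```python
from math import inf

class BIT:
    """Fenwick tree over reference timestamps."""
    def __init__(self, n):
        self.n = n
        self.t = [0] * (n + 1)
    def add(self, i, v):
        i += 1
        while i <= self.n:
            self.t[i] += v
            i += i & -i
    def prefix(self, i):  # sum over positions [0, i]
        i += 1
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

def reuse_distances_tree(trace):
    bit = BIT(len(trace))
    last = {}            # address -> timestamp of its previous use
    out = []
    for t, addr in enumerate(trace):
        if addr in last:
            p = last[addr]
            # live bits strictly between p and t = distinct blocks since p
            out.append(bit.prefix(t) - bit.prefix(p))
            bit.add(p, -1)   # old position is no longer the last access
        else:
            out.append(inf)
        bit.add(t, 1)
        last[addr] = t
    return out

print(reuse_distances_tree(list("ABCCBA")))
```

Each distinct block keeps exactly one live bit at its last-access time, so counting live bits in the window gives the number of distinct blocks referenced between use and reuse.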

  10. Contributions • Multicore-Aware Reuse Distance Analysis • First RDA to include sharing and invalidation • Study different invalidation timing strategies • Acceleration of Multicore RDA • Sampling, Parallelization • Demonstration of application: selection of low-locality code • Validation against full analysis, hardware • Prefetching model in RDA • Hybrid analysis

  11. Outline • Intro to Reuse Distance Analysis • Contributions • Multicore-Aware Reuse Distance Analysis • Design • Results • Sampled Parallel Reuse Distance Analysis • Design: Sampling, Parallelism • Results • Application: selection of low-locality code

  12. Extending RDA to Multicore • RDA defined for single reference stream • No prior work accounts for multithreading • Multicore-aware RDA accounts for invalidations and data sharing • Models locality of multi-threaded programs • Targets multicore processors with private or shared caches

  13. Multicore Reuse Distance • Invalidations cause additional misses in private caches • 2nd order effect: holes can be filled without eviction • Sharing affects locality in shared caches • Inter-thread data reuse (reduces distance to shared data) • Capacity contention (increases distance to unshared data)

  14. Invalidations [Stack diagram: trace A B C C A B A with a remote write to one block; invalidation-unaware analysis counts the reuse normally, while invalidation-aware analysis gives the invalidated block's reuse a distance of ∞ and leaves a hole in its stack slot]
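A minimal sketch of eager invalidation with holes (my simplification for illustration; the `'inv'` event encoding and the hole-filling rule are assumptions, not the tool's exact mechanism):

```python
from math import inf

HOLE = None  # marker for an invalidated slot in the stack

def mc_reuse_distances(trace):
    """trace holds (op, addr) pairs: op is 'ref' for a local reference
    or 'inv' for a remote write invalidating addr."""
    stack = []   # stack[0] is most recently used
    out = []
    for op, addr in trace:
        if op == 'inv':
            if addr in stack:
                stack[stack.index(addr)] = HOLE  # leave a hole, no eviction
            continue
        if addr in stack:
            i = stack.index(addr)
            # depth counts only live (non-hole) entries above the block
            out.append(sum(1 for e in stack[:i] if e is not HOLE))
            del stack[i]
        else:
            out.append(inf)          # first use, or reuse after invalidation
            if HOLE in stack:
                stack.remove(HOLE)   # a new block can fill a hole
        stack.insert(0, addr)
    return out
```

Referencing an invalidated block misses (∞ distance), and an incoming block fills a hole instead of growing the stack, matching the "filled without eviction" bullet above.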

  15. Invalidation Timing • Multithreaded interleaving is nondeterministic • If no races, invalidations can be propagated between write and next synchronization • Eager invalidation – immediately at write • Lazy invalidation – at next synchronization • Could increase reuse distance • Oracular invalidation – at previous sync. • Data-race-free (DRF) → will not be referenced by invalidated thread • Could decrease reuse distance

  16. Sharing [Stack diagram: trace A B C C A B A interleaved across threads]
  Distance (unaware):       ∞ ∞ ∞ 0 1 2
  Distance (sharing-aware): ∞ ∞ ∞ 0 2 2 1
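The gap between sharing-unaware (per-thread stacks) and sharing-aware (one stack over the merged stream) analysis can be shown on a toy interleaving (a hypothetical thread assignment, not necessarily the slide's diagram):

```python
from collections import defaultdict
from math import inf

def stack_distances(trace):
    """Single LRU stack over one reference stream."""
    stack, out = [], []
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            out.append(i)
            stack.pop(i)
        else:
            out.append(inf)
        stack.insert(0, addr)
    return out

# A toy interleaved stream of (thread id, address) pairs
stream = [(0, 'A'), (0, 'B'), (0, 'C'), (1, 'C'), (1, 'A'), (0, 'B'), (1, 'A')]

# Sharing-aware (shared cache): one stack over the merged stream
aware = stack_distances([a for _, a in stream])

# Sharing-unaware: a private stack per thread, results merged back in order
per_thread = defaultdict(list)
for tid, a in stream:
    per_thread[tid].append(a)
iters = {tid: iter(stack_distances(tr)) for tid, tr in per_thread.items()}
unaware = [next(iters[tid]) for tid, _ in stream]

print(aware)    # inter-thread reuse and contention change the distances
print(unaware)
```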

  17. MCRD Results

  18. Impact of Inaccuracy

  19. Summary So Far • Compared unaware and multicore-aware RDA to simulated caches • Private caches: unaware 37% error, aware 2.5% • Invalidation timing had a minor effect on accuracy • Shared caches: unaware 76+% error, aware 4.6% • Made RDA viable for multithreaded workloads

  20. Problems with Multicore RDA • RDA is slow in general • Even efficient implementations require O(log M) time per reference • Multithreading makes it worse • Serialization • Synchronization (expensive bus-locked operations on every program reference) • Goal: fast enough for programmers to use in the development cycle

  21. Accelerating Multicore RDA • Sampling • Parallelization

  22. Reuse Distance Sampling • Randomly select individual references • Draw the number of references to skip before the next sample from a geometric distribution • Expect 1 in n references sampled; n = 1,000,000 • Run in fast mode until the target reference is reached

  23. Reuse Distance Sampling • Monitor all references until the sampled address is reused (analysis mode) • Track unique addresses in a distance set • RD of the reuse reference = size of the distance set • Return to fast mode until the next sample

  24. Reuse Distance Sampling • Analysis mode is faster than full RDA • Full stack tracking not needed • Distance set implemented as a hash table
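The fast-mode/analysis-mode loop of the last three slides can be sketched for a single thread (a simplification over a recorded trace; the real tool instruments the running program, and the rate and skip-drawing details here are illustrative):

```python
import math
import random
from math import inf

def sampled_rd(trace, rate=1e-6, seed=0):
    """Sampled reuse distance over a recorded trace (simplified sketch)."""
    rng = random.Random(seed)
    out = []
    i = 0
    while i < len(trace):
        # Fast mode: geometric skip with mean ~ 1/rate references
        i += int(math.log(1.0 - rng.random()) / math.log(1.0 - rate))
        if i >= len(trace):
            break
        target = trace[i]
        dset = set()                      # distance set: unique addresses seen
        j = i + 1
        while j < len(trace) and trace[j] != target:
            dset.add(trace[j])
            j += 1
        # Reuse found: RD = distance set size; never reused: infinite
        out.append(len(dset) if j < len(trace) else inf)
        i = j + 1                         # resume fast mode after the reuse
    return out

# With a very high rate, sampling starts immediately; trace A B C C B A
print(sampled_rd(list("ABCCBA"), rate=0.999999, seed=1))
```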

  25. RD Sampling of MT Programs • Data Sharing • Invalidation • Invalidation of tracked address • Invalidation of address in the distance set

  26. RD Sampling of MT programs • Data Sharing • Analysis mode sees references from all threads • Reuse reference can be on any thread

  27. RD Sampling of MT programs • Invalidation of tracked address • ∞ distance

  28. RD Sampling of MT programs • Invalidation of address in distance set • Remove from set, increment hole count • New addresses “fill” holes (decrement count) • At reuse, RD = set size + hole count
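The invalidation rules of the last two slides can be sketched as the analysis-mode state for one sampled address (my simplified encoding; class and method names are illustrative):

```python
from math import inf

class SampleTracker:
    """Analysis-mode state for one sampled address (sketch)."""
    def __init__(self, target):
        self.target = target
        self.dset = set()    # distance set: unique addresses since the sample
        self.holes = 0       # invalidated entries waiting to be "filled"
        self.dead = False    # the tracked address itself was invalidated

    def reference(self, addr):
        """Feed one reference; returns the reuse distance when addr
        reuses the sampled target, else None."""
        if addr == self.target:
            return inf if self.dead else len(self.dset) + self.holes
        if addr not in self.dset:
            if self.holes:
                self.holes -= 1          # a new address fills a hole
            self.dset.add(addr)
        return None

    def invalidate(self, addr):
        """Remote write to addr."""
        if addr == self.target:
            self.dead = True             # reuse will see infinite distance
        elif addr in self.dset:
            self.dset.discard(addr)      # remove from set, remember a hole
            self.holes += 1
```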

  29. Parallel Measurement • Goals: Get parallelism in analysis, eliminate per-ref synchronization • 2 properties facilitate • Sampled analysis only tracks distance set, not whole stack • Allows separation of state • Exact timing of invalidations not significant • Allows delayed synchronization

  30. Parallel Measurement • Data Sharing • Each thread has its own distance set • All sets merged on reuse • At reuse, RD = merged set size

  31. Parallel Measurement • Invalidations • Other threads record write sets • On synchronization, write set contents invalidated from distance set

  32. Pruning • Analysis mode stays active until reuse • What if an address is never reused? • Program locality determines time spent in analysis mode • Periodically prune (remove & record) the oldest sample • Prune only if its distance is large enough, e.g. in the top 1% of distances seen so far • A size-relative threshold adapts to different input sizes
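The size-relative threshold can be sketched as follows (an illustrative decision helper; the function name and percentile mechanics are assumptions):

```python
def should_prune(current_distance, recorded_distances, top_frac=0.01):
    """Prune the oldest active sample only if its distance so far already
    ranks in the top fraction (e.g. 1%) of distances recorded so far."""
    if not recorded_distances:
        return False
    n = len(recorded_distances)
    cutoff = sorted(recorded_distances)[min(int(n * (1.0 - top_frac)), n - 1)]
    return current_distance >= cutoff
```

Because the cutoff tracks the distribution observed so far rather than a fixed constant, the same policy works across different input sizes.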

  33. Results • Comparison with full analysis • Histograms • Accuracy metric • Performance • Slowdown from native

  34. Example RD Histograms [Histograms; x-axis: reuse distance (64-byte blocks)]

  35. Example RD Histograms • Slowdown of full analysis perturbs the execution of spin-locks, inflating the 0-distance bin in the histogram [x-axis: reuse distance (64-byte blocks)]

  36. Example RD Histograms [Histograms; x-axis: reuse distance (64-byte blocks)]

  37. Results: Private Stacks • Error metric used by previous work: • Normalize histogram bins • Error E = Σ_i |f_i − s_i| • Accuracy = 1 − E / 2 • 91%-99% accuracy (avg 95.6%) • 177x faster than full analysis • 7.1x-143x slowdown from native (avg 29.6x) • Fast mode alone: 5.3x • 80.4% of references handled in fast mode
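The accuracy metric above is simple enough to state in code (a direct transcription of the slide's formula; the function name is mine):

```python
def histogram_accuracy(full_bins, sampled_bins):
    """Accuracy = 1 - E/2, with E = sum_i |f_i - s_i| over bins
    normalized to sum to 1 (so 0 <= E <= 2)."""
    f_total = sum(full_bins)
    s_total = sum(sampled_bins)
    err = sum(abs(f / f_total - s / s_total)
              for f, s in zip(full_bins, sampled_bins))
    return 1.0 - err / 2.0

print(histogram_accuracy([10, 20, 70], [12, 18, 70]))
```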

  38. Results: Shared Stacks • Shared reuse distances depend on all references by other threads • Not just to shared data • Relative execution rate matters • More variation in measurements and in real execution • Compare fully-parallel sample analysis mode to serialized sample analysis mode • Round-robin ensures threads progress at same rate as in non-sampled analysis

  39. FT Histogram [x-axis: reuse distance (64-byte blocks)]

  40. Performance Comparison • Single-thread sampling [Zhong08] • Instrumentation 2x-4x (compiler), 4x-10x (Valgrind) • Additional 10x-90x with analysis • Approximate non-random sampling [Beyls04] • 15x-25x (single-thread, compiler) • Valgrind on our benchmarks • Instrumentation 4x-75x (avg 23x) • Memcheck avg 97x

  41. Low-locality PC Selection • Application: Find code with poor locality to assist programmer optimization • e.g. n PCs account for y% of misses at cache size C • Select C such that miss ratio is 10%, find enough PCs to cover 75/80/90/95% of misses • Use weight-matching to compare selection against full analysis • Selection accuracy 91% - 92% for private and shared caches • In spite of reduced accuracy in parallel-shared
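The coverage-style selection above can be sketched as a greedy prefix over PCs ranked by miss count (an illustrative helper; the weight-matching comparison against full analysis is not shown here, and all names are hypothetical):

```python
def select_pcs(miss_counts, coverage=0.9):
    """Pick the fewest PCs that together account for `coverage` of all
    misses: rank PCs by miss count and take a prefix."""
    total = sum(miss_counts.values())
    picked, covered = [], 0
    for pc, misses in sorted(miss_counts.items(), key=lambda kv: -kv[1]):
        if covered >= coverage * total:
            break
        picked.append(pc)
        covered += misses
    return picked

print(select_pcs({'pc1': 50, 'pc2': 30, 'pc3': 15, 'pc4': 5}, 0.9))
```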

  42. Smarter Multithreaded Replacement • Shared cache management is challenging • Benefits of demand multiplexing • Cost of performance interference • Most work addresses multi-programming • Destructive interference only • Per-benchmark performance targets • Multi-threading presents opportunities and challenges • Constructive interference, process performance target • Reuse distance profiles can help understand needs • Work in progress!

  43. Conclusion • Two techniques to accelerate multicore-aware reuse distance analysis • Sampled analysis • Parallel analysis • Private caches: 96% accuracy, 30x native • Shared caches: 74/89% accuracy, 80/265x native • Demonstrate effectiveness for selection of code with low locality • 91% weight-matched coverage of PCs • Other applications in progress • Validated against hardware caches • 7-16% average error in miss prediction

  44. Questions?
