
Improving Cache Locality for Thread-Level Speculation Stanley Fung and J. Gregory Steffan


Presentation Transcript


  1. Improving Cache Locality for Thread-Level Speculation Stanley Fung and J. Gregory Steffan Electrical and Computer Engineering University of Toronto

  2. Chip Multiprocessors (CMPs) are Here! IBM Power 5, AMD Opteron, Intel Yonah. Use CMPs to improve sequential program performance?

  3. Exploiting CMPs: The Intuition CMPs have lots of distributed resources • Caches, branch predictors, processors Somehow distribute sequential programs • Use distributed resources to improve performance Increasingly aggressive approaches: • Prefetching (e.g., helper threads) • Transactions and transactional memory • Thread-Level Speculation (TLS) But distributing a sequential program is non-trivial…

  4. Exploiting CMPs: The Tension Sequential program locality vs. distributed CMP resources and parallelism [diagram: four processors, each with a private L1, sharing an L2] Our challenge: relaxing this tension

  5. Example: TLS Execution on 4 Processors [diagram: sequential execution on one active processor and L1 vs. TLS execution across four processors and L1s, compared by execution time] 4X total cache capacity → 4X cache performance?

  6. TLS on 4 CPU CMP: % Increase in Cache Misses 272.5% ~= 4X 4X total cache capacity → 4X increase in cache misses

  7. Opportunities for Improvement • Prefetching Effects • TLS indirectly prefetches from off-chip into L2 • Orthogonal to the focus of this work • “Locality Misses” • An L1 miss where the line is resident in another L1 • An indicator of both: • Broken locality • Opportunity to repair locality What fraction of misses are locality misses?

  8. TLS on 4 CPU CMP: % Locality Misses 61.1% significant locality misses: problem and opportunity

  9. Outline • Experimental Framework • Classification of Misses • Techniques for Reducing Misses • Combining Techniques • Impact on Scalability • Conclusion

  10.   Support for TLS Break programs into speculative threads • We use the compiler Track data dependences • We extend invalidation-based cache coherence Recover from failed speculation • We extend L1 data caches to buffer speculative state three key elements of every TLS system

  11. Compiler Support for TLS [flow diagram: Sequential Source Code → Region Selection (uses profile information; which loops?) → Transformation and Optimization (inserts TLS instructions) → MIPS Executable]

  12. Hardware Support for TLS [diagram: four processors with private L1s over a shared L2; each L1 cache line holds a tag, data, state, and speculative SL/SM bits] extend generic CMP's L1 caches and coherence
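To make the extended cache state concrete, here is a minimal C++ sketch of an L1 line augmented with speculative bits, assuming the SL/SM bits mark speculatively loaded and speculatively modified data as in coherence-based TLS schemes; the struct and function names are illustrative, not the authors' hardware.

```cpp
#include <cstdint>

// Hypothetical sketch of an L1 cache line extended for TLS, assuming SL/SM
// denote speculatively-loaded / speculatively-modified state. Illustrative only.
struct SpecLine {
    uint64_t tag   = 0;
    bool     valid = false;
    bool     dirty = false;
    bool     SL    = false;   // speculatively loaded by the current thread
    bool     SM    = false;   // speculatively modified (buffered, not yet committed)
    uint8_t  data[64] = {};
};

// An invalidation from a logically-earlier thread that hits a line we loaded
// speculatively signals a read-after-write dependence violation.
bool violates_dependence(const SpecLine& line, bool from_earlier_thread) {
    return from_earlier_thread && line.valid && line.SL;
}

// On commit, speculative state becomes ordinary dirty state; on a violation
// (squash), speculatively-modified lines are simply invalidated.
void commit(SpecLine& line) { if (line.SM) line.dirty = true;  line.SL = line.SM = false; }
void squash(SpecLine& line) { if (line.SM) line.valid = false; line.SL = line.SM = false; }
```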

  13. Experimental Framework • CMP with 4 CPUs (or more) • 4-way issue, out-of-order superscalar • Memory Hierarchy • Private L1 data caches: 32KB, 2-way • 2MB shared L2 cache • Bus interconnect • Not shown: results for crossbar interconnect • Benchmarks: SPEC INT 95 and 2000 • Speculatively parallelized

  14. TLS Cache Locality Problem: Our Investigation [investigation tree, root: Cache Locality Problem]

  15. TLS Cache Locality Problem: Our Investigation Cache Locality Problem: private cache architecture vs. shared cache architecture a shared cache solves locality problems (but slow)

  16. TLS Cache Locality Problem: Our Investigation Private cache architecture: data cache vs. instruction cache i-cache misses are insignificant; focus on d-cache

  17. TLS Cache Locality Problem: Our Investigation Data cache: parallel regions (miss patterns) and sequential regions (transitions)

  18. TLS Execution Stages and Transitions [timeline: sequential region → startup (little impact) → parallel region steady state (our main focus) → wind-down (has impact) → sequential region] wind-down transitions: scheduling the seq. region

  19. Scheduling the Sequential Region Floating sequential processor vs. fixed sequential processor (P0, P1, P2, P3); fixing the processor offers potential cache locality which is better?

  20. Performance of Fixed Relative to Floating Overall Program: 3.4% speedup fixed sequential processor is superior, at no cost

  21. TLS Cache Locality Problem: Our Investigation Data cache: parallel regions (miss patterns) and sequential regions (transitions); next, miss patterns within parallel regions

  22. Classifying Misses Within Parallel Regions • L2 Misses (ignore) • These cannot be locality misses (inclusion enforced) • Read-based sharing • Line is read by multiple processors • Write-based sharing • Line is written (and possibly read) by multiple processors • Strided • Addresses of missing lines progress by a cross-CPU stride • Other (ignore) • No observable patterns; likely conflict and capacity misses caveats: there is overlap; priority order; sliding window
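A minimal sketch of how such a per-miss classification might be applied, assuming the categories are checked in the priority order listed above; the MissInfo fields stand in for the sliding-window history and are assumptions, not the authors' exact rules.

```cpp
#include <unordered_set>

// Hypothetical classifier: each parallel-region miss gets exactly one label,
// checked in priority order. History fields approximate a sliding window.
enum class MissClass { L2Miss, ReadSharing, WriteSharing, Strided, Other };

struct MissInfo {
    int  cpu;                            // CPU taking the miss
    bool missed_in_l2;                   // line absent from the L2 as well
    std::unordered_set<int> sharers;     // CPUs that touched the line within the window
    bool written_by_sharer;              // some sharer wrote the line
    bool matches_cross_cpu_stride;       // address fits a stride seen across CPUs
};

MissClass classify(const MissInfo& m) {
    if (m.missed_in_l2)                               return MissClass::L2Miss;      // ignored: cannot be a locality miss
    if (m.sharers.size() > 1 && !m.written_by_sharer) return MissClass::ReadSharing; // read-only sharing
    if (m.sharers.size() > 1 &&  m.written_by_sharer) return MissClass::WriteSharing;
    if (m.matches_cross_cpu_stride)                   return MissClass::Strided;
    return MissClass::Other;             // no pattern; likely conflict/capacity, ignored
}
```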

  23. Miss Patterns Observed (71.3%) investigate techniques targeting these three patterns

  24. Exploiting Read-Only Sharing Patterns • Read-only sharing misses dominate (53.7%) • Hence a given read miss predicts future read misses • i.e., other CPUs will likely read-miss that same line • Broadcasting for all read misses • Any read miss results in that line being pushed to all caches • Provided lines in speculative state are not evicted • Trivial to implement in CMP with bus interconnect • No extra traffic will such broadcasting result in cache pollution?
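A minimal sketch of the RB idea, assuming a snooping-bus CMP where refill data is already visible to every L1: each other cache captures the line unless accepting it would evict speculative state. The L1Cache interface here is an illustrative stand-in, not the simulator's API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical L1 interface; both methods are stubbed for this sketch.
struct L1Cache {
    // Would accepting this line force out a line holding speculative state?
    bool would_evict_speculative_line(uint64_t /*line_addr*/) const { return false; }
    // Fill path into the cache array (omitted).
    void fill(uint64_t /*line_addr*/, const uint8_t* /*data*/) {}
};

void service_read_miss_with_broadcast(uint64_t line_addr,
                                      const uint8_t* refill_data,
                                      int requesting_cpu,
                                      std::vector<L1Cache>& l1s) {
    // Normal refill into the requester.
    l1s[requesting_cpu].fill(line_addr, refill_data);

    // RB: push the same line into every other L1 that can take it, predicting
    // that those CPUs will soon read-miss on the same read-shared line.
    // On a bus, they simply capture the refill data, so no extra traffic.
    for (std::size_t cpu = 0; cpu < l1s.size(); ++cpu) {
        if (static_cast<int>(cpu) == requesting_cpu) continue;
        if (!l1s[cpu].would_evict_speculative_line(line_addr))
            l1s[cpu].fill(line_addr, refill_data);
    }
}
```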

  25. Impact of Broadcasting All Read Misses (RB) Data cache misses: 27.7% reduction; execution time: 7.3% speedup • Attempts to throttle broadcasting reduced benefits • Hence resulting cache pollution is limited simple broadcasting is effective

  26. Miss Patterns Observed (71.3%)

  27. Exploiting Write-Based Sharing Patterns • Note: caches extended for TLS are write-back • Modifications are not propagated before thread commits • Example: write-based sharing of a cache line • CPU0 writes then commits; then CPU1 reads • Read results in miss, read-request, write-back, then fill • Aggressive approach: • On commit, broadcast all modified lines • Too much traffic, too many superfluous copies • A more selective approach: • Predict lines involved in write-based sharing more general: predict stores involved in WB sharing

  28. Predicting Stores & Lines Involved in WB Sharing [diagram: three 8-entry structures: Recent Store Table (RST) of recent store PCs; Invalidation PC List (IPCL) of store PCs for lines that are written back; Push Required Buffer (PRB) of store PCs and extended tags (etags) for lines to push on commit; an extended tag (etag) is the address (tag, index, offset) plus an RST index] 8 entries each is sufficient

  29. Operation of Write-Based Sharing Technique On a store: • Add store PC to Recent Store Table (RST) • If store PC is in Invalidation PC List (IPCL): • Add store PC to Push Required Buffer (PRB) On a coherence request requiring writeback: • Use RST index to look up PC in RST, add PC to IPCL On commit: • For each extended tag in PRB: • Writeback, self-invalidate, push line to next cache simple case: next cache is in round-robin order
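A minimal sketch of these three events as a small software model, with the 8-entry RST and IPCL and a PRB of extended tags; sizes and field roles follow the previous slide, while the types and the push hook are illustrative assumptions rather than the authors' hardware.

```cpp
#include <cstdint>
#include <array>
#include <vector>

constexpr int kEntries = 8;   // per the slide, 8 entries each is sufficient

struct WBPredictor {
    struct PRBEntry { uint64_t store_pc; uint64_t etag; };

    std::array<uint64_t, kEntries> rst{};   // Recent Store Table: recent store PCs
    std::array<uint64_t, kEntries> ipcl{};  // Invalidation PC List: PCs whose lines were written back
    std::vector<PRBEntry> prb;              // Push Required Buffer: lines to push on commit
    int rst_next = 0, ipcl_next = 0;

    bool in_ipcl(uint64_t pc) const {
        for (uint64_t p : ipcl) if (p == pc) return true;
        return false;
    }

    // On a store: record the PC in the RST and return its RST index, which the
    // cache embeds in the line's extended tag (etag). If this PC previously
    // caused a writeback (it is in the IPCL), mark the line for pushing.
    int on_store(uint64_t store_pc, uint64_t line_etag) {
        int idx = rst_next;
        rst[idx] = store_pc;
        rst_next = (rst_next + 1) % kEntries;
        if (in_ipcl(store_pc)) prb.push_back({store_pc, line_etag});
        return idx;
    }

    // On a coherence request requiring writeback: the RST index stored in the
    // etag recovers the store PC that wrote the line; remember it in the IPCL.
    void on_writeback(int rst_index_from_etag) {
        ipcl[ipcl_next] = rst[rst_index_from_etag];
        ipcl_next = (ipcl_next + 1) % kEntries;
    }

    // On commit: for each marked line, write back, self-invalidate, and push
    // it to the next cache (round-robin in the simple case); hook is assumed.
    template <typename PushFn>
    void on_commit(PushFn push_line_to_next_cache) {
        for (const PRBEntry& e : prb) push_line_to_next_cache(e.etag);
        prb.clear();
    }
};
```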

  30. Impact of Write-Based Technique (WB) Data cache misses: 19.6% reduction; execution time: 7.8% speedup worth the cost of small additional hardware

  31. Miss Patterns Observed (71.3%)

  32. Exploiting Strided Miss Patterns • Hardware stride-prefetcher [Fu et al, Baer et al] • Each CPU has its own aggressive prefetcher • Fully associative, 512 entries: • PC, miss address, stride distance, state • Issue 16 prefetches when stride is recognized • Prefetches are throttled to avoid burst of traffic • Prefetch from L2 to private caches • To be fair, prefetches do not go beyond L2
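A minimal sketch of such a per-CPU stride prefetcher, keeping (PC, last miss address, stride, state) per entry and issuing 16 prefetches once a stride repeats; a hash map stands in for the 512-entry fully associative table, and the confirmation threshold is an assumption.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-CPU, PC-indexed stride prefetcher (sketch only).
class StridePrefetcher {
    struct Entry {
        uint64_t last_addr  = 0;
        int64_t  stride     = 0;
        int      confidence = 0;   // "state": how many times the stride repeated
    };
    std::unordered_map<uint64_t, Entry> table_;   // stand-in for 512 FA entries, keyed by PC
    static constexpr int kPrefetchDegree = 16;

public:
    // Called on each L1 data-cache miss; returns addresses to prefetch from the
    // L2 into the private cache (prefetches never go beyond the L2).
    std::vector<uint64_t> on_miss(uint64_t pc, uint64_t miss_addr) {
        std::vector<uint64_t> prefetches;
        Entry& e = table_[pc];
        int64_t stride = static_cast<int64_t>(miss_addr) - static_cast<int64_t>(e.last_addr);
        if (e.last_addr != 0 && stride != 0 && stride == e.stride) {
            // Assumed policy: require the stride to repeat before prefetching.
            if (++e.confidence >= 2) {
                for (int i = 1; i <= kPrefetchDegree; ++i)
                    prefetches.push_back(miss_addr + static_cast<uint64_t>(i * stride));
            }
        } else {
            e.stride = stride;
            e.confidence = 0;
        }
        e.last_addr = miss_addr;
        return prefetches;   // issued gradually (throttled) to avoid a burst of traffic
    }
};
```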

  33. Impact of Strided Prefetching (ST) Data cache misses: 10.3% reduction; execution time: no significant impact no good alone; complementary with other techniques?

  34. Combining Techniques: Parallel Region Perf. RB/WB/ST has fewest misses, but RB/WB performs best

  35. Overall Program Speedup RB/WB further improves program performance by 5.5%

  36. Impact of RB/WB on Scalability [scaling charts: Bzip2_comp, Vpr_place, Average (all benchmarks)] facilitates scaling

  37. Summary • Have a fixed processor for sequential regions • Exploiting read-only sharing patterns (RB): • Simple broadcasting for all load misses is effective • No significant cache pollution • Exploiting write-based sharing patterns (WB): • Write-back/self-invalidate/push technique is effective • Exploiting strided miss patterns (ST): • Extra traffic overwhelms benefit of reduced misses • RB/WB are complementary and perform best • And dramatically improve the scalability of TLS Improving cache locality is key for effective TLS

  38. Backups

  39. Ideal Caches

  40. Parallel Region Cache Miss Breakdown [chart categories: L2 Misses, Read-Based Sharing, Write-Based Sharing, Strided, Other]
