
Improving Cache Locality for Thread-Level Speculation Stanley Fung and J. Gregory Steffan


Presentation Transcript


  1. Improving Cache Locality for Thread-Level Speculation Stanley Fung and J. Gregory Steffan Electrical and Computer Engineering University of Toronto

  2. Chip Multiprocessors (CMPs) are Here! IBM Power 5, AMD Opteron, Intel Yonah. Use CMPs to improve sequential program performance?

  3. Exploiting CMPs: The Intuition CMPs have lots of distributed resources • Caches, branch predictors, processors Somehow distribute sequential programs • Use distributed resources to improve performance Increasingly aggressive approaches: • Prefetching (e.g., helper threads) • Transactions and transactional memory • Thread-Level Speculation (TLS) But distributing a sequential program is non-trivial…

  4. Exploiting CMPs: The Tension Sequential program locality vs. distributed CMP resources and parallelism [diagram: four processors, each with a private L1, sharing an L2] Our challenge: relaxing this tension

  5. Example: TLS Execution on 4 Processors [diagram: sequential execution on one active processor and L1 vs. TLS execution across four processors and L1s, compared by execution time] 4X total cache capacity → 4X cache performance?

  6. TLS on 4 CPU CMP: % Increase in Cache Misses 272.5% ~= 4X 4X total cache capacity → 4X increase in cache misses

  7. Opportunities for Improvement • Prefetching Effects • TLS indirectly prefetches from off-chip into L2 • Orthogonal to the focus of this work • “Locality Misses” • An L1 miss where the line is resident in another L1 • An indicator of both: • Broken locality • Opportunity to repair locality What fraction of misses are locality misses?

  8. TLS on 4 CPU CMP: % Locality Misses 61.1% significant locality misses: problem and opportunity

  9. Outline • Experimental Framework • Classification of Misses • Techniques for Reducing Misses • Combining Techniques • Impact on Scalability • Conclusion

  10.   Support for TLS Break programs into speculative threads • We use the compiler Track data dependences • We extend invalidation-based cache coherence Recover from failed speculation • We extend L1 data caches to buffer speculative state three key elements of every TLS system

  11. Compiler Support for TLS [flow diagram: Sequential Source Code → Region Selection (uses profile information; which loops?) → Transformation and Optimization (inserts TLS instructions) → MIPS Executable]

  12. Hardware Support for TLS [diagram: four processors with private L1s over a shared L2; each L1 cache line holds a tag, data, state, and speculative SL/SM bits] extend generic CMP's L1 caches and coherence
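To make the extended cache state concrete, here is a minimal C++ sketch of an L1 line augmented with speculative bits, assuming the SL/SM bits mark speculatively loaded and speculatively modified data as in coherence-based TLS schemes; the struct and function names are illustrative, not the authors' hardware.

```cpp
#include <cstdint>

// Hypothetical sketch of an L1 cache line extended for TLS, assuming SL/SM
// denote speculatively-loaded / speculatively-modified state. Illustrative only.
struct SpecLine {
    uint64_t tag   = 0;
    bool     valid = false;
    bool     dirty = false;
    bool     SL    = false;   // speculatively loaded by the current thread
    bool     SM    = false;   // speculatively modified (buffered, not yet committed)
    uint8_t  data[64] = {};
};

// An invalidation from a logically-earlier thread that hits a line we loaded
// speculatively signals a read-after-write dependence violation.
bool violates_dependence(const SpecLine& line, bool from_earlier_thread) {
    return from_earlier_thread && line.valid && line.SL;
}

// On commit, speculative state becomes ordinary dirty state; on a violation
// (squash), speculatively-modified lines are simply invalidated.
void commit(SpecLine& line) { if (line.SM) line.dirty = true;  line.SL = line.SM = false; }
void squash(SpecLine& line) { if (line.SM) line.valid = false; line.SL = line.SM = false; }
```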

  13. Experimental Framework • CMP with 4 CPUs (or more) • 4-way issue, out-of-order superscalar • Memory Hierarchy • Private L1 data caches: 32KB, 2-way • 2MB shared L2 cache • Bus interconnect • Not shown: results for crossbar interconnect • Benchmarks: SPEC INT 95 and 2000 • Speculatively parallelized

  14. TLS Cache Locality Problem: Our Investigation [investigation tree, root: Cache Locality Problem]

  15. TLS Cache Locality Problem: Our Investigation Cache Locality Problem: private cache architecture vs. shared cache architecture a shared cache solves locality problems (but slow)

  16. TLS Cache Locality Problem: Our Investigation Private cache architecture: data cache vs. instruction cache i-cache misses are insignificant; focus on d-cache

  17. TLS Cache Locality Problem: Our Investigation Data cache: parallel regions (miss patterns) and sequential regions (transitions)

  18. TLS Execution Stages and Transitions [timeline: sequential region → startup (little impact) → parallel region steady state (our main focus) → wind-down (has impact) → sequential region] wind-down transitions: scheduling the seq. region

  19. Scheduling the Sequential Region Floating sequential processor vs. fixed sequential processor (P0, P1, P2, P3); fixing the processor offers potential cache locality which is better?

  20. Performance of Fixed Relative to Floating Overall Program: 3.4% speedup fixed sequential processor is superior, at no cost

  21. TLS Cache Locality Problem: Our Investigation Data cache: parallel regions (miss patterns) and sequential regions (transitions); next, miss patterns within parallel regions

  22. Classifying Misses Within Parallel Regions • L2 Misses (ignore) • These cannot be locality misses (inclusion enforced) • Read-based sharing • Line is read by multiple processors • Write-based sharing • Line is written (and possibly read) by multiple processors • Strided • Addresses of missing lines progress by a cross-CPU stride • Other (ignore) • No observable patterns; likely conflict and capacity misses caveats: there is overlap; priority order; sliding window
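A minimal sketch of how such a per-miss classification might be applied, assuming the categories are checked in the priority order listed above; the MissInfo fields stand in for the sliding-window history and are assumptions, not the authors' exact rules.

```cpp
#include <unordered_set>

// Hypothetical classifier: each parallel-region miss gets exactly one label,
// checked in priority order. History fields approximate a sliding window.
enum class MissClass { L2Miss, ReadSharing, WriteSharing, Strided, Other };

struct MissInfo {
    int  cpu;                            // CPU taking the miss
    bool missed_in_l2;                   // line absent from the L2 as well
    std::unordered_set<int> sharers;     // CPUs that touched the line within the window
    bool written_by_sharer;              // some sharer wrote the line
    bool matches_cross_cpu_stride;       // address fits a stride seen across CPUs
};

MissClass classify(const MissInfo& m) {
    if (m.missed_in_l2)                               return MissClass::L2Miss;      // ignored: cannot be a locality miss
    if (m.sharers.size() > 1 && !m.written_by_sharer) return MissClass::ReadSharing; // read-only sharing
    if (m.sharers.size() > 1 &&  m.written_by_sharer) return MissClass::WriteSharing;
    if (m.matches_cross_cpu_stride)                   return MissClass::Strided;
    return MissClass::Other;             // no pattern; likely conflict/capacity, ignored
}
```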

  23. Miss Patterns Observed (71.3%) investigate techniques targeting these three patterns

  24. Exploiting Read-Only Sharing Patterns • Read-only sharing misses dominate (53.7%) • Hence a given read miss predicts future read misses • i.e., other CPUs will likely read-miss that same line • Broadcasting for all read misses • Any read miss results in that line being pushed to all caches • Provided lines in speculative state are not evicted • Trivial to implement in CMP with bus interconnect • No extra traffic will such broadcasting result in cache pollution?
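A minimal sketch of the RB idea, assuming a snooping-bus CMP where refill data is already visible to every L1: each other cache captures the line unless accepting it would evict speculative state. The L1Cache interface here is an illustrative stand-in, not the simulator's API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical L1 interface; both methods are stubbed for this sketch.
struct L1Cache {
    // Would accepting this line force out a line holding speculative state?
    bool would_evict_speculative_line(uint64_t /*line_addr*/) const { return false; }
    // Fill path into the cache array (omitted).
    void fill(uint64_t /*line_addr*/, const uint8_t* /*data*/) {}
};

void service_read_miss_with_broadcast(uint64_t line_addr,
                                      const uint8_t* refill_data,
                                      int requesting_cpu,
                                      std::vector<L1Cache>& l1s) {
    // Normal refill into the requester.
    l1s[requesting_cpu].fill(line_addr, refill_data);

    // RB: push the same line into every other L1 that can take it, predicting
    // that those CPUs will soon read-miss on the same read-shared line.
    // On a bus, they simply capture the refill data, so no extra traffic.
    for (std::size_t cpu = 0; cpu < l1s.size(); ++cpu) {
        if (static_cast<int>(cpu) == requesting_cpu) continue;
        if (!l1s[cpu].would_evict_speculative_line(line_addr))
            l1s[cpu].fill(line_addr, refill_data);
    }
}
```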

  25. Impact of Broadcasting All Read Misses (RB) Data cache misses: 27.7% reduction; execution time: 7.3% speedup • Attempts to throttle broadcasting reduced benefits • Hence resulting cache pollution is limited simple broadcasting is effective

  26. Miss Patterns Observed (71.3%)

  27. Exploiting Write-Based Sharing Patterns • Note: caches extended for TLS are write-back • Modifications are not propagated before thread commits • Example: write-based sharing of a cache line • CPU0 writes then commits; then CPU1 reads • Read results in miss, read-request, write-back, then fill • Aggressive approach: • On commit, broadcast all modified lines • Too much traffic, too many superfluous copies • A more selective approach: • Predict lines involved in write-based sharing more general: predict stores involved in WB sharing

  28. Predicting Stores & Lines Involved in WB Sharing [diagram: three 8-entry structures: Recent Store Table (RST) of recent store PCs; Invalidation PC List (IPCL) of store PCs for lines that are written back; Push Required Buffer (PRB) of store PCs and extended tags (etags) for lines to push on commit; an extended tag (etag) is the address (tag, index, offset) plus an RST index] 8 entries each is sufficient

  29. Operation of Write-Based Sharing Technique On a store: • Add store PC to Recent Store Table (RST) • If store PC is in Invalidation PC List (IPCL): • Add store PC to Push Required Buffer (PRB) On a coherence request requiring writeback: • Use RST index to look up PC in RST, add PC to IPCL On commit: • For each extended tag in PRB: • Writeback, self-invalidate, push line to next cache simple case: next cache is in round-robin order
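A minimal sketch of these three events as a small software model, with the 8-entry RST and IPCL and a PRB of extended tags; sizes and field roles follow the previous slide, while the types and the push hook are illustrative assumptions rather than the authors' hardware.

```cpp
#include <cstdint>
#include <array>
#include <vector>

constexpr int kEntries = 8;   // per the slide, 8 entries each is sufficient

struct WBPredictor {
    struct PRBEntry { uint64_t store_pc; uint64_t etag; };

    std::array<uint64_t, kEntries> rst{};   // Recent Store Table: recent store PCs
    std::array<uint64_t, kEntries> ipcl{};  // Invalidation PC List: PCs whose lines were written back
    std::vector<PRBEntry> prb;              // Push Required Buffer: lines to push on commit
    int rst_next = 0, ipcl_next = 0;

    bool in_ipcl(uint64_t pc) const {
        for (uint64_t p : ipcl) if (p == pc) return true;
        return false;
    }

    // On a store: record the PC in the RST and return its RST index, which the
    // cache embeds in the line's extended tag (etag). If this PC previously
    // caused a writeback (it is in the IPCL), mark the line for pushing.
    int on_store(uint64_t store_pc, uint64_t line_etag) {
        int idx = rst_next;
        rst[idx] = store_pc;
        rst_next = (rst_next + 1) % kEntries;
        if (in_ipcl(store_pc)) prb.push_back({store_pc, line_etag});
        return idx;
    }

    // On a coherence request requiring writeback: the RST index stored in the
    // etag recovers the store PC that wrote the line; remember it in the IPCL.
    void on_writeback(int rst_index_from_etag) {
        ipcl[ipcl_next] = rst[rst_index_from_etag];
        ipcl_next = (ipcl_next + 1) % kEntries;
    }

    // On commit: for each marked line, write back, self-invalidate, and push
    // it to the next cache (round-robin in the simple case); hook is assumed.
    template <typename PushFn>
    void on_commit(PushFn push_line_to_next_cache) {
        for (const PRBEntry& e : prb) push_line_to_next_cache(e.etag);
        prb.clear();
    }
};
```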

  30. Impact of Write-Based Technique (WB) Data cache misses: 19.6% reduction; execution time: 7.8% speedup worth the cost of small additional hardware

  31. Miss Patterns Observed (71.3%)

  32. Exploiting Strided Miss Patterns • Hardware stride-prefetcher [Fu et al, Baer et al] • Each CPU has its own aggressive prefetcher • Fully associative, 512 entries: • PC, miss address, stride distance, state • Issue 16 prefetches when stride is recognized • Prefetches are throttled to avoid burst of traffic • Prefetch from L2 to private caches • To be fair, prefetches do not go beyond L2
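A minimal sketch of such a per-CPU stride prefetcher, keeping (PC, last miss address, stride, state) per entry and issuing 16 prefetches once a stride repeats; a hash map stands in for the 512-entry fully associative table, and the confirmation threshold is an assumption.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-CPU, PC-indexed stride prefetcher (sketch only).
class StridePrefetcher {
    struct Entry {
        uint64_t last_addr  = 0;
        int64_t  stride     = 0;
        int      confidence = 0;   // "state": how many times the stride repeated
    };
    std::unordered_map<uint64_t, Entry> table_;   // stand-in for 512 FA entries, keyed by PC
    static constexpr int kPrefetchDegree = 16;

public:
    // Called on each L1 data-cache miss; returns addresses to prefetch from the
    // L2 into the private cache (prefetches never go beyond the L2).
    std::vector<uint64_t> on_miss(uint64_t pc, uint64_t miss_addr) {
        std::vector<uint64_t> prefetches;
        Entry& e = table_[pc];
        int64_t stride = static_cast<int64_t>(miss_addr) - static_cast<int64_t>(e.last_addr);
        if (e.last_addr != 0 && stride != 0 && stride == e.stride) {
            // Assumed policy: require the stride to repeat before prefetching.
            if (++e.confidence >= 2) {
                for (int i = 1; i <= kPrefetchDegree; ++i)
                    prefetches.push_back(miss_addr + static_cast<uint64_t>(i * stride));
            }
        } else {
            e.stride = stride;
            e.confidence = 0;
        }
        e.last_addr = miss_addr;
        return prefetches;   // issued gradually (throttled) to avoid a burst of traffic
    }
};
```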

  33. Impact of Strided Prefetching (ST) Data cache misses: 10.3% reduction; execution time: no significant impact no good alone; complementary with other techniques?

  34. Combining Techniques: Parallel Region Perf. RB/WB/ST has fewest misses, but RB/WB performs best

  35. Overall Program Speedup RB/WB further improves program performance by 5.5%

  36. Impact of RB/WB on Scalability [scaling charts: Bzip2_comp, Vpr_place, Average (all benchmarks)] facilitates scaling

  37. Summary • Have a fixed processor for sequential regions • Exploiting read-only sharing patterns (RB): • Simple broadcasting for all load misses is effective • No significant cache pollution • Exploiting write-based sharing patterns (WB): • Write-back/self-invalidate/push technique is effective • Exploiting strided miss patterns (ST): • Extra traffic overwhelms benefit of reduced misses • RB/WB are complementary and perform best • And dramatically improve the scalability of TLS Improving cache locality is key for effective TLS

  38. Backups

  39. Ideal Caches

  40. Parallel Region Cache Miss Breakdown [chart categories: L2 Misses, Read-Based Sharing, Write-Based Sharing, Strided, Other]
