CS 7810 Lecture 9
Effective Hardware-Based Data Prefetching for High-Performance Processors
T-F. Chen and J-L. Baer, IEEE Transactions on Computers, 44(5), May 1995
Memory Hierarchy Bottlenecks
• Caching strategies: victim cache, replacement policies, temporal/spatial caches
• Prefetching: stream buffers, strided predictors, pointer-chasing
• Memory dependences: store barrier cache, store sets
• Latency tolerance: out-of-order execution, runahead
Data Access Patterns
• Scalar: simple variable references
• Zero stride: constant array index throughout a loop
• Constant stride: index is a linear function of the loop count
• Irregular: none of the above (non-linear index, pointer chasing, etc.)
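The categories above can be told apart, roughly, by looking at the differences between successive addresses issued by one load. The sketch below is only an illustration: the function name `classify` and the category labels are assumptions, and a scalar reference and a zero-stride array reference look identical from the address stream alone, so they share one label.

```python
def classify(addresses):
    """Classify the byte addresses issued by a single load instruction."""
    # Differences between consecutive addresses are the observed strides.
    strides = [b - a for a, b in zip(addresses, addresses[1:])]
    if not strides:
        return "unknown"                  # fewer than two accesses seen
    if all(s == 0 for s in strides):
        return "scalar or zero stride"    # same address every time
    if all(s == strides[0] for s in strides):
        return "constant stride"          # address is linear in the loop count
    return "irregular"                    # non-linear index, pointer chasing, ...


print(classify([100, 100, 100]))
print(classify([100, 108, 116]))
print(classify([100, 108, 132]))
```

Only the constant-stride case gives a stride predictor something useful to prefetch; the first case hits in the cache anyway and the last one defeats it.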
Prefetch Overheads
• A regular access can be delayed
• Increased contention on buses and L1/L2/memory ports
• Before initiating the prefetch, the L1 tags have to be examined
• Cache pollution and more misses
• Software prefetching increases instruction count
Software Prefetching
• Pros: reduces hardware overhead, can avoid the first miss (software pipelining), can handle complex address equations
• Cons: code bloat, can only handle addresses that are easy to compute, control flow is a problem, unpredictable latencies
Basic Reference Prediction
• For each PC, detect and store a stride and the last fetched address
• Every fetch initiates the next prefetch
• If the stride changes, remain in transient states until a regular stride is observed
• Prefetches are issued in every state except no-pred
Basic Reference Prediction
• [Figure: Reference Prediction Table (RPT) entry with fields PC tag, prev_addr, stride, state; the generated prefetch address goes to the L1 tags and the Outstanding Request List]
• [Figure: state diagram over the states init, transient, steady, no-pred, with edges labeled correct, incorrect, and incorrect (update stride)]
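A minimal sketch of the per-PC state machine may help here. The edge assignments below follow one reading of the state diagram in Chen and Baer's paper (the slide's figure is not recoverable in this transcript), so treat them as an assumption. The class names `Entry` and `RPT`, the unbounded dictionary standing in for a finite tagged table, and the rule that suppresses zero-stride prefetches (a stand-in for the L1 tag and Outstanding Request List checks) are all illustrative choices, not the authors' design.

```python
from dataclasses import dataclass

INIT, TRANSIENT, STEADY, NO_PRED = "init", "transient", "steady", "no-pred"


@dataclass
class Entry:
    prev_addr: int
    stride: int = 0
    state: str = INIT


class RPT:
    """Toy reference prediction table keyed by load/store PC (assumed unbounded)."""

    def __init__(self):
        self.table = {}

    def access(self, pc, addr):
        """Record one memory access; return a prefetch address or None."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this PC is seen: allocate in init with stride 0.
            self.table[pc] = Entry(prev_addr=addr)
            return None
        correct = addr == entry.prev_addr + entry.stride
        new_stride = addr - entry.prev_addr
        if entry.state == INIT:
            if correct:
                entry.state = STEADY
            else:
                entry.stride, entry.state = new_stride, TRANSIENT
        elif entry.state == STEADY:
            if not correct:
                entry.state = INIT          # stride is kept, not updated
        elif entry.state == TRANSIENT:
            if correct:
                entry.state = STEADY
            else:
                entry.stride, entry.state = new_stride, NO_PRED
        else:  # NO_PRED
            if correct:
                entry.state = TRANSIENT
            else:
                entry.stride = new_stride   # stay in no-pred
        entry.prev_addr = addr
        # Assumption: a zero stride targets the block just fetched, so skip it.
        if entry.state == NO_PRED or entry.stride == 0:
            return None
        return addr + entry.stride


rpt = RPT()
print([rpt.access(0x40, a) for a in (100, 104, 108, 112)])
```

With a regular stride of 4 the entry reaches steady after three accesses and each access prefetches the next address. A single wrong prediction in steady only drops the entry back to init, which is why the end of an inner loop costs one misprediction rather than a full retraining.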
Shortcomings
• Basic technique prefetches only one iteration ahead (motivates the lookahead predictor)
• Mispredictions at the end of every inner loop
Lookahead Prediction
• Note: these are in-order processors
• Fetch stalls when instructions stall, but a second PC can continue to increment
• The Lookahead PC (LA-PC) accesses the branch predictor and BTB to make forward progress
• The LA-PC indexes into the RPT and can be up to d instructions ahead
• Note the additional bpred and BTB ports
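Lookahead Prediction
• [Figure: the LA-PC, steered by the branch predictor, looks up the RPT and sends prefetches to the ORL; the PC updates the RPT; the LA-PC and PC are decoupled]
• A new field in each RPT entry keeps track of how many steps ahead the LA-PC is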
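One way to picture the extra bookkeeping is sketched below. It assumes the steps-ahead count simply scales the stride when the LA-PC reaches a load, and that the count drops by one when the real PC executes that load. The names `LAEntry`, `times`, `la_lookup`, and `pc_update` are hypothetical, and the state machine and stride correction are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class LAEntry:
    prev_addr: int      # last address actually accessed by the real PC
    stride: int         # stride learned for this load
    times: int = 0      # how many iterations the LA-PC is ahead of the PC


def la_lookup(entry):
    """LA-PC reaches this load: move one iteration further ahead and prefetch."""
    entry.times += 1
    return entry.prev_addr + entry.stride * entry.times


def pc_update(entry, addr):
    """The real PC executes the load: it catches up by one iteration."""
    entry.prev_addr = addr
    if entry.times > 0:
        entry.times -= 1


e = LAEntry(prev_addr=1000, stride=8)
ahead = [la_lookup(e) for _ in range(3)]   # LA-PC runs three iterations ahead
pc_update(e, 1008)                         # PC executes the first of them
print(ahead, la_lookup(e))
```

In this sketch the prefetch issued after the PC catches up continues the same address sequence, because the smaller gap is offset by the newer `prev_addr`.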
Correlated Reference Prediction
    for i = 1 to N
      for k = 0 to i
        W(i) = W(i) + B(i,k) * W(i-k)
• Access pattern for B:
    (1,0) (1,1)
    (2,0) (2,1) (2,2)
    (3,0) (3,1) (3,2) (3,3)
    (4,0) (4,1) (4,2) (4,3) (4,4)
• Access pattern for W(i):
    (1) (1)
    (2) (2) (2)
    (3) (3) (3) (3)
    (4) (4) (4) (4) (4)
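To see what a plain stride predictor observes for B in this loop, the sketch below lists the address differences between consecutive accesses. The storage layout is an assumption: row-major, with a hypothetical row length of 16 elements of 8 bytes each; the function name `b_trace` is also illustrative.

```python
def b_trace(n, row_len=16, elem_size=8):
    """(i, k, byte offset) for each access to B(i,k), assuming row-major storage."""
    return [(i, k, (i * row_len + k) * elem_size)
            for i in range(1, n + 1)
            for k in range(i + 1)]


trace = b_trace(3)
addrs = [addr for _, _, addr in trace]

# Strides seen by a single-pattern predictor, access after access.
strides = [b - a for a, b in zip(addrs, addrs[1:])]

# Strides between the first accesses of successive inner loops only.
firsts = [addr for _, k, addr in trace if k == 0]
first_strides = [b - a for a, b in zip(firsts, firsts[1:])]

print(strides)        # [8, 120, 8, 8, 112, 8, 8, 8]
print(first_strides)  # [128, 128]
```

Under these assumptions the inner-loop stride is constant, the jump into each new inner loop is different every time, and yet the first accesses of successive inner loops are themselves a constant stride apart. That second regularity is what a per-loop-level pattern can exploit.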
Correlated Reference Prediction
• Inner-loop predictions work well, but the first inner-loop prediction always fails
• There is a correlation between the branch outcomes and the reference patterns
    for i = 1 to N
      for k = 0 to i
        W(i) = W(i) + B(i,k) * W(i-k)
• Access pattern for B, with the accumulated branch history at each access:
    (1,0)  1
    (1,1)  101
    (2,0)  1011
    (2,1)  10111
    (2,2)  1011101
    (3,0)  10111011
    (3,1)  101110111
    (3,2)  1011101111
    (3,3)  101110111101
    (4,0)  1011101111011
    (4,1)  10111011110111
    (4,2)  …
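The history strings above can be regenerated with a short script, which also makes the convention explicit. The sketch assumes 1 means a loop branch is taken (the loop continues) and 0 means it falls through, and that each access is listed with the outcomes of the branches that follow it; the function name `b_access_history` is hypothetical.

```python
def b_access_history(n):
    """List ((i, k), history) for each access to B, with history as a bit string."""
    history = ""
    rows = []
    for i in range(1, n + 1):
        for k in range(i + 1):
            if k < i:
                history += "1"                     # inner-loop branch taken
            else:
                history += "0"                     # inner loop exits
                history += "1" if i < n else "0"   # outer-loop branch
            rows.append(((i, k), history))
    return rows


for access, history in b_access_history(5)[:11]:
    print(access, history)
```

Under this convention, every access that starts a new inner loop is preceded by a history ending in 01 (inner loop exited, outer loop taken), which is the signature a correlated predictor can use to select the outer-loop pattern.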
Implementation
• Each PC keeps track of multiple access patterns (prev_addr and stride)
• The branch history determines which patterns are relevant (01 refers to an outer-loop access)
• Other branches in the loop can mess up the history; use the compiler to mark loop-termination branches?
Results Summary
• Lookahead is the most cost-effective technique
• RPT needs 512 entries (4KB capacity)
• Lookahead distance should be a little more than the memory latency
Modern Processors
• Pentium 4
  • I-cache: uses the bpred and BTB to stay ahead of current execution
  • L2 cache: attempts to stay 256 bytes ahead; monitors history for multiple streams; minimizes fetch of unwanted data (no h/w prefetch in the Pentium III)
• Alpha 21364 has 16-entry victim caches in L1D and L2D and a 16-entry L1I stream buffer
• UltraSPARC III has a 2KB prefetch cache
Next Week’s Paper
• “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors”, O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt, Proceedings of HPCA-9, February 2003
• Useful execution while waiting for a cache miss: perhaps prefetch the next miss?