This lecture covers hardware-based data prefetching techniques for reducing memory hierarchy bottlenecks in high-performance processors. It surveys prefetching mechanisms, memory dependences, latency tolerance, and data access patterns, and weighs the advantages and drawbacks of software prefetching. The slides walk through basic reference prediction, lookahead prediction, and correlated reference prediction, along with their implementation details and benchmark characteristics. The results show that lookahead prediction is the most cost-effective method, with the lookahead distance set slightly beyond the memory latency. Prefetching features in modern processors such as the Pentium 4, Alpha 21364, and Ultra Sparc III are also highlighted, and the deck closes with a preview of next week's paper on Runahead Execution, an alternative strategy for out-of-order processors.
CS 7810 Lecture 9 Effective Hardware-Based Data Prefetching for High-Performance Processors T-F. Chen and J-L. Baer IEEE Transactions on Computers, 44(5) May 1995
Memory Hierarchy Bottlenecks
• Caching strategies: victim cache, replacement policies, temporal/spatial caches
• Prefetching: stream buffers, strided predictors, pointer-chasing
• Memory dependences: store barrier cache, store sets
• Latency tolerance: out-of-order execution, runahead
Data Access Patterns
• Scalar: simple variable references
• Zero stride: constant array index throughout a loop
• Constant stride: index is a linear function of the loop count
• Irregular: none of the above (non-linear index, pointer chasing, etc.)
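The categories above can be recognized from the address stream alone. As a minimal sketch (the `classify` helper is assumed for illustration, not from the paper), note that scalar and zero-stride references are indistinguishable by address, since both touch the same location every time:

```python
def classify(addresses):
    """Classify an address stream into the slide's access-pattern categories."""
    if len(set(addresses)) == 1:
        # A scalar variable and a zero-stride array reference both
        # produce the same address on every access.
        return "scalar or zero stride"
    strides = [b - a for a, b in zip(addresses, addresses[1:])]
    if len(set(strides)) == 1:
        return "constant stride"
    return "irregular"
```

A hardware predictor faces the same distinction online, one access at a time, which is what the Reference Prediction Table's state machine implements.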
Prefetch Overheads
• A regular access can be delayed
• Increased contention on buses and L1/L2/memory ports
• Before initiating the prefetch, the L1 tags have to be examined
• Cache pollution and more misses
• Software prefetching increases instruction count
Software Prefetching
• Pros: reduces hardware overhead, can avoid the first miss (software pipelining), can handle complex address equations
• Cons: code bloat, can only handle addresses that are easy to compute, control flow is a problem, unpredictable latencies
Basic Reference Prediction
• For each PC, detect and store a stride and the last fetched address
• Every fetch initiates the next prefetch
• If the stride changes, remain in transient states until a regular stride is observed
• Prefetches are issued in every state except no-pred
Basic Reference Prediction
RPT entry fields: PC tag, prev_addr, stride, state. Prefetch addresses are checked against the L1 tags and tracked in an Outstanding Request List.
State machine: init goes to steady on a correct prediction and to transient on an incorrect one (update stride); steady stays in steady on correct and falls back to init on incorrect; transient goes to steady on correct and to no-pred on incorrect (update stride); no-pred goes to transient on correct and stays in no-pred on incorrect (update stride).
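The state machine can be sketched directly from the diagram; this is a toy model of a single RPT entry (class and method names assumed), returning the prefetch address, or None when the entry is in no-pred:

```python
class RPTEntry:
    """One Reference Prediction Table entry (fields follow the slide)."""
    def __init__(self, addr):
        self.prev_addr = addr
        self.stride = 0
        self.state = "init"

    def access(self, addr):
        correct = (addr == self.prev_addr + self.stride)
        # The "(update stride)" arcs: every incorrect transition except
        # the steady -> init fallback relearns the stride.
        if not correct and self.state != "steady":
            self.stride = addr - self.prev_addr
        if self.state == "init":
            self.state = "steady" if correct else "transient"
        elif self.state == "steady":
            self.state = "steady" if correct else "init"
        elif self.state == "transient":
            self.state = "steady" if correct else "no-pred"
        else:  # no-pred
            self.state = "transient" if correct else "no-pred"
        self.prev_addr = addr
        # Prefetch in every state except no-pred.
        return None if self.state == "no-pred" else addr + self.stride
```

For a stream 100, 104, 108, … the entry learns stride 4 on the first access, reaches steady on the second, and from then on prefetches one iteration ahead.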
Shortcomings
• Basic technique only prefetches one iteration ahead – lookahead prediction addresses this
• Mispredictions at the end of every inner loop
Lookahead Prediction
• Note: these are in-order processors
• Fetch stalls when instructions stall – but continue to increment the PC
• The Lookahead PC (LA-PC) accesses the branch predictor and BTB to make forward progress
• The LA-PC indexes into the RPT and can be up to d instructions ahead
• Note: additional bpred and BTB ports are required
Lookahead Prediction
Diagram: the LA-PC, steered by the branch predictor, looks up the RPT; the regular PC updates the RPT, and the two are decoupled through the Outstanding Request List (ORL). A new field in each RPT entry keeps track of how many steps ahead the LA-PC is.
Correlated Reference Prediction

for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)

Access pattern for B: (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)
Access pattern for W:
(1)
(2) (2)
(3) (3) (3)
(4) (4) (4) (4) (4)
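B's index sequence above is a direct transcription of the nested loop; a one-line sketch (helper name assumed) reproduces it:

```python
def b_accesses(N):
    """Indices of B touched by: for i = 1 to N, for k = 0 to i."""
    return [(i, k) for i in range(1, N + 1) for k in range(0, i + 1)]
```

Note the triangular shape: each outer iteration accesses one more element of B than the last, so a fixed per-PC stride cannot describe the whole stream.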
Correlated Reference Prediction
• Inner-loop predictions work well, but the first inner-loop prediction always fails
• There is a correlation between the branch outcomes and the reference patterns

for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)

Access pattern for B, with the global branch history at each access:
(1,0) 1  (1,1) 101  (2,0) 1011  (2,1) 10111  (2,2) 1011101  (3,0) 10111011  (3,1) 101110111  (3,2) 1011101111  (3,3) 101110111101  (4,0) 1011101111011  (4,1) 10111011110111  (4,2) …
Implementation
• Each PC keeps track of multiple access patterns (prev_addr and stride)
• The branch history determines which pattern is relevant (history 01 refers to an outer-loop access)
• Other branches in the loop can mess up the history – use the compiler to mark loop-termination branches?
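As a hedged sketch of that implementation idea (class and field names assumed, details simplified): each PC's entry keeps one (prev_addr, stride) pair per recent branch-history pattern, so the outer-loop access learns its own stride instead of corrupting the inner-loop one:

```python
class CorrelatedEntry:
    """Toy correlated predictor: one stride per branch-history pattern."""
    def __init__(self, history_bits=2):
        self.history_bits = history_bits
        self.patterns = {}  # recent branch history -> (prev_addr, stride)

    def access(self, history, addr):
        key = history[-self.history_bits:]  # select pattern by history
        if key in self.patterns:
            prev_addr, _ = self.patterns[key]
            self.patterns[key] = (addr, addr - prev_addr)
        else:
            self.patterns[key] = (addr, 0)
        _, stride = self.patterns[key]
        return addr + stride  # predicted next address for this pattern
```

Accesses seen under inner-loop history and outer-loop history (e.g. 01) now train separate strides, which is what removes the miss at the start of each inner loop.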
Results Summary
• Lookahead is the most cost-effective technique
• The RPT needs 512 entries (4KB capacity)
• The lookahead distance should be a little more than the memory latency
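The last point is just timing arithmetic: a prefetch hides the miss only if it is issued at least a full memory latency before the data is used. With illustrative numbers (assumed, not from the paper):

```python
import math

def min_lookahead_iterations(mem_latency_cycles, cycles_per_iteration):
    # The prefetch for iteration i must be issued enough iterations
    # early that the intervening work covers the memory latency.
    return math.ceil(mem_latency_cycles / cycles_per_iteration)

# e.g. a 100-cycle memory latency and a 12-cycle loop body mean the
# LA-PC should run at least 9 iterations ahead.
```

Running slightly further ahead than this minimum leaves slack for contention, which is why the slide says "a little more than" the memory latency.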
Modern Processors
• Pentium 4
  • I-cache: uses the bpred and BTB to stay ahead of current execution
  • L2 cache: attempts to stay 256 bytes ahead; monitors history for multiple streams; minimizes fetch of unwanted data (no h/w prefetch in the PIII)
• Alpha 21364 has 16-entry victim caches in the L1D and L2D and a 16-entry L1I stream buffer
• Ultra Sparc III has a 2KB prefetch cache
Next Week's Paper
• "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors", O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt, Proceedings of HPCA-9, February 2003
• Useful execution while waiting for a cache miss – perhaps prefetch the next miss?