
CS 7810 Lecture 9



  1. CS 7810 Lecture 9: Effective Hardware-Based Data Prefetching for High-Performance Processors. T.-F. Chen and J.-L. Baer, IEEE Transactions on Computers, 44(5), May 1995

  2. Memory Hierarchy Bottlenecks
  • Caching strategies: victim cache, replacement policies, temporal/spatial caches
  • Prefetching: stream buffers, strided predictors, pointer-chasing
  • Memory dependences: store barrier cache, store sets
  • Latency tolerance: out-of-order execution, runahead

  3. Data Access Patterns
  • Scalar: simple variable references
  • Zero stride: constant array index throughout a loop
  • Constant stride: index is a linear function of the loop count
  • Irregular: none of the above (non-linear index, pointer chasing, etc.)
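
  To make the taxonomy concrete, a minimal C sketch of the four classes (the array, list, and variable names are invented for illustration):

      #include <stddef.h>

      struct node { int val; struct node *next; };

      int classify_examples(const int *A, size_t N, size_t j, const struct node *head) {
          int s = 0, x = 1, sum = 0;
          sum = sum + x;                             /* scalar: plain variable reference */
          for (size_t i = 0; i < N; i++)
              s += A[j];                             /* zero stride: same element every iteration */
          for (size_t i = 0; 4*i + 2 < N; i++)
              s += A[4*i + 2];                       /* constant stride: index linear in i */
          for (const struct node *p = head; p; p = p->next)
              s += p->val;                           /* irregular: pointer chasing */
          return s + sum;
      }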

  4. Prefetch Overheads
  • A regular access can be delayed
  • Increased contention on buses and L1/L2/memory ports
  • Before initiating the prefetch, the L1 tags have to be examined
  • Cache pollution and more misses
  • Software prefetching increases instruction count

  5. Software Prefetching
  • Pros: reduces hardware overhead, can avoid the first miss (software pipelining), can handle complex address equations
  • Cons: code bloat, can only handle addresses that are easy to compute, control flow is a problem, unpredictable latencies
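
  As a concrete compiler-level example (not from the paper), GCC and Clang expose software prefetching through the __builtin_prefetch intrinsic; the 16-element prefetch distance below is an invented placeholder that would be tuned to the memory latency:

      #include <stddef.h>

      long sum_with_prefetch(const long *A, size_t N) {
          long sum = 0;
          for (size_t i = 0; i < N; i++) {
              /* Prefetch 16 iterations ahead: (address, 0 = read,
                 3 = high temporal locality). */
              if (i + 16 < N)
                  __builtin_prefetch(&A[i + 16], 0, 3);
              sum += A[i];
          }
          return sum;
      }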

  6. Basic Reference Prediction
  • For each load PC, detect and store a stride and the last fetched address
  • Every fetch initiates the next prefetch
  • If the stride changes, remain in transient states until a regular stride is observed
  • Prefetches are issued in every state except no-pred

  7. Basic Reference Prediction
  [Figure: a Reference Prediction Table (RPT) entry holds a PC tag, prev_addr, stride, and state; the predicted prefetch address is checked against the L1 tags and enqueued on the Outstanding Request List (ORL). State machine: init goes to steady on a correct prediction, or to trans (updating the stride) on an incorrect one; steady stays on correct and falls back to init on incorrect; trans goes to steady on correct, or to no-pred (updating the stride) on incorrect; no-pred returns to trans on correct and stays in no-pred (updating the stride) on incorrect.]
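
  A minimal C sketch of one RPT entry update under this state machine (the field names, enum, and issue_prefetch() helper are my own, not the paper's notation):

      #include <stdint.h>

      typedef enum { INIT, TRANS, STEADY, NOPRED } RptState;

      typedef struct {
          uint64_t tag;        /* load PC */
          uint64_t prev_addr;  /* last address fetched by this PC */
          int64_t  stride;     /* last observed stride */
          RptState state;
      } RptEntry;

      void issue_prefetch(uint64_t addr);  /* checks L1 tags, enqueues on the ORL */

      void rpt_update(RptEntry *e, uint64_t addr) {
          int correct = (addr == e->prev_addr + (uint64_t)e->stride);
          switch (e->state) {
          case INIT:
              if (correct) e->state = STEADY;
              else { e->stride = (int64_t)(addr - e->prev_addr); e->state = TRANS; }
              break;
          case STEADY:
              e->state = correct ? STEADY : INIT;    /* stride kept on a miss */
              break;
          case TRANS:
              if (correct) e->state = STEADY;
              else { e->stride = (int64_t)(addr - e->prev_addr); e->state = NOPRED; }
              break;
          case NOPRED:
              if (correct) e->state = TRANS;
              else e->stride = (int64_t)(addr - e->prev_addr);  /* stay in no-pred */
              break;
          }
          e->prev_addr = addr;
          if (e->state != NOPRED)                    /* every fetch triggers the next prefetch */
              issue_prefetch(addr + (uint64_t)e->stride);
      }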

  8. Shortcomings
  • The basic technique only prefetches one iteration ahead – addressed by the lookahead predictor
  • Mispredictions at the end of every inner loop

  9. Lookahead Prediction
  • Note: these are in-order processors
  • Fetch stalls when instructions stall – but the PC continues to increment
  • The lookahead PC (LA-PC) accesses the branch predictor and BTB to make forward progress
  • The LA-PC indexes into the RPT and can be up to d instructions ahead
  • Note the additional bpred and BTB ports

  10. Lookahead Prediction
  [Figure: the LA-PC, decoupled from the regular PC, looks up the RPT and the branch predictor; the regular PC updates the ORL. A new field in the RPT keeps track of how many steps ahead the LA-PC is.]
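
  A rough C sketch of the lookahead loop, reusing the RptEntry above; bpred_next(), rpt_lookup(), fetch_stalled(), and the lookahead limit d are hypothetical names for hardware structures:

      extern uint64_t bpred_next(uint64_t pc);   /* branch predictor + BTB walk */
      extern RptEntry *rpt_lookup(uint64_t pc);
      extern int fetch_stalled(void);

      void lookahead(uint64_t la_pc, int d) {
          int steps = 0;                 /* the new RPT-side counter: how far ahead LA-PC is */
          while (fetch_stalled() && steps < d) {
              la_pc = bpred_next(la_pc); /* LA-PC makes forward progress on its own */
              RptEntry *e = rpt_lookup(la_pc);
              if (e && e->state != NOPRED)
                  issue_prefetch(e->prev_addr + (uint64_t)e->stride);
              steps++;
          }
      }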

  11. Correlated Reference Prediction

  for i = 1 to N
    for k = 0 to i
      W(i) = W(i) + B(i,k) * W(i-k)

  Access pattern for B: (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)
  Access pattern for W: (1) (2) (2) (3) (3) (3) (4) (4) (4) (4) (4)

  12. Correlated Reference Prediction
  • Inner-loop predictions work well, but the first inner-loop prediction always fails
  • There is a correlation between the branch outcomes and the reference patterns

  for i = 1 to N
    for k = 0 to i
      W(i) = W(i) + B(i,k) * W(i-k)

  Branch history at each access to B:
  (1,0) 1   (1,1) 101   (2,0) 1011   (2,1) 10111   (2,2) 1011101
  (3,0) 10111011   (3,1) 101110111   (3,2) 1011101111   (3,3) 101110111101
  (4,0) 1011101111011   (4,1) 10111011110111   (4,2) …

  13. Implementation
  • Each PC keeps track of multiple access patterns (prev_addr and stride)
  • The branch history determines which pattern is relevant (history 01 refers to an outer-loop access)
  • Other branches in the loop can mess up the history – use the compiler to mark loop-termination branches?
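
  A minimal sketch of the correlated table, extending the RptEntry above to two (prev_addr, stride) pairs selected by recent loop-branch history; the two-bit history and the index convention are my own simplification:

      typedef struct {
          uint64_t tag;
          uint64_t prev_addr[2];   /* [0] = inner-loop pattern, [1] = outer-loop pattern */
          int64_t  stride[2];
          RptState state[2];
      } CorrEntry;

      /* The last two loop-branch outcomes select the pattern: "01"
         (inner loop just terminated, outer loop taken) indicates the
         first outer-loop access. */
      int select_pattern(unsigned branch_hist) {
          return ((branch_hist & 0x3u) == 0x1u) ? 1 : 0;
      }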

  14. Benchmark Characteristics

  15. Results

  16. Results Summary
  • Lookahead is the most cost-effective technique
  • The RPT needs 512 entries (4KB capacity)
  • The lookahead distance should be a little more than the memory latency
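
  For example (numbers invented for illustration): if a memory access takes 100 cycles and a loop iteration takes 10 cycles, the LA-PC must run at least 100 / 10 = 10 iterations ahead of the real PC for prefetched data to arrive before it is needed.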

  17. Modern Processors
  • Pentium 4
    – I-cache: uses the bpred and BTB to stay ahead of current execution
    – L2 cache: attempts to stay 256 bytes ahead; monitors history for multiple streams; minimizes fetch of unwanted data (no h/w prefetch in the Pentium III)
  • The Alpha 21364 has 16-entry victim caches in the L1D and L2D and a 16-entry L1I stream buffer
  • The UltraSPARC III has a 2KB prefetch cache

  18. Next Week's Paper
  • "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors", O. Mutlu, J. Stark, C. Wilkerson, Y. N. Patt, Proceedings of HPCA-9, February 2003
  • Useful execution while waiting for a cache miss – perhaps prefetch the next miss?

