Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors

This paper explores the concept of runahead execution as an alternative to very large instruction windows for out-of-order processors. It discusses the benefits and challenges of implementing runahead execution and presents experimental results showing its performance improvement compared to traditional execution methods. The paper also raises unanswered questions and suggests alternative approaches for further research.

Presentation Transcript


  1. CS 7810 Lecture 10
     Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors
     O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt
     Proceedings of HPCA-9, February 2003

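  2. In-Flight Windows
     [Figure: a Physical Register File split into p1-32 and p33-128, beside a Reorder Buffer with entries #1 to #96 mapped to p33 to p128. Entry #1 (p33) is a load instruction with a cache miss of 300 cycles; entries #2, #3, #95, and #96 are marked "c"; #4 (p36) is unmarked.]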

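  3. In-Flight Windows
     [Figure: a Physical Register File split into p1-32 and p33-128, beside a Reorder Buffer with entries #1 to #96 mapped to p33 to p128. Entry #1 (p33) is a load instruction with a cache miss of 300 cycles; entries #2, #3, #95, and #96 are marked "c". A further instruction, #97, is also a load with a cache miss of 300 cycles.]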
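The stall pictured on slides 2 and 3 can be put in rough numbers. The sketch below is a back-of-the-envelope model, not something from the slides: it takes the 96-entry reorder buffer and the 300-cycle miss from the figure, and assumes a hypothetical 4-wide dispatch (the slides do not state the width). The function name `stall_cycles` is illustrative.

```python
def stall_cycles(rob_entries: int, miss_latency: int, dispatch_width: int) -> int:
    """Cycles the front end waits while the oldest load blocks retirement.

    Simplified model: the missing load takes one ROB entry, the remaining
    entries fill at dispatch_width instructions per cycle, and nothing can
    retire until the miss returns.
    """
    # Ceiling division: cycles needed to fill the rest of the window.
    fill_cycles = -(-(rob_entries - 1) // dispatch_width)
    return max(0, miss_latency - fill_cycles)


# 96-entry ROB and 300-cycle miss from the slide; width of 4 is an assumption.
idle = stall_cycles(rob_entries=96, miss_latency=300, dispatch_width=4)
print(f"window full, dispatch idle for about {idle} cycles")
```

Under these assumptions the window fills in a few dozen cycles and the machine then idles for most of the miss, which is also why the second miss at #97 cannot start early: it never enters the window.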

  4. Memory Bottlenecks
     • 128-entry window, real L2 → 0.77 IPC
     • 128-entry window, perfect L2 → 1.69 IPC
     • 2048-entry window, real L2 → 1.15 IPC
     • 2048-entry window, perfect L2 → 2.02 IPC
     • 128-entry window, real L2, runahead → 0.94 IPC
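The IPC figures on slide 4 are easier to compare as ratios against the 128-entry, real-L2 baseline. The short script below only does arithmetic on the numbers from the slide; the dictionary keys and variable names are illustrative.

```python
# IPC values copied from slide 4.
ipc = {
    "128-entry, real L2": 0.77,
    "128-entry, perfect L2": 1.69,
    "2048-entry, real L2": 1.15,
    "2048-entry, perfect L2": 2.02,
    "128-entry, real L2, runahead": 0.94,
}

baseline = ipc["128-entry, real L2"]

# Speedup of every configuration relative to the baseline.
speedup = {name: value / baseline for name, value in ipc.items()}

# Gain from runahead on the small window.
runahead_gain = speedup["128-entry, real L2, runahead"] - 1.0

# Share of the 2048-entry window's IPC gain that runahead recovers.
captured = (ipc["128-entry, real L2, runahead"] - baseline) / (
    ipc["2048-entry, real L2"] - baseline
)

for name, s in speedup.items():
    print(f"{name:32s} {s:.2f}x")
print(f"runahead gain over baseline: {runahead_gain:.1%}")
print(f"fraction of 2048-entry gain captured: {captured:.1%}")
```

The runahead gain computed this way, 0.94 / 0.77, is about 22%, which agrees with the headline result on slide 13.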

  5. Runahead
     [Figure: pipeline diagram with Trace Cache, Current Rename, Retired Rename, IssueQ, Regfile (128), Checkpointed Regfile (32), ROB, FUs, L1 D, and Runahead Cache]
     • When the oldest instruction is a cache miss, behave like it causes a context-switch:
       • checkpoint the committed registers, rename table, return address stack, and branch history register
       • assume a bogus value and start a new thread
       • this thread cannot modify program state, but can prefetch

  6. Runahead
     [Figure: same pipeline diagram as slide 5]
     • When the cache miss returns, copy the registers and the mapping and start executing from that ld/st instruction
       • cost of copying back and forth is not trivial
       • many instructions get executed twice
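The entry and exit steps on slides 5 and 6 amount to a checkpoint and a restore. The following is a minimal software sketch of that idea, not the hardware design from the paper: the class and field names (`ArchState`, `RunaheadCore`, `enter_runahead`, `exit_runahead`) are hypothetical, and a deep copy stands in for the checkpointed register file and the other saved structures.

```python
import copy
from dataclasses import dataclass


@dataclass
class ArchState:
    """The state the slide says is checkpointed (names are illustrative)."""
    regs: dict            # committed register values
    rename_table: dict    # architectural-to-physical mapping
    return_stack: list    # return address stack
    branch_history: int   # branch history register


class RunaheadCore:
    def __init__(self, state: ArchState):
        self.state = state
        self.checkpoint = None
        self.resume_pc = None

    @property
    def in_runahead(self) -> bool:
        return self.checkpoint is not None

    def enter_runahead(self, miss_pc: int) -> None:
        # Save everything, then let the speculative thread modify self.state.
        self.checkpoint = copy.deepcopy(self.state)
        self.resume_pc = miss_pc

    def exit_runahead(self) -> int:
        # Throw away all runahead results and restart at the missing ld/st.
        self.state = self.checkpoint
        self.checkpoint = None
        pc, self.resume_pc = self.resume_pc, None
        return pc


core = RunaheadCore(ArchState({"r1": 5}, {"r1": "p33"}, [0x40], 0b1011))
core.enter_runahead(miss_pc=0x400)
core.state.regs["r1"] = None          # bogus value used during runahead
restart_pc = core.exit_runahead()     # state is back to the checkpoint
```

Everything executed between the two calls is discarded, which is the sketch's version of "many instructions get executed twice": the work after `restart_pc` is redone in normal mode.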

  7. Runahead
     [Figure: same pipeline diagram as slide 5]
     • Note that some values are missing:
       • Do not bother to execute instrs that have invalid inputs
       • Accelerates the thread and generates accurate prefetches
       • Unknown store addresses are ignored
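The handling of missing values on slide 7 can be illustrated with a toy interpreter that propagates an "invalid" marker through registers. This is a sketch under simplifying assumptions: the three-field instruction tuples, the `INV` marker, and the `run_ahead` function are made up for illustration, and stores are left out entirely.

```python
INV = None  # marker for a register whose value is unknown during runahead


def run_ahead(program, regs, memory, cached):
    """Execute `program` in runahead mode and return the prefetched addresses.

    program: list of (op, dst, a, b) tuples
        ("load", dst, base_reg, offset)  ->  dst = memory[regs[base_reg] + offset]
        ("add",  dst, reg_a,    reg_b)   ->  dst = regs[reg_a] + regs[reg_b]
    cached: set of addresses currently present in the cache
    """
    prefetches = []
    for op, dst, a, b in program:
        if op == "load":
            base = regs[a]
            if base is INV:
                regs[dst] = INV            # address unknown: skip the load
            elif base + b in cached:
                regs[dst] = memory[base + b]
            else:
                prefetches.append(base + b)  # miss: start the fetch, do not wait
                regs[dst] = INV
        elif op == "add":
            x, y = regs[a], regs[b]
            regs[dst] = INV if x is INV or y is INV else x + y
    return prefetches


# r1 is the destination of the load that triggered runahead, so it is invalid.
regs = {"r1": INV, "r2": 0x1000}
memory = {0x1000: 8}
program = [
    ("add", "r3", "r1", "r2"),   # depends on r1, becomes invalid
    ("load", "r4", "r3", 0),     # invalid address, skipped
    ("load", "r5", "r2", 64),    # valid address, misses, prefetched
    ("load", "r6", "r2", 0),     # valid address, hits
    ("add", "r7", "r6", "r2"),   # fully valid, computed normally
]
prefetched = run_ahead(program, regs, memory, cached={0x1000})
```

Only the load whose address is known and not cached produces a prefetch; instructions that depend on the original miss are marked invalid and cost nothing further.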

  8. Runahead
     [Figure: same pipeline diagram as slide 5]
     • Runahead instrs write to registers (as before), but runahead stores write to the runahead cache:
       • Runahead cache and L1D are accessed in parallel
       • If a block gets evicted out of runahead cache, data is lost
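The runahead cache on slide 8 can be modeled as a small side buffer that absorbs speculative stores and forwards them to later runahead loads. The sketch below makes several assumptions that are not on the slide: the `RunaheadCache` class, its per-address granularity, and the oldest-first eviction policy are all illustrative. The sequential lookup in `load` stands in for the parallel hardware probe, with the runahead cache taking priority on a hit.

```python
from collections import OrderedDict


class RunaheadCache:
    """Tiny buffer for stores executed in runahead mode (illustrative model)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()   # addr -> value, oldest entry first

    def store(self, addr: int, value: int) -> None:
        if addr in self.lines:
            self.lines.move_to_end(addr)
        elif len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evicted data is simply lost
        self.lines[addr] = value

    def load(self, addr: int, l1d: dict):
        # Prefer the speculative copy; otherwise fall back to the L1 D-cache.
        if addr in self.lines:
            return self.lines[addr]
        return l1d.get(addr)

    def flush(self) -> None:
        # On exit from runahead mode nothing here reaches program state.
        self.lines.clear()


l1d = {0x10: 1}
rc = RunaheadCache(capacity=2)
rc.store(0x10, 99)
forwarded = rc.load(0x10, l1d)      # sees the runahead store, L1D untouched
rc.store(0x20, 5)
rc.store(0x30, 6)                   # evicts the entry for 0x10
after_evict = rc.load(0x10, l1d)    # falls back to the older L1D value
```

In this model a load after eviction silently reads the older L1D value, which is one way to interpret "data is lost"; the slide does not say how the hardware treats that case.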

  9. Runahead
     [Figure: same pipeline diagram as slide 5]
     • The branch predictor gets accessed/updated twice
     • Cannot resolve branch mispredicts if the branch has an invalid input

  10. Another Form of Runahead
      [Figure: a Primary Thread and a Runahead Thread, with occasional state copy and re-start]

  11. Methodology
      • 80 benchmarks, 147 memory-bound code sequences of 30M instructions each: SPEC, Web, Media, Server, workstation, productivity
      • Pentium 4 hardware prefetcher: eight stream buffers that stay 256 bytes ahead
      • Also evaluate a “future baseline” with twice as many resources
      • Perfect memory disambiguation, 500-cycle memory access
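To make "stream buffers that stay 256 bytes ahead" concrete, here is a sketch of a single sequential stream buffer. It is a generic textbook-style model, not a description of the actual Pentium 4 prefetcher: the 64-byte line size, the `StreamBuffer` class, and the `on_access` method are assumptions, and only the 256-byte distance comes from the slide.

```python
LINE = 64     # assumed cache line size in bytes (not given on the slide)
AHEAD = 256   # prefetch distance from the slide


class StreamBuffer:
    """Tracks one ascending stream and keeps prefetches AHEAD bytes in front."""

    def __init__(self, start_addr: int):
        # First line that has not been requested yet.
        self.next_line = (start_addr // LINE + 1) * LINE

    def on_access(self, addr: int) -> list:
        """Return the line addresses to prefetch after a demand access."""
        limit = (addr // LINE) * LINE + AHEAD
        issued = []
        while self.next_line <= limit:
            issued.append(self.next_line)
            self.next_line += LINE
        return issued


buf = StreamBuffer(start_addr=0x1000)
first = buf.on_access(0x1000)    # fills the initial 256-byte lead
second = buf.on_access(0x1040)   # one new line keeps the lead constant
third = buf.on_access(0x1048)    # same line again, nothing new to fetch
```

A prefetcher like this only helps with regular, sequential access, whereas runahead prefetches whatever addresses the program actually computes. That difference is one possible reading of the synergy between the two noted in the results.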

  12. Methodology

  13. Results
      • Runahead improves performance by 22%
      • Synergistic interaction between prefetch & runahead: is the stream buffer not keeping up?

  14. Other Results
      • Runahead with a 128-entry window does as well as a 384-entry window
      • A better front-end improves benefits from runahead
      • On average, 431 useful instructions per runahead and 280 after a mispredict
      • Without the runahead cache, only half the improvement is observed

  15. Unanswered Questions
      • How many re-execs? How many invalid instrs?
      • How much wasted power? (re-execs, double writes to checkpoints)
      • How many accesses to hash tables, pointers, and branch-dependent data?

  16. Alternative Approaches
      • Does runahead lead to excessive power and verification complexity?
      • Better stride prefetchers or stream buffers?
      • Is this the best way to support a large in-flight window (register file, issueq, ROB)?

  17. Next Week’s Paper
      • “Delaying Physical Register Allocation Through Virtual-Physical Registers”, T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals, Proceedings of MICRO-32, November 1999
