This paper explores the concept of runahead execution as an alternative to very large instruction windows for out-of-order processors. It discusses the benefits and challenges of implementing runahead execution and presents experimental results showing its performance improvement compared to traditional execution methods. The paper also raises unanswered questions and suggests alternative approaches for further research.
CS 7810 Lecture 10
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors
O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt
Proceedings of HPCA-9, February 2003
In-Flight Windows
[Figure: physical register file (p1-32 and p33-128) alongside the reorder buffer. ROB entry #1 (p33) is a load instruction with a 300-cycle cache miss; entries #2 (p34) through #96 (p128) follow it, most of them marked "c". In the second build of the slide, instruction #97 is another load instruction with a 300-cycle cache miss.]
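A back-of-the-envelope sketch of why the window in the figure stalls. It assumes a hypothetical dispatch width of 3 instructions per cycle (the slide gives no width) and counts the 95 ROB slots behind the missing load; the function name and parameters are illustrative only.

```python
def window_stall(free_slots, dispatch_width, miss_latency):
    """Return (cycles until the window fills, cycles stalled after that)."""
    # Ceiling division: cycles needed to fill the remaining ROB slots.
    fill_cycles = -(-free_slots // dispatch_width)
    # Once the window is full, nothing new enters until the miss returns
    # and the load at the head of the ROB can retire.
    stall_cycles = max(0, miss_latency - fill_cycles)
    return fill_cycles, stall_cycles


# ROB entries #2..#96 are free behind the load; width 3 is an assumption.
fill, stall = window_stall(free_slots=95, dispatch_width=3, miss_latency=300)
print(f"window full after {fill} cycles, then stalled for {stall} cycles")
```

Under these assumptions the window fills long before the miss returns, so the load at #97 cannot start its own miss until the first one completes, and the two 300-cycle latencies are serialized rather than overlapped.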
Memory Bottlenecks
• 128-entry window, real L2: 0.77 IPC
• 128-entry window, perfect L2: 1.69 IPC
• 2048-entry window, real L2: 1.15 IPC
• 2048-entry window, perfect L2: 2.02 IPC
• 128-entry window, real L2, runahead: 0.94 IPC
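The relative gains implied by these IPC numbers can be worked out directly. The sketch below only does arithmetic on the values from the slide; the dictionary layout and the helper name `gain` are illustrative choices.

```python
# IPC values copied from the slide.
# Keys are (window entries, L2 model, runahead enabled).
ipc = {
    (128, "real", False): 0.77,
    (128, "perfect", False): 1.69,
    (2048, "real", False): 1.15,
    (2048, "perfect", False): 2.02,
    (128, "real", True): 0.94,
}


def gain(new, base):
    """Fractional IPC improvement of `new` over `base`."""
    return new / base - 1.0


base = ipc[(128, "real", False)]
runahead = ipc[(128, "real", True)]
big_window = ipc[(2048, "real", False)]

runahead_gain = gain(runahead, base)
big_window_gain = gain(big_window, base)
# Share of the 128 -> 2048 entry improvement that runahead recovers.
gap_closed = (runahead - base) / (big_window - base)

print(f"runahead gain:     {runahead_gain:.1%}")
print(f"2048-entry gain:   {big_window_gain:.1%}")
print(f"gap closed:        {gap_closed:.1%}")
```

Runahead on a 128-entry window comes out at roughly 22% over the baseline, against roughly 49% for the 2048-entry window, so it recovers a bit under half of the large-window benefit on these averages.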
Runahead
[Figure: pipeline block diagram with Trace Cache, Current Rename, IssueQ, Regfile (128), Checkpointed Regfile (32), ROB, L1 D, Runahead Cache, Retired Rename, FUs]
• When the oldest instruction is a cache miss, behave as if it causes a context switch:
• checkpoint the committed registers, rename table, return address stack, and branch history register
• assume a bogus value and start a new thread
• this thread cannot modify program state, but can prefetch
Runahead
• When the cache miss returns, copy the registers and the mapping and start executing from that ld/st instruction
• cost of copying back and forth is not trivial
• many instructions get executed twice
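The enter/exit sequence can be sketched as a toy model. This is an illustration under simplifying assumptions, not the paper's hardware: only registers and the PC are checkpointed (the rename table, return address stack, and branch history are left out), `None` stands in for the bogus value, and the class and method names are hypothetical.

```python
import copy


class ToyCore:
    """Toy core that can checkpoint and restore architectural state."""

    def __init__(self, regs, pc):
        self.regs = regs
        self.pc = pc
        self.checkpoint = None  # None means the core is in normal mode

    def enter_runahead(self, miss_pc, dest_reg):
        # Save the state as of the missing load before touching anything.
        self.checkpoint = (copy.deepcopy(self.regs), miss_pc)
        # The load's result is unknown, so give it a bogus (invalid) value.
        self.regs[dest_reg] = None

    def exit_runahead(self):
        # Discard everything runahead did and restart at the missing load.
        self.regs, self.pc = self.checkpoint
        self.checkpoint = None


core = ToyCore({"r1": 5, "r2": 7}, pc=100)
core.enter_runahead(miss_pc=100, dest_reg="r1")
core.regs["r2"] = 999   # speculative work done during runahead
core.pc = 140
core.exit_runahead()    # miss has returned
print(core.regs, core.pc)
```

Everything executed between `enter_runahead` and `exit_runahead` is thrown away and run again from the load, which is the re-execution cost the slide points out.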
Runahead
• Note that some values are missing:
• Do not bother to execute instrs that have invalid inputs
• Accelerates the thread and generates accurate prefetches
• Unknown store addresses are ignored
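A minimal sketch of how invalid values might propagate during runahead. The instruction tuples, register names, and the `run_ahead` function are hypothetical, and `None` stands in for a hardware invalid (INV) bit; stores are omitted to keep the example short.

```python
INV = None  # stands in for the hardware INV bit on a register


def run_ahead(instrs, regs, cached):
    """Execute (op, dest, *srcs) tuples speculatively; return (regs, prefetches)."""
    regs = dict(regs)  # work on a copy; architectural state stays untouched
    prefetches = []
    for op, dest, *srcs in instrs:
        vals = [regs[s] for s in srcs]
        if any(v is INV for v in vals):
            # Invalid input: skip execution and mark the result invalid too.
            regs[dest] = INV
            continue
        if op == "add":
            regs[dest] = vals[0] + vals[1]
        elif op == "load":
            addr = vals[0]
            if addr in cached:
                regs[dest] = cached[addr]
            else:
                # A miss with a valid address becomes a prefetch.
                prefetches.append(addr)
                regs[dest] = INV
    return regs, prefetches


regs = {"r1": INV, "r2": 0x100, "r3": 8}  # r1 is the missing load's result
cached = {0x100: 42}
program = [
    ("add", "r4", "r1", "r3"),   # depends on r1, so r4 becomes INV
    ("add", "r5", "r2", "r3"),   # valid: r5 = 0x108
    ("load", "r6", "r5"),        # valid address, not cached: prefetch
    ("load", "r7", "r4"),        # INV address: skipped, no prefetch
    ("load", "r8", "r2"),        # hits in the cache
]
final_regs, prefetches = run_ahead(program, regs, cached)
print([hex(a) for a in prefetches], final_regs["r8"])
```

Only the load whose address does not depend on the missing value produces a prefetch, which is why the prefetches generated this way use real computed addresses rather than guesses.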
Runahead
• Runahead instrs write to registers (as before), but runahead stores write to the runahead cache:
• Runahead cache and L1D are accessed in parallel
• If a block gets evicted out of runahead cache, data is lost
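The runahead cache behavior can be sketched with a small software model. The class name, the capacity, and the oldest-first eviction policy are assumptions made for illustration; the slide does not specify a replacement policy, and the L1D is modeled as a plain dictionary.

```python
from collections import OrderedDict


class RunaheadCache:
    """Toy runahead cache: holds speculative stores, never writes back."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def store(self, addr, value):
        if addr in self.lines:
            self.lines.pop(addr)
        elif len(self.lines) >= self.capacity:
            # Evict the oldest entry; its data is simply dropped.
            self.lines.popitem(last=False)
        self.lines[addr] = value

    def load(self, addr, l1d):
        # Both structures are probed; a runahead cache hit takes priority.
        if addr in self.lines:
            return self.lines[addr]
        return l1d.get(addr)


l1d = {0x10: 1}
rc = RunaheadCache(capacity=2)
rc.store(0x10, 99)
first = rc.load(0x10, l1d)    # forwarded from the runahead store
rc.store(0x20, 5)
rc.store(0x30, 6)             # evicts the entry for 0x10
second = rc.load(0x10, l1d)   # falls back to the older L1D value
print(first, second, l1d)
```

After the eviction the load sees the older L1D value, so a later runahead load can compute with stale data. Because runahead results are discarded anyway, this affects only prefetch accuracy, not correctness, and the L1D itself is never modified by runahead stores.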
Runahead
• The branch predictor gets accessed/updated twice
• Cannot resolve branch mispredicts if the branch has an invalid input
Another Form of Runahead
[Figure: a primary thread and a runahead thread, with occasional state copy and re-start]
Methodology
• 80 benchmarks, 147 memory-bound code sequences of 30M instructions each: SPEC, Web, Media, Server, workstation, productivity
• Pentium 4 hardware prefetcher: eight stream buffers that stay 256 bytes ahead
• Also evaluate a “future baseline” with twice as many resources
• Perfect memory disambiguation, 500-cycle memory access
Results
• Runahead improves performance by 22%
• Synergistic interaction between prefetch & runahead: is the stream buffer not keeping up?
Other Results
• Runahead with a 128-entry window does as well as a 384-entry window
• A better front-end improves benefits from runahead
• On average, 431 useful instructions per runahead and 280 after a mispredict
• Without the runahead cache, only half the improvement is observed
Unanswered Questions
• How many re-execs? How many invalid instrs?
• How much wasted power? (re-execs, double writes to checkpoints)
• How many accesses to hash tables, pointers, and branch-dependent data?
Alternative Approaches
• Does runahead lead to excessive power and verification complexity?
• Better stride prefetchers or stream buffers?
• Is this the best way to support a large in-flight window (register file, issueq, ROB)?
Next Week’s Paper
• “Delaying Physical Register Allocation Through Virtual-Physical Registers”, T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals, Proceedings of MICRO-32, November 1999