Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors

CS 7810 Lecture 10
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors
O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt
Proceedings of HPCA-9, February 2003

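In-Flight Windows
[Figure: Physical Register File (p1-32 and p33-128) beside the Reorder Buffer. ROB entries #1 to #96 map to registers p33 to p128. Entry #1 (p33) is a load instruction with a cache miss of 300 cycles; entries #2 (p34), #3 (p35), ..., #95 (p127), #96 (p128) are marked "c".]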

In-Flight Windows
[Figure: the same Physical Register File (p1-32 and p33-128) and Reorder Buffer, with entry #1 (p33) a load instruction with a cache miss of 300 cycles and entries up to #96 (p128) marked "c". Instruction #97, shown beyond the last ROB entry, is another load instruction with a cache miss of 300 cycles.]
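The figure's point, that the second miss cannot overlap with the first, can be sketched with a toy model. The model is an assumption made for illustration and does not come from the slides: the oldest outstanding miss blocks retirement, every miss that fits in the window behind it overlaps with it completely, and all other work is free. The function name `miss_stall_cycles` is hypothetical.

```python
def miss_stall_cycles(miss_positions, window_size, miss_latency):
    """Cycles spent waiting on misses under a simplified window model.

    Assumptions (illustrative only): the oldest outstanding miss blocks
    retirement, every miss within the window behind it overlaps with it
    completely, and all non-miss work costs nothing.
    """
    cycles = 0
    pending = sorted(miss_positions)
    while pending:
        head = pending[0]
        # The window holds instructions head .. head + window_size - 1,
        # so only misses at or beyond head + window_size remain pending.
        pending = [p for p in pending if p >= head + window_size]
        cycles += miss_latency
    return cycles


# Numbers from the figure: 96 ROB entries, misses at instructions #1 and #97.
print(miss_stall_cycles([1, 97], window_size=96, miss_latency=300))   # 600
# With a hypothetical 128-entry ROB, both misses would overlap.
print(miss_stall_cycles([1, 97], window_size=128, miss_latency=300))  # 300
```

Under these assumptions the two misses are serialized with 96 entries, which is the situation a larger window, or runahead, is meant to avoid.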

Memory Bottlenecks
• 128-entry window, real L2 → 0.77 IPC
• 128-entry window, perfect L2 → 1.69
• 2048-entry window, real L2 → 1.15
• 2048-entry window, perfect L2 → 2.02
• 128-entry window, real L2, runahead → 0.94
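To compare these configurations, it helps to express each IPC as a gain over the 128-entry, real-L2 baseline. The snippet below is plain arithmetic on the slide's numbers; the dictionary keys and the helper name `gain_over_baseline` are made up for this sketch.

```python
# IPC figures quoted on the slide.
ipc = {
    "128-entry, real L2": 0.77,
    "128-entry, perfect L2": 1.69,
    "2048-entry, real L2": 1.15,
    "2048-entry, perfect L2": 2.02,
    "128-entry, real L2, runahead": 0.94,
}


def gain_over_baseline(config, baseline="128-entry, real L2"):
    """Percentage IPC improvement of `config` over `baseline`."""
    return (ipc[config] / ipc[baseline] - 1.0) * 100.0


for name in ipc:
    print(f"{name:30s} {gain_over_baseline(name):6.1f}%")
```

On these figures runahead gains about 22% over the baseline, while a 2048-entry window with the same real L2 gains about 49%.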

Runahead
[Figure: pipeline block diagram with Trace Cache, Current Rename, IssueQ, Regfile (128), Checkpointed Regfile (32), ROB, FUs, L1 D, Runahead Cache, and Retired Rename.]
• When the oldest instruction is a cache miss, behave as if it causes a context switch:
  • checkpoint the committed registers, rename table, return address stack, and branch history register
  • assume a bogus value and start a new thread
  • this thread cannot modify program state, but can prefetch

Runahead
• When the cache miss returns, copy the registers and the mapping and start executing from that ld/st instruction
  • cost of copying back and forth is not trivial
  • many instructions get executed twice
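The entry and exit steps can be sketched as a checkpoint followed by a restore. This is a software analogy under stated assumptions, not the hardware mechanism: the class names `ArchState` and `Core`, the `pc` field, and the use of a deep copy are all hypothetical stand-ins for the checkpointed register file and rename state.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ArchState:
    """State the slides list as checkpointed on entry (plus an assumed pc)."""
    registers: dict = field(default_factory=dict)
    rename_table: dict = field(default_factory=dict)
    return_address_stack: list = field(default_factory=list)
    branch_history: int = 0
    pc: int = 0


class Core:
    def __init__(self):
        self.state = ArchState()
        self.checkpoint = None
        self.prefetched = set()

    def enter_runahead(self, miss_pc):
        # Save committed state; everything after this point is speculative.
        self.state.pc = miss_pc
        self.checkpoint = copy.deepcopy(self.state)

    def runahead_load(self, address):
        # Runahead loads only warm the cache; their results are discarded.
        self.prefetched.add(address)

    def exit_runahead(self):
        # The miss has returned: restore the checkpoint, restart at the miss.
        self.state = self.checkpoint
        self.checkpoint = None
        return self.state.pc


core = Core()
core.state.registers["r1"] = 7
core.enter_runahead(miss_pc=0x40)
core.state.registers["r1"] = 999      # bogus value produced during runahead
core.runahead_load(0x1000)            # the useful side effect: a prefetch
resume_pc = core.exit_runahead()
print(core.state.registers["r1"], hex(resume_pc), sorted(core.prefetched))
```

Only the prefetch survives the restore, and every instruction after the miss runs again, which is the re-execution cost the slide mentions.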

Runahead
• Note that some values are missing:
  • Do not bother to execute instrs that have invalid inputs
  • Accelerates the thread and generates accurate prefetches
  • Unknown store addresses are ignored
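A minimal sketch of how an invalid marker might propagate through dependent instructions. The tuple-based instruction format, the `INV` encoding, and the choice to treat every runahead load as a miss are all simplifications assumed for this example.

```python
INV = None  # marker for "value unknown during runahead" (illustrative encoding)


def run_ahead(instructions, regs):
    """Run toy instructions in runahead mode, propagating INV.

    Each instruction is a tuple:
      ("add",   dst, src1, src2)
      ("load",  dst, addr_reg)
      ("store", addr_reg, src)
    Returns the list of addresses that were prefetched.
    """
    prefetches = []
    for op, *args in instructions:
        if op == "add":
            dst, a, b = args
            if regs[a] is INV or regs[b] is INV:
                regs[dst] = INV              # invalid input: skip the work
            else:
                regs[dst] = regs[a] + regs[b]
        elif op == "load":
            dst, addr = args
            if regs[addr] is not INV:
                prefetches.append(regs[addr])  # address known: prefetch it
            regs[dst] = INV                  # toy model: every load misses
        elif op == "store":
            # Store data is not modelled here; a store whose address is INV
            # is simply ignored, as on the slide.
            pass
    return prefetches


regs = {"r1": INV, "r2": 0x100, "r3": 8}
program = [
    ("add", "r4", "r2", "r3"),   # valid inputs: r4 = 0x108
    ("load", "r5", "r4"),        # valid address: prefetched
    ("add", "r6", "r1", "r3"),   # r1 is INV, so r6 becomes INV
    ("load", "r7", "r6"),        # INV address: no prefetch
    ("load", "r8", "r5"),        # depends on the earlier load: no prefetch
]
print(run_ahead(program, regs))  # [264]
```

The last load shows the limitation for pointer chasing in this model: an address that depends on a missing value cannot be prefetched.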

Runahead
• Runahead instrs write to registers (as before), but runahead stores write to the runahead cache:
  • Runahead cache and L1D are accessed in parallel
  • If a block gets evicted out of runahead cache, data is lost
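The behaviour of the runahead cache can be approximated with a small bounded buffer. This is only a sketch: the class name, the capacity, the oldest-first eviction policy, and the rule that a runahead-cache hit takes priority over L1D are assumptions, and L1D is modelled as a plain dictionary.

```python
from collections import OrderedDict


class RunaheadCache:
    """Tiny buffer for runahead store data; evicted entries are dropped."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # address -> value, oldest first

    def store(self, address, value):
        if address in self.blocks:
            self.blocks.move_to_end(address)
        self.blocks[address] = value
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evicted data is lost

    def load(self, address, l1d):
        # Models the parallel lookup: a runahead-cache hit wins, otherwise
        # the load sees whatever L1D holds (possibly a stale value).
        if address in self.blocks:
            return self.blocks[address]
        return l1d.get(address)


l1d = {0x10: 1, 0x20: 2}
rc = RunaheadCache(capacity=2)
rc.store(0x10, 99)
print(rc.load(0x10, l1d))   # 99: forwarded from the runahead store
rc.store(0x20, 5)
rc.store(0x30, 6)           # exceeds capacity, evicts the 0x10 entry
print(rc.load(0x10, l1d))   # 1: the runahead store's data is gone
print(l1d)                  # L1D itself is never modified by runahead stores
```

Keeping runahead stores out of L1D is what lets the thread forward data to later loads without modifying program state.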

Runahead
• The branch predictor gets accessed/updated twice
• Cannot resolve branch mispredicts if the branch has an invalid input

Another Form of Runahead
[Figure: a Primary Thread and a Runahead Thread, with an occasional state copy and re-start.]

Methodology
• 80 benchmarks, 147 memory-bound code sequences of 30M instructions each: SPEC, Web, Media, Server, workstation, productivity
• Pentium 4 hardware prefetcher: eight stream buffers that stay 256 bytes ahead
• Also evaluate a “future baseline” with twice as many resources
• Perfect memory disambiguation, 500-cycle memory access

Results
• Runahead improves performance by 22%
• Synergistic interaction between prefetch and runahead: is the stream buffer not keeping up?

Other Results
• Runahead with a 128-entry window does as well as a 384-entry window
• A better front-end improves benefits from runahead
• On average, 431 useful instructions per runahead and 280 after a mispredict
• Without the runahead cache, only half the improvement is observed

Unanswered Questions
• How many re-execs? How many invalid instrs?
• How much wasted power (re-execs, double writes to checkpoints)?
• How many accesses to hash tables, pointers, and branch-dependent data?

Alternative Approaches
• Does runahead lead to excessive power and verification complexity?
• Better stride prefetchers or stream buffers?
• Is this the best way to support a large in-flight window (register file, issueq, ROB)?

Next Week’s Paper
• “Delaying Physical Register Allocation Through Virtual-Physical Registers”, T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals, Proceedings of MICRO-32, November 1999
