Ginger: Control Independence Using Tag Rewriting
Andrew Hilton, Amir Roth
University of Pennsylvania, {adhilton, amir}@cis.upenn.edu
ISCA-34 :: June, 2007
Control Independence (CI)
Branch mispredictions limit single-thread performance
• Improve prediction accuracy? Hard
• Predicate? Cost on correct predictions
• Exploit control independence (CI) to reduce squash penalty
This paper: Ginger, a new (better) CI microarchitecture
Running example (acronyms to remember: CI, CD):
  A: bez r1, D
  B: r2=1         } control dependent (CD) insns
  C: jmp E        }
  D: r2=2         }
  E: r3=r1+1      } control independent (CI) insns
  F: r4=r2+1      }
  G: r5=ld(r4)    }
Exploiting Control Independence
Example: wrong path is A, D, E, F, G; correct path is A, B, C, E, F, G
Conventional recovery
• Squash all post mis-prediction insns
• Fetch/execute all correct-path insns
• Re-fetch/re-execute CI insns (waste)
CI recovery
• Squash only wrong-path CD insns
• Fetch/execute only correct-path CD insns
• Preserve CI insns: E, F, G
• Preserve un-dispatched CI insns: H, I…
Open questions
• How to “insert” CD insns?
• What to do about CI insns that depend on CD insns?
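The difference between the two recovery policies can be sketched in a few lines. This is only an illustration on the slide's example, not the hardware mechanism; the function name `squashed_on_mispredict` and the list-of-names window model are assumptions made for the sketch.

```python
def squashed_on_mispredict(window, branch, cd_wrong_path, ci_recovery):
    """Return the names of in-flight insns squashed when `branch` mispredicts.

    `window` is the in-flight instruction window in fetch order (hypothetical
    model); `cd_wrong_path` is the set of wrong-path control-dependent insns.
    """
    younger = window[window.index(branch) + 1:]
    if ci_recovery:
        # CI recovery: squash only wrong-path control-dependent insns.
        return [i for i in younger if i in cd_wrong_path]
    # Conventional recovery: squash everything fetched after the branch.
    return younger


window = ["A", "D", "E", "F", "G"]  # wrong path from the example
print(squashed_on_mispredict(window, "A", {"D"}, ci_recovery=False))
print(squashed_on_mispredict(window, "A", {"D"}, ci_recovery=True))
```

With conventional recovery all four younger insns are squashed; with CI recovery only D is, and E, F, G stay in the window.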
Out-of-Order Renaming
  Start: wrong path    CI halfway        Goal: correct path
  A: bez p1, D         A: bez p1, D      A: bez p1, D
  D: p2=2              B: p6=1           B: p6=1
                       C: jmp E          C: jmp E
  E: p3=p1+1           E: p3=p1+1        E: p3=p1+1
  F: p4=p2+1           F: p4=p2+1        F: p4=p6+1
  G: p5=ld(p4)         G: p5=ld(p4)      G: p5=ld(p4)
CI step 1: replace CD insns
CI step 2: out-of-order renaming
• Step 1 changes inputs for some CI insns
• CI data dependent (CIDD) insns: F and G (transitively, via F)
• Must identify CIDD insns and repair their inputs
• Must re-issue CIDD insns that have already issued
• Key feature of CI; its implementation distinguishes CI schemes
(Acronym to remember: CIDD)
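Identifying CIDD insns amounts to a transitive data-dependence walk over the CI region. A minimal sketch on logical registers, assuming instructions are given as `(name, dest, sources)` tuples in program order; `find_cidd` is a hypothetical helper, not something from the paper.

```python
def find_cidd(ci_insns, changed_regs):
    """Return CI insns that (transitively) read a register changed by the CD region."""
    tainted = set(changed_regs)
    cidd = []
    for name, dest, srcs in ci_insns:
        if tainted & set(srcs):
            cidd.append(name)
            tainted.add(dest)      # the result now depends on the CD region
        else:
            tainted.discard(dest)  # overwritten with a CD-independent value
    return cidd


ci = [("E", "r3", ["r1"]), ("F", "r4", ["r2"]), ("G", "r5", ["r4"])]
print(find_cidd(ci, {"r2"}))  # ['F', 'G']
```

E reads only r1 and is CI data independent; F reads r2 directly, and G reads F's output.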
Outline
Control Independence (CI) and out-of-order renaming
Prior CI microarchitectures (ooo renaming schemes)
• “Walker”
• Skipper
Ginger
Comparative performance evaluation
Conclusion
“Walker” [Rotenberg+, HPCA’99]
  A: bez p1, D
  B: p6=1
  C: jmp E
  E: p3=p1+1
  F: p4=p2+1  →  F: p4=p6+1    (input changed: re-dispatch)
  G: p5=ld(p4)                 (input transitively changed: re-dispatch)
Ooo renaming: walk all CI insns
• Re-rename, re-dispatch if inputs (transitively) changed
• Reactive: no penalty on correct prediction (no worse than base)
• High overhead on mis-prediction
  • Walks and re-renames CI data independent (CIDI) insns: E
  • Typically many more of those than CIDD
• Still better than baseline
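The walk can be sketched as follows. This is a simplified software model under stated assumptions: each CI insn keeps its destination physical register, the dictionary field names are invented for the sketch, and `walker_recover` is a hypothetical name.

```python
def walker_recover(ci_insns, map_table):
    """Walk every CI insn in program order; return (insns walked, names re-dispatched)."""
    stale = set()      # physical registers whose values will be recomputed
    redispatch = []
    walked = 0
    for insn in ci_insns:
        walked += 1    # every CI insn is re-renamed, changed or not
        new_tags = [map_table[r] for r in insn["srcs"]]
        if new_tags != insn["src_tags"] or stale & set(new_tags):
            insn["src_tags"] = new_tags
            stale.add(insn["dest_tag"])
            redispatch.append(insn["name"])
        map_table[insn["dest"]] = insn["dest_tag"]
    return walked, redispatch


ci = [
    {"name": "E", "dest": "r3", "dest_tag": "p3", "srcs": ["r1"], "src_tags": ["p1"]},
    {"name": "F", "dest": "r4", "dest_tag": "p4", "srcs": ["r2"], "src_tags": ["p2"]},
    {"name": "G", "dest": "r5", "dest_tag": "p5", "srcs": ["r4"], "src_tags": ["p4"]},
]
# Map table after renaming the correct-path CD insns (B: p6=1).
print(walker_recover(ci, {"r1": "p1", "r2": "p6"}))  # (3, ['F', 'G'])
```

All three CI insns are walked even though only two need re-dispatch, which is the overhead the slide attributes to this scheme.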
Skipper [Cher+, MICRO’01]
  A: bez p1, D
  B: p6=1
  C: jmp E
  P: p9=??  →  P: p9=p6    (pre-synchronize “pmove”)
  E: p3=p1+1
  F: p4=p9+1
  G: p5=ld(p4)
Ooo renaming: proactive CI + pre-synchronization
• Defer CD fetch until branch resolves (reserve space)
• Pre-synchronize: predict CD output registers (r2) and pre-allocate
• After correct-path CD, dispatch/execute “pmoves”
• Low ooo renaming overhead on mis-prediction
  • Proportional to CD region register output set
• Same overhead even on correct prediction
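Pre-synchronization can be illustrated with a small sketch. The helper names `presynchronize` and `make_pmoves`, the free-list model, and the initial mapping `r2 → p0` are all assumptions for illustration.

```python
def presynchronize(map_table, predicted_cd_outputs, free_list):
    """Pre-allocate one physical register per predicted CD output register."""
    sync = {}
    for reg in predicted_cd_outputs:
        sync[reg] = free_list.pop(0)
        map_table[reg] = sync[reg]  # CI insns rename against this tag
    return sync


def make_pmoves(sync, cd_map_table):
    """Once the CD region is renamed, emit one (dest_tag, src_tag) pmove per output."""
    return [(dest, cd_map_table[reg]) for reg, dest in sync.items()]


map_table = {"r1": "p1", "r2": "p0"}     # p0: assumed mapping of r2 before the branch
sync = presynchronize(map_table, ["r2"], ["p9"])
print(map_table["r2"])                   # F renames its r2 input as p9
cd_map = {"r1": "p1", "r2": "p6"}        # after the correct-path CD insn B: p6=1
print(make_pmoves(sync, cd_map))         # [('p9', 'p6')]
```

The number of pmoves depends only on the predicted output set, so the same pmoves are dispatched whether or not the branch was mispredicted.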
OOO Renaming: “Walker” + Skipper → Ginger
“Walker”: walk CI insns
• Reactive: no overhead on correct predictions
• High overhead on mis-predictions: proportional to CI insns
Skipper: pre-synchronize
• Low overhead on mis-predictions: proportional to CD registers
• Proactive: same overhead on correct predictions
Ginger: tag rewriting
• Low overhead on mis-predictions: proportional to CD registers
• Reactive: no overhead on correct predictions
  • Proactive also possible, but not really worth it
• Uses (mostly) existing hardware
• Supports ooo renaming of loads
Outline
Control Independence (CI) and out-of-order renaming
Prior CI microarchitectures (ooo renaming schemes)
Ginger
• Tag rewriting
• Selective re-dispatch
• Out-of-order renaming for loads
• Inserting CD insns
Comparative performance evaluation
Conclusion
Tag Rewriting at 32K Feet
  CI halfway        Goal: correct path
  A: bez p1, D      A: bez p1, D
  B: p6=1           B: p6=1
  C: jmp E          C: jmp E
  E: p3=p1+1        E: p3=p1+1
  F: p4=p2+1        F: p4=p6+1
  G: p5=ld(p4)      G: p5=ld(p4)
Recall: ooo renaming
• Correctness: repair F’s r2 input p2→p6
• Performance: without also walking E and G
Tag rewriting: ooo renaming by register, not by insn
• Identify which registers have changed (r2: p2→p6)
• Do a fast “search-replace” on CI insns
• 1 step (“search-replace” p2→p6), not 3 (re-rename E, F, G)
• How to actually do both of these things?
Tag Rewriting 1: Tracking Register Changes
  Start: wrong path    CI halfway
  A: bez p1, D         A: bez p1, D
  D: p2=2              B: p6=1
                       C: jmp E
  E: p3=p1+1           E: p3=p1+1
  F: p4=p2+1           F: p4=p2+1
  G: p5=ld(p4)         G: p5=ld(p4)

                                r1  r2  r3
  CI checkpoint (wrong path)    p1  p2  p3
  Active map (correct path)     p1  p6  p3
  Written bits (OR of paths)    0   1   0
Active map table: correct-path mappings at E (CI start)
Need: checkpoint for wrong-path mappings at E
• Bitvectors identify which registers must be rewritten
• From→to = wrong-path→correct-path
• How to get wrong-path checkpoint (“CI checkpoint”)?
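Computing the from→to pairs can be sketched as a comparison of the two map tables, filtered by the written-register bitvectors. Dictionaries stand in for the hardware tables, and `registers_to_rewrite` is a hypothetical name.

```python
def registers_to_rewrite(ci_checkpoint, active_map, wrong_bits, correct_bits):
    """Return {old_tag: new_tag} for logical registers written on either path."""
    rewrites = {}
    for reg in ci_checkpoint:
        # OR of the two written-register bitvectors selects candidate registers.
        if wrong_bits.get(reg, 0) | correct_bits.get(reg, 0):
            old, new = ci_checkpoint[reg], active_map[reg]
            if old != new:
                rewrites[old] = new
    return rewrites


ci_checkpoint = {"r1": "p1", "r2": "p2", "r3": "p3"}  # wrong-path mappings at E
active_map = {"r1": "p1", "r2": "p6", "r3": "p3"}     # correct-path mappings at E
print(registers_to_rewrite(ci_checkpoint, active_map, {"r2": 1}, {"r2": 1}))
# {'p2': 'p6'}
```

The work is bounded by the number of registers the CD region writes, not by the number of CI insns in flight.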
Tag Rewriting 0: Setup
  Start: wrong path
  A: bez p1, D
  D: p2=2
  E: p3=p1+1
  F: p4=p2+1
  G: p5=ld(p4)

                     r1  r2  r3
  Map table at E     p1  p2  p3
  Written bits       0   1   0
How do we know to create the CI checkpoint?
• Predict that branch A is low-confidence [Jacobsen+ MICRO’96]
• Start tracking written registers
How do we know where to create it?
• Predict A’s convergence PC: E [Cher+ MICRO’01, Collins+ MICRO’04]
• Take CI checkpoint before convergence PC is renamed
Tag Rewriting 2: Actual Tag Rewriting
  CI halfway
  A: bez p1, D
  B: p6=1
  C: jmp E
  E: p3=p1+1
  F: p4=p2+1
  G: p5=ld(p4)
[Figure: map tables over r1, r2, r3 (active map p1 p6 p3; checkpoints p1 p2 p3) and the issue queue entry F: p4=p2+1]
Tags must be re-written in two places
• In younger issue queue entries
• In younger map table checkpoints: to rename future insns correctly
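A software analogue of the search-replace over both structures might look like this. It is only a sketch: the `age` field with "larger means younger" and the dictionary layouts are assumptions, and `rewrite_tags` is a hypothetical name.

```python
def rewrite_tags(issue_queue, checkpoints, rewrites, branch_age):
    """Search-replace source tags in entries younger than the branch.

    Returns the names of issue queue entries whose tags were rewritten.
    """
    touched = []
    for entry in issue_queue:
        if entry["age"] > branch_age:
            new_srcs = [rewrites.get(t, t) for t in entry["srcs"]]
            if new_srcs != entry["srcs"]:
                entry["srcs"] = new_srcs
                touched.append(entry["name"])
    for ckpt in checkpoints:
        if ckpt["age"] > branch_age:
            # Future insns renamed from this checkpoint must see the new tags.
            ckpt["map"] = {r: rewrites.get(t, t) for r, t in ckpt["map"].items()}
    return touched


iq = [{"name": "F", "age": 3, "srcs": ["p2"]}, {"name": "G", "age": 4, "srcs": ["p4"]}]
ckpts = [{"age": 5, "map": {"r1": "p1", "r2": "p2", "r3": "p3"}}]
print(rewrite_tags(iq, ckpts, {"p2": "p6"}, branch_age=0))  # ['F']
print(ckpts[0]["map"]["r2"])                                 # p6
```

Only F's tag is rewritten directly; G is caught later through its dependence on F.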
Basic Tag Rewriting Approach
Observe: tag rewriting hardware (mostly) exists
• But used for different purposes: rename, dispatch, wakeup
Exploit: borrow existing hardware
• Stop the pipeline for a few cycles
• Walk changed registers & tag rewrite
• Restart the pipeline with correct dependences linked
Tag Rewriting Hardware
[Figure: issue queue entries with ready bits (r), physical tags (ptag) and age tags; dispatch tags/ready bits are written in, wakeup tags are matched by comparators]
Issue queue
• Existing: wakeup match = “search”, dispatch write = “replace”
• Some additional logic may be necessary (age tags)
Map table checkpoints
• Some additional hardware here (but not associative search)
• See paper
CIDD Re-Dispatch
[Figure: pipeline with ROB, map table, issue queue, regfile, exec and ready bits; where do issued CIDD insns re-dispatch from?]
So far: tag rewriting for insns in issue queue
• ROB-size issue queue? Segmented/pipelined? [Hrishikesh+, ISCA’02]
• No, slows down common-case wakeup/select
Now: conventional issue queue, issued insns leave as usual
• CIDD insns re-dispatch from someplace
• That place itself must support tag rewriting
CIDD Re-Dispatch
[Figure: pipeline with ROB, map table, issue queue, regfile, exec and ready bits, plus a re-dispatch queue]
Ginger: a ROB-sized re-dispatch queue
• Internal wakeup/select re-dispatch loop
  • Separate from issue wakeup/select
• Supports tag rewriting to identify initial re-dispatch wave
• Transitively identifies minimal dependent slice for re-dispatch
• Segmented/pipelined and “half-bandwidth” → slow
  • Only 2% of insns re-dispatch → slow is fine
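The transitive slice selection can be modeled as a fixed-point wakeup loop over physical tags. This sketch ignores timing, segmentation and bandwidth; `redispatch_slice` and the entry layout are assumptions.

```python
def redispatch_slice(queue, initial_wave):
    """Return, in queue order, the insns to re-dispatch.

    `initial_wave` holds the insns whose tags were just rewritten; their
    destination tags are broadcast until no further entry wakes up.
    """
    selected = set(initial_wave)
    progress = True
    while progress:
        progress = False
        broadcast = {e["dest"] for e in queue if e["name"] in selected}
        for e in queue:
            if e["name"] not in selected and broadcast & set(e["srcs"]):
                selected.add(e["name"])
                progress = True
    return [e["name"] for e in queue if e["name"] in selected]


queue = [
    {"name": "E", "dest": "p3", "srcs": ["p1"]},
    {"name": "F", "dest": "p4", "srcs": ["p6"]},
    {"name": "G", "dest": "p5", "srcs": ["p4"]},
]
print(redispatch_slice(queue, ["F"]))  # ['F', 'G']
```

E never matches a broadcast tag, so it is left alone; only the dependent slice F, G is re-dispatched.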
CIDD Loads
  A: bez r1, D
  B: r2=1
  C: jmp E
  D: st(r1)=2
  E: r3=r1+1
  F: r4=r2+1
  G: r5=ld(r1)
CIDD loads: depend (via memory) on CD stores
• How are these identified when CD stores inserted/removed?
SQIP (store queue index prediction) [Sha+ MICRO’05]
• Solution for large LSQ
• Makes store-load forwarding act like register communication
• Supports “store tag rewriting”
SQIP and Store Tag Rewriting
  A: bez p1, D
  D: st(p1)=1, @6
  E: p3=p1+1
  F: p4=p2+1
  G: p5=ld(p1)  →  G: p5=ld(p1), 6
[Figure: store map table and forwarding predictor indexed by PCs C through G; entry D holds SQ index 6, load G is predicted to forward from store D]
15 second introduction to SQIP
• Store map table: store-PC → SQ index
• Forwarding predictor: load-PC → store-PC
• Load G → store D → SQ index 6
• Load G’s second register tag is 6
• Load G indexes SQ at position 6
Store tag rewriting
• Checkpoint & walk store map table
• Search-and-replace old-SQ-index → new-SQ-index
• Re-dispatch load if SQ-index tag has changed
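Store tag rewriting mirrors the register case, with SQ indices playing the role of physical tags. In this sketch, dictionaries stand in for the store map table and the forwarding predictor, and a store that disappears from the correct path maps its old index to `None` (no predicted forwarding); that handling is an assumption of the sketch.

```python
def rename_load(load_pc, forwarding_predictor, store_map):
    """Return the SQ index tag for a load, or None if no forwarding is predicted."""
    store_pc = forwarding_predictor.get(load_pc)
    return store_map.get(store_pc)


def store_tag_rewrite(loads, old_store_map, new_store_map):
    """Search-replace SQ-index tags; return names of loads that must re-dispatch."""
    rewrites = {}
    for pc, old_idx in old_store_map.items():
        new_idx = new_store_map.get(pc)
        if new_idx != old_idx:
            rewrites[old_idx] = new_idx
    redispatch = []
    for load in loads:
        if load["sq_tag"] in rewrites:
            load["sq_tag"] = rewrites[load["sq_tag"]]
            redispatch.append(load["name"])
    return redispatch


predictor = {"G": "D"}        # load G is predicted to forward from store D
wrong_path_map = {"D": 6}     # store D occupies SQ index 6 on the wrong path
g = {"name": "G", "sq_tag": rename_load("G", predictor, wrong_path_map)}
print(g["sq_tag"])                                   # 6
print(store_tag_rewrite([g], wrong_path_map, {}))    # ['G']
print(g["sq_tag"])                                   # None
```

On the correct path store D is removed, so G's tag changes and G is re-dispatched.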
Inserting CD Instructions
  A: bez p1, D
  D: p2=2                        } convergence distance: here 2 insns
  E: r3=r1+1  →  E: p3=p1+1
  F: r4=r2+1  →  F: p4=p2+1
  G: r5=ld(r4)  →  G: p5=ld(p4)
Ginger uses proactive resource management (a la Skipper)
• Not the same as proactive ooo renaming
• Predict convergence distance
• Reserve ROB, LSQ, and physical registers for them
• Simplifies CD insn insertion
• Simplifies commit and recovery, avoids resource deadlocks
• Keeps CI stores in SQ positions: minimizes store tag rewriting
• Reduces window utilization, but still better than non-CI
Outline
Control Independence (CI) and out-of-order renaming
Prior CI microarchitectures (ooo renaming schemes)
Ginger
Acronym pop quiz
Comparative performance evaluation
Conclusion
Experimental Methodology
Goal: compare ooo renaming schemes
• Re-implemented “Walker”, Skipper
• All things equal other than ooo renaming
• Paper also has selective branch recovery (SBR) [Gandhi+ HPCA’04]
Simulated configuration
• 4-way fetch/issue/commit, 21-stage pipe, 512 ROB, 64 issue queue
• 32KB hybrid gShare, 8KB confidence predictor
• 2-way, 8-stage re-dispatch, 16 checkpoints
• Statically computed convergence PCs & distances
• CI for branches with confidence <95%, convergence distance <256
Benchmarks: SPECint2000, MediaBench, CommBench
• Gmeans over entire suite
Before We Start: Ideal CI
Ideal CI: instantaneous, zero bandwidth ooo renaming
• Not a CI limit study in any other sense
• 95% confidence, 256 convergence distance limits apply
Mis-predictions CI’ed: 55%
Speedups: 8% SPECint, 14% Comm, 16% Media
• Perfect branch prediction provides higher speedups
Comparative Performance: Ginger
Mis-predictions CI’ed: 53%
Speedups: 5% SPECint, 11% Comm, 12% Media
• Ooo renaming overhead of tag rewriting is low: ~3%
Comparative Performance: Walker
Mis-predictions CI’ed: 56%
• Exploits more CI opportunities: 1 checkpoint per CI, not 2
Speedups: 1% SPECint, 7% Comm, 5% Media
• High rename/dispatch bandwidth overhead
Comparative Performance: Skipper
Mis-predictions CI’ed: 29%
• Penalty on correct prediction → possible slowdowns
• Limits benefit to very low confidence branches (<80%)
• In turn, limits CI opportunities
Speedups: -1% SPECint, 8% Comm, 9% Media
More Insight: Dispatch Bandwidth
[Chart: dispatch bandwidth breakdown, vpr (SPECint)]
Dispatch bandwidth: limits commit bandwidth
• Overhead: slot spent on anything other than committing insn
Non-CI processor overheads
• Squashed insns/fetch refill stalls: big components
• Full window stalls: smaller, partially due to mis-predictions
More Insight: Dispatch Bandwidth
[Chart: dispatch bandwidth breakdown, vpr (SPECint)]
Effect of ideal CI
• Reduces squashed insns: CI insns
• Reduces fetch refill stalls: don’t squash front-end insns, dispatch
• Increases full window stalls: space reservation, higher utilization
• Some low overhead for CIDD re-dispatch: ~2%
More Insight: Dispatch Bandwidth
[Chart: dispatch bandwidth breakdown, vpr (SPECint)]
Effect of realistic CI
• Some additional ooo renaming overhead: tag rewrites, pmoves
• Additional inefficiencies and limitations
More Insight: Dispatch Bandwidth
[Chart: dispatch bandwidth breakdown with tag rewriting component, vpr (SPECint)]
Ginger
• Low ooo renaming overhead: few other inefficiencies
More Insight: Dispatch Bandwidth
[Chart: dispatch bandwidth breakdown, vpr (SPECint)]
• Walker: high ooo renaming bandwidth overhead
• Skipper: very high ooo renaming bandwidth overhead
  • Restricted to very low confidence branches
Conclusions
Control independence (CI)
• Complements improvements in predictor accuracy
• Ooo renaming: most important feature, should be:
  • Low-overhead on mis-prediction
  • No overhead on correct prediction (“reactive”)
Ginger: new reactive CI microarchitecture
• Out-performs previous schemes: “Walker”, Skipper
• Tag rewriting: new ooo renaming scheme
  • Uses (largely) existing hardware
  • Supports ooo memory renaming too
• New re-dispatch mechanism: potentially useful beyond CI
Selective Branch Recovery [Gandhi+, HPCA’04]
  A: beqz p1, D
  D: p2 = 2  →  D: p2 = p9    (transform to “pmove”, re-dispatch)
  E: p3 = p1+1
  F: p4 = p2+1                (re-dispatch)
  G: p5 = ld(p4)              (re-dispatch)
Ooo renaming: annul wrong-path CD instructions
• Transform wrong-path CD insns to pmoves (in place)
• Re-dispatch them and CIDD insns (from recovery buffer)
• Limited applicability: can remove CD instructions, but not insert
• Exact convergence: works for “if-then”, not “if-then-else”
Comparative Performance: SBR
Mis-predictions CI’ed: 26%
• Inability to insert CD insns limits CI opportunities
Speedups: 0% SPECint, 5% Comm, 3% Media
• CD to pmove transform adds latency → possible slowdowns