Advanced Techniques in Branch Predictor Design for Enhanced Performance

TAGE-SC-L Again, MTAGE-SC. André Seznec, INRIA/IRISA

Where do these predictors come from? • GEHL: CBP 2004, ISCA 2005 • TAGE: JILP 2006, CBP 2006 • Statistical correlation: CBP 2011 • Combining more info: Micro 2011, CBP 2014, Micro 2015 • Optimizing everything: CBP 2016 • Unlimited: CBP 2014, CBP 2016

Around 2002 • Introduction of the perceptron predictor (Jimenez01) • State of the art: EV8 predictor • Lagging behind the perceptron on a few benchmarks • + with EV8-like: • some applications would benefit from 100+ history bits • Both able to handle « long » global histories: 30+ branches

CBP 2004 GEOMETRIC HISTORY LENGTH PREDICTOR

A multiple-length global history predictor • [Figure: tables T0 to T4, used with global history lengths L(0) to L(4), feed an adder (Σ).] • With a limited number of tables

Underlying idea • H and H’ are two history vectors equal on N bits, but differing on bit N+1 • e.g. L(1) ≤ N < L(2) • Branches (A,H) and (A,H’) are biased in opposite directions • Table T2 should make it possible to discriminate between (A,H) and (A,H’)

GEometric History Length predictor • The set of history lengths forms a geometric series, e.g. {0, 2, 4, 8, 16, 32, 64, 128} • What is important: L(i)-L(i-1) is drastically increasing • Spends most of the storage on short histories!
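A minimal Python sketch of how such a series can be generated, assuming the form L(i) = round(α^(i-1) · L(1)) with L(0) = 0. The function name and its parameters are illustrative and do not come from the slides.

```python
def geometric_history_lengths(n_tables, l_min, l_max):
    """Return [L(0), ..., L(n_tables-1)] with L(0) = 0, L(1) = l_min
    and L(n_tables-1) = l_max, growing geometrically in between."""
    if n_tables < 3:
        raise ValueError("need at least three tables")
    # alpha is the common ratio that makes the last length hit l_max.
    alpha = (l_max / l_min) ** (1.0 / (n_tables - 2))
    lengths = [0]
    for i in range(1, n_tables):
        lengths.append(int(l_min * alpha ** (i - 1) + 0.5))
    return lengths


lengths = geometric_history_lengths(8, 2, 128)
# Gaps between consecutive lengths: they grow with i, so most tables
# (and most of the storage) cover short histories.
gaps = [b - a for a, b in zip(lengths, lengths[1:])]
print(lengths, gaps)
```

With 8 tables, L(1) = 2 and L(7) = 128, this reproduces the series shown on the slide.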

GEHL (CBP 2004) • Neural inspired • Use of 200+ bits of global history • Narrow counters • Dynamic threshold update
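The four points above can be sketched as a toy predictor: the prediction is the sign of a sum of narrow signed counters, one per table, and counters are updated on a misprediction or when the sum is within a threshold that adapts at run time. This is a simplified illustration, not the CBP 2004 code: the index hash, the table sizes, the initial threshold, and the floor on the threshold are all assumptions.

```python
class TinyGEHL:
    """Toy GEHL-style predictor (illustrative sizes and hash)."""

    def __init__(self, lengths=(0, 2, 4, 8), log_size=10, ctr_bits=4):
        self.lengths = list(lengths)
        self.log_size = log_size
        self.mask = (1 << log_size) - 1
        self.cmax = (1 << (ctr_bits - 1)) - 1   # +7 for 4-bit counters
        self.cmin = -(1 << (ctr_bits - 1))      # -8 for 4-bit counters
        self.tables = [[0] * (1 << log_size) for _ in self.lengths]
        self.theta = len(self.lengths)  # update threshold (assumed start value)
        self.tc = 0                     # counter steering the threshold
        self.hist = 0                   # global history, newest outcome in bit 0

    def _index(self, pc, i):
        # Hypothetical hash: fold the L(i) most recent history bits onto the PC.
        h = self.hist & ((1 << self.lengths[i]) - 1)
        x = pc ^ (i * 0x9E3779B1)
        while h:
            x ^= h
            h >>= self.log_size
        return x & self.mask

    def _sum(self, pc):
        idx = [self._index(pc, i) for i in range(len(self.lengths))]
        return idx, sum(self.tables[i][j] for i, j in enumerate(idx))

    def predict(self, pc):
        return self._sum(pc)[1] >= 0

    def update(self, pc, taken):
        """Train on the resolved outcome; return the prediction that was made."""
        idx, s = self._sum(pc)
        pred = s >= 0
        # Update only on a misprediction or a low-magnitude sum.
        if pred != taken or abs(s) <= self.theta:
            for i, j in enumerate(idx):
                c = self.tables[i][j] + (1 if taken else -1)
                self.tables[i][j] = max(self.cmin, min(self.cmax, c))
        # Dynamic threshold: raise it after many mispredictions, lower it
        # after many correct low-magnitude predictions.
        if pred != taken:
            self.tc += 1
            if self.tc >= 7:
                self.theta += 1
                self.tc = 0
        elif abs(s) <= self.theta:
            self.tc -= 1
            if self.tc <= -8:
                self.theta = max(1, self.theta - 1)  # floor is a simplification
                self.tc = 0
        self.hist = ((self.hist << 1) | int(taken)) & ((1 << max(self.lengths)) - 1)
        return pred


gehl = TinyGEHL()
outcomes = [i % 2 == 0 for i in range(500)]      # alternating branch
correct = [gehl.update(0x400, t) == t for t in outcomes]
print(sum(correct[-100:]), "correct out of the last 100")
```

The alternating branch is not predictable by the history-length-0 table alone; the tables indexed with history learn it.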

TAgged GEometric history length predictor JILP 2006 TAGE

At CBP 2004, all predictors were neural apart from the PPM-like predictor (Michaud 2004), but .. its update policy was poor

TAGE (JILP 2006) • Partial tag match • almost .. • Geometric history length • Very effective update policy

TAGE: tagged tables, and prediction by the longest-history matching entry • [Figure: a tagless base predictor indexed with pc, plus tagged tables indexed with pc and h[0:L1], h[0:L2], h[0:L3]; each tagged entry holds ctr, tag and u; tag comparisons (=?) select the prediction.]

[Figure: tag-match example across the tagged tables (one miss, two hits), showing which entries provide Pred and Altpred.]

Prediction computation • General case: • Longest matching component provides the prediction (Pred) • Special case: • Many mispredictions on newly allocated entries (weak Ctr) • On many applications, Altpred is more accurate than Pred for such entries • Property dynamically monitored through 4-bit counters
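The selection rule above can be sketched as follows. This is a minimal illustration that assumes 3-bit signed counters in [-4, 3] (taken when >= 0) and a single 4-bit signed monitoring counter, here called `use_alt_on_na`; the function names and the exact update rule of the monitor are assumptions.

```python
WEAK = (-1, 0)  # weak states of a 3-bit signed counter in [-4, 3]


def tage_predict(base_pred, hits, use_alt_on_na):
    """hits[i] is the counter of the matching entry in tagged table i
    (tables ordered by increasing history length), or None on a tag miss.
    Returns (final prediction, provider table index or None, altpred)."""
    matching = [i for i, ctr in enumerate(hits) if ctr is not None]
    if not matching:
        return base_pred, None, base_pred
    provider = matching[-1]                       # longest matching history
    provider_pred = hits[provider] >= 0
    # Altpred: next longest match, or the tagless base predictor.
    altpred = hits[matching[-2]] >= 0 if len(matching) > 1 else base_pred
    # Weak counter (likely a newly allocated entry): trust Altpred if the
    # monitor says Altpred has been the better choice in that situation.
    if hits[provider] in WEAK and use_alt_on_na >= 0:
        return altpred, provider, altpred
    return provider_pred, provider, altpred


def update_use_alt_on_na(use_alt_on_na, provider_ctr, provider_pred, altpred, taken):
    """Train the 4-bit monitor only when the weak provider and Altpred disagree."""
    if provider_ctr in WEAK and provider_pred != altpred:
        step = 1 if altpred == taken else -1
        use_alt_on_na = max(-8, min(7, use_alt_on_na + step))
    return use_alt_on_na


# Table 2 hits with a weak not-taken counter, table 1 hits with a strong taken one.
print(tage_predict(False, [None, 3, -1], use_alt_on_na=0))
print(tage_predict(False, [None, 3, -1], use_alt_on_na=-1))
```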

A tagged table entry Tag U Ctr • Ctr: 3-bit prediction counter • U: 1 or 2-bit counters • Was the entry recently useful ? • Tag: partial tag

Allocate entries on mispredictions • Allocate entries in longer history length tables • On tables with U unset • Set Ctr to Weak and U to 0 • Limited storage budget: • Allocate 2 entries (when 15 to 20 different history lengths)
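A minimal sketch of this allocation policy, assuming each tagged table has already been indexed for the mispredicted branch. The dictionary layout and the function name are illustrative; real implementations add refinements (for example randomizing the starting table) that are left out here.

```python
def allocate_on_misprediction(entries, tags, provider, taken, max_alloc=2):
    """entries[i]: the entry selected in tagged table i for this branch, as a
    dict with keys 'tag', 'ctr', 'u'; tables ordered by increasing history
    length. tags[i]: the partial tag computed for table i.
    provider: index of the providing table, or -1 for the base predictor.
    Returns the list of tables where an entry was allocated."""
    allocated = []
    for i in range(provider + 1, len(entries)):   # longer histories only
        if len(allocated) == max_alloc:
            break
        entry = entries[i]
        if entry["u"] == 0:                       # only steal non-useful entries
            entry["tag"] = tags[i]
            entry["ctr"] = 0 if taken else -1     # weak taken / weak not-taken
            entry["u"] = 0
            allocated.append(i)
    return allocated


entries = [{"tag": 0, "ctr": 2, "u": u} for u in (0, 1, 0, 0)]
where = allocate_on_misprediction(entries, tags=[11, 12, 13, 14],
                                  provider=0, taken=False)
print(where, entries)
```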

Managing the (U)seful counter • Increment when the entry avoids a misprediction • (Pred = taken) & (Altpred ≠ taken): the entry becomes « useful » • Global decrement when it becomes « difficult » to allocate: • Many possible heuristics (« difficult » ≈ 2/3 of the entries useful) • CBP 2016 heuristics: ≈ 0.5 % MPKI
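One of the "many possible heuristics" can be sketched as below: a tick counter that rises on failed allocations, falls on successful ones, and triggers a global decrement of the U counters when it saturates. This is not the CBP 2016 heuristic; the class name, the limit, and the 2-bit U width are assumptions for illustration.

```python
def update_useful(entry, provider_pred, altpred, taken, u_max=3):
    """Mark the providing entry as useful when it was right and Altpred was wrong."""
    if provider_pred == taken and altpred != taken:
        entry["u"] = min(u_max, entry["u"] + 1)


class AllocationMonitor:
    """Detects that allocation has become difficult (hypothetical heuristic)."""

    def __init__(self, limit=256):
        self.tick = 0
        self.limit = limit

    def record(self, n_allocated, n_failed, tables):
        """Call after each allocation attempt; returns True when a global
        decrement of all U counters was applied."""
        self.tick = max(0, self.tick + n_failed - n_allocated)
        if self.tick < self.limit:
            return False
        for table in tables:
            for entry in table:
                entry["u"] = max(0, entry["u"] - 1)
        self.tick = 0
        return True


tables = [[{"u": 2} for _ in range(4)] for _ in range(3)]
monitor = AllocationMonitor(limit=4)
first = monitor.record(n_allocated=1, n_failed=2, tables=tables)   # tick = 1
second = monitor.record(n_allocated=0, n_failed=3, tables=tables)  # tick = 4
print(first, second, tables[0][0])
```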

TAGE vs GEHL: • At equal sizes: ≈ 10 % MPKI reduction May vary with individual benchmarks !

Optimizations for CBP 2016 • Sharing storage space • Small hist. sharing a bank-interleaved table • Small tag (8 bits) • Long hist. sharing a bank-interleaved table • Longer tag (12 bits) • Partial associativity • 2 banks for medium hist. lengths • ≈ 2 % MPKI reduction

Statistical Corrector (Global history) CBP2011 TAGE + (G)SC

From CBP 2011, « the Statistical Corrector targets » • Branches with poor correlation with history: • Sometimes better predicted by a single wide PC-indexed counter than by TAGE • More generally, track cases such that: • « For this (PC, history, prediction), TAGE is likely (> 50 %) to mispredict » statistically

TAGE-GSC (CBP 2011; was named a posteriori in Micro 2015) • ≈ 3-5 % MPKI red. • [Figure: the (main) TAGE predictor, fed with PC + global history, passes its prediction + confidence to the Stat. Cor., which also uses PC + global history.] • Just a global hist. neural predictor + tables indexed with PC, TAGE pred. and confidence

Confidence for TAGE (HPCA 2011) • The value of the counter providing the prediction: • Saturated = high confidence • Intermediate = medium confidence • Weak = low confidence
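This classification can be sketched in a few lines, assuming a 3-bit signed prediction counter in [-4, 3]; the function name and the string labels are illustrative.

```python
def tage_confidence(ctr, ctr_bits=3):
    """Map the providing counter to a confidence class.
    abs(2*ctr + 1) is the counter's distance from the taken/not-taken
    boundary: 1 for the two weak states, 2**ctr_bits - 1 when saturated."""
    strength = abs(2 * ctr + 1)
    if strength == (1 << ctr_bits) - 1:
        return "high"
    if strength == 1:
        return "low"
    return "medium"


for ctr in range(-4, 4):
    print(ctr, tage_confidence(ctr))
```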

Why does it work • The bias tables indexed with PC + TAGE outputs: • Correct (most of the time) • High counter value • Dominates, not many updates • Wrong • Other counters can be trained • (Statistical) correlation (if it exists) can be captured

Optimizations for CBP 2016 • Use TAGE confidence for indexing SC ≈ 1 % MPKI red. • On (very) low SC confidence: • May use TAGE prediction (if high conf, ..) ≈ 0.4 % MPKI red.
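A minimal sketch of the two ideas above: folding the TAGE prediction and confidence into the index of a bias table, and falling back to a high-confidence TAGE prediction when the corrector's own sum is very small. The hash, the threshold value, and the function names are assumptions, not the CBP 2016 code.

```python
CONF_LEVEL = {"low": 0, "medium": 1, "high": 2}


def bias_index(pc, tage_pred, tage_conf, log_size=10):
    """Hypothetical index for a bias table: PC combined with the TAGE
    prediction and confidence, so each combination trains its own counter."""
    key = (pc << 3) | (CONF_LEVEL[tage_conf] << 1) | int(tage_pred)
    return (key ^ (key >> log_size)) & ((1 << log_size) - 1)


def final_prediction(tage_pred, tage_conf, sc_counters, low_threshold=4):
    """sc_counters: signed counters read from the SC tables for this branch."""
    s = sum(2 * c + 1 for c in sc_counters)   # centered sum, never zero-biased
    sc_pred = s >= 0
    if sc_pred == tage_pred:
        return tage_pred
    # SC disagrees with TAGE: on very low SC confidence, keep TAGE if it
    # is itself highly confident; otherwise let SC revert the prediction.
    if abs(s) < low_threshold and tage_conf == "high":
        return tage_pred
    return sc_pred


print(final_prediction(True, "high", [-1, -1, 0]))   # weak disagreement
print(final_prediction(True, "low", [-1, -1, 0]))
print(final_prediction(True, "high", [-4, -4, -3]))  # strong disagreement
```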

The beauty of neural predictors Micro 2011, CBP 2014, Micro 2015 TAGE-SC

From Compaq in 1999 • OK, I cheated with loops • I learnt: • Use global history • Avoid local history • Did manage to submit only global history at CBP 2004, 2006 and 2011

Speculative history must be managed !? • Local history: • table of histories (unspeculatively updated) • must maintain a speculative history per inflight branch: • Associative search, etc ?!? • Global history: • Append a bit on a single history register • Use of a circular buffer and just a pointer to speculatively manage the history
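The global-history case can be sketched as follows: outcomes are appended to a circular buffer, a checkpoint is just the pointer value, and recovery restores the pointer and appends the resolved outcome. The class and method names are illustrative; the sketch assumes the buffer is larger than the longest history plus the number of in-flight branches.

```python
class SpeculativeGlobalHistory:
    """Global history kept in a circular buffer managed with one pointer."""

    def __init__(self, size=1024):
        self.buf = [0] * size
        self.size = size
        self.ptr = 0            # count of bits appended along the current path

    def checkpoint(self):
        """State to save with each in-flight branch: just the pointer."""
        return self.ptr

    def push(self, taken):
        """Speculatively append one predicted outcome."""
        self.buf[self.ptr % self.size] = int(taken)
        self.ptr += 1

    def recover(self, ckpt, actual_taken):
        """On a misprediction: drop the wrong-path bits, append the real outcome."""
        self.ptr = ckpt
        self.push(actual_taken)

    def recent(self, n):
        """The n most recent outcomes, newest first."""
        return [self.buf[(self.ptr - 1 - k) % self.size] for k in range(n)]


h = SpeculativeGlobalHistory(size=16)
for bit in (1, 0, 1):
    h.push(bit)
ckpt = h.checkpoint()     # taken before predicting branch X
h.push(1)                 # X predicted taken (wrongly)
h.push(0)                 # younger wrong-path branches
h.push(0)
h.recover(ckpt, 0)        # X resolves not-taken
print(h.recent(4))
```

No per-branch copy of the history is needed, which is the contrast with the local-history case on the slide.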

Would not have won CBP 2014 without using local history

How to use local histories with TAGE+(G)SC • Add the local history tables in the neural SC • as in the perceptron [Jimenez2002] ≈ 0.9 % MPKI reduction with 2Kbits on the 8KB predictor ≈ 2.5 % MPKI reduction with 28Kbits on the 64KB predictor I DO NOT ADVOCATE FOR LOCAL HISTORIES IN REAL HARDWARE PROCESSORS

The beauty of neural predictors • TAGE-SC: • Just the right framework to test information vectors • Add extra tables: some benefit ! continue to explore

Can add extra components in SC • IMLI-based components (Micro 2015) • Capture correlation in multidimensional loops • Very disappointing results: essentially no benefit on CBP5 traces • Other forms of history: • E.g. only backward branches
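Both extra information vectors are cheap to maintain. The sketch below assumes the IMLI counter is incremented on a taken backward conditional branch and reset on a not-taken one, and that the backward-only history simply ignores forward branches; the function names, widths, and saturation value are illustrative.

```python
def update_imli(imli_count, is_backward, taken, max_count=1023):
    """Inner-most-loop-iteration counter (assumed update rule)."""
    if not is_backward:
        return imli_count              # forward branches leave it unchanged
    return min(max_count, imli_count + 1) if taken else 0


def update_backward_history(hist, is_backward, taken, length=16):
    """Global history restricted to backward branches (newest bit in bit 0)."""
    if not is_backward:
        return hist
    return ((hist << 1) | int(taken)) & ((1 << length) - 1)


# (is_backward, taken) for a short dynamic branch stream.
stream = [(True, True), (False, True), (True, True), (True, True),
          (False, False), (True, False)]
imli, bhist, trace = 0, 0, []
for is_backward, taken in stream:
    imli = update_imli(imli, is_backward, taken)
    bhist = update_backward_history(bhist, is_backward, taken)
    trace.append(imli)
print(trace, bin(bhist))
```

Either value can then be hashed with the PC to index an additional SC table.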

+ a loop predictor (just in case) TAGE-SC-L

Loop predictor • Can predict loop exit • for loops with large iteration numbers • regular number of iterations • Limited storage budget (a few entries) • But marginal benefit • I DO NOT ADVOCATE FOR LOCAL HISTORIES IN REAL HARDWARE PROCESSORS
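A single loop-predictor entry can be sketched as below, assuming the loop branch is taken to stay in the loop and not-taken to exit, and that the entry only predicts once the same trip count has been seen several times in a row. Field names and the confidence limit are illustrative; tagging and entry replacement are left out.

```python
class LoopEntry:
    """One entry of a toy loop predictor."""

    def __init__(self, conf_max=3):
        self.trip = 0        # taken count observed before the last exit
        self.cur = 0         # taken count since the last exit
        self.conf = 0        # number of consecutive identical trip counts
        self.conf_max = conf_max

    def predict(self):
        """Return None while not confident, else True (stay) / False (exit)."""
        if self.conf < self.conf_max:
            return None
        return self.cur < self.trip

    def update(self, taken):
        if taken:
            self.cur += 1
            return
        # Loop exit: confirm or relearn the trip count.
        if self.cur == self.trip:
            self.conf = min(self.conf_max, self.conf + 1)
        else:
            self.trip = self.cur
            self.conf = 0
        self.cur = 0


entry = LoopEntry()
one_run = [True] * 10 + [False]          # 10 iterations, then exit
last_run_preds = []
for run in range(6):
    for outcome in one_run:
        if run == 5:
            last_run_preds.append(entry.predict())
        entry.update(outcome)
print(last_run_preds == one_run)
```

An irregular trip count resets the confidence, so the entry stops overriding the main predictor.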

TAGE-SC-L summary for CBP-5 • Most of the budget on global hist. correlation: • TAGE with ≈ 1200 br. for 64 KB and ≈ 400 br. for 8 KB • optimize the storage sharing • optimize the allocation • Track the statistical correlation with a neural component: • use TAGE prediction AND confidence • incorporate other forms of history (even local history if you are trying to win CBP-5)

TAGE-SC-L is still far from the predictability limits MTAGE-SC

poTAGE-SC: the previous champion • poTAGE+COLT (Michaud2014) and TAGE-SC-L

poTAGE + COLT (Michaud2014) • [Figure: block diagram. TAGE predictors on global history, local history 1, local history 2, local history 3 and frequency; COLT selection, a (PC + 5 pred) indexed table.] • Use TAGE concept on other forms of hist.

Unlimited TAGE-SC • [Figure: block diagram. TAGE predictor on global history; Statistical Corrector with bias, GEHL, RHSP, other GEHL and perceptrons, ...; final chooser.]

poTAGE-SC • [Figure: block diagram. TAGE predictors on global history, local history 1, local history 2, local history 3 and frequency; COLT selection; Statistical Corrector with bias, GEHL, RHSP, other GEHL and perceptrons, ...; final chooser.]

MTAGE-SC • [Figure: block diagram. TAGE predictors on global history, local history 1, local history 2, local history 3, frequency and global backward history; TAGE prediction combiner; Statistical Corrector with bias, GEHL, RHSP, other GEHL and perceptrons, ...; final chooser.]

MTAGE-SC • ≈ 5 % MPKI reduction over poTAGE-SC • [Figure: block diagram. TAGE predictors on global history, local history 1, local history 2, local history 3, frequency and global backward history; TAGE prediction combiner; Statistical Corrector with bias, GEHL, RHSP, other GEHL and perceptrons, ...; final chooser.] • Final chooser: leverages confidence from SC and TAGE pred. combiner • TAGE prediction combiner: COLT pred + neural combination of outputs pred + confidence • Global backward history: to capture long path correlation, but eliminate intermediate branches • A few extra history forms: IMLI, ..

Seems that I am not making progress !! • CBP 2006 misp. rate: • 32KB L-TAGE ≈ 1.22 × GTL • CBP 2014 misp. rate: • 32KB TAGE-SC-L ≈ 1.40 × poTAGE-SC • CBP 2016 misp. rate: • 64KB TAGE-SC-L ≈ 1.55 × MTAGE-SC • Not the same traces, but ..

Conclusion • TAGE-SC-L fits limited storage sizes: • Most significant optimizations over CBP 2014: • Use of TAGE confidence as index for SC • Sharing and partial associativity • MTAGE-SC: • Predictability limits even (a little bit) further than previously expected
