Advanced Microarchitecture Lecture 4: Branch Predictors
Direction vs. Target
• Direction: 0 or 1
• Target: 32- or 64-bit value
• Turns out targets are generally easier to predict
  • Don't need to predict the NT target (it's just the next sequential address)
  • T target doesn't usually change, or has a "nice" pattern like subroutine returns
Branches Have Locality
• If a branch was previously taken, there's a good chance it'll be taken again in the future
for(i=0; i < 100000; i++) {
  /* do stuff */
}
This branch will be taken 99,999 times in a row.
Simple Predictor
• Always predict NT
  • no fetch bubbles (always just fetch the next line)
  • does horribly on the previous for-loop example
• Always predict T
  • does pretty well on the previous example
  • but what if you have other control besides loops?
p = calloc(num, sizeof(*p));
if(p == NULL)
  error_handler( );
This branch is practically never taken.
Last Outcome Predictor
• Do what you did last time: predict T after a taken outcome, N after a not-taken one
0xDC08: for(i=0; i < 100000; i++)
        {
0xDC44:   if( (i % 100) == 0 )
            tick( );
0xDC50:   if( (i & 1) == 1 )
            odd( );
        }
Misprediction Rates?
How often is the branch outcome != the previous outcome?
DC08: TTTTTTTTTTT ... TTTTTTTTTTNTTTTTTTTT ... (100,000 iterations)
  only the T→N and N→T transitions mispredict: 2 / 100,000 → 99.998% prediction rate
DC44: TTTTT ... TNTTTTT ... TNTTTTT ...
  2 / 100 → 98.0%
DC50: TNTNTNTNTNTNTNTNTNTNTNTNTNTNT ...
  2 / 2 → 0.0%
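A last-outcome predictor is just a table of single bits. A minimal C sketch (the 1024-entry table and shift-by-2 PC hash are illustrative assumptions, not from the lecture):

```c
/* A minimal last-outcome predictor: one bit per table entry. */
#include <stdint.h>

#define LOG2_N 10
static uint8_t last_outcome[1 << LOG2_N];

static uint32_t lo_index(uint64_t pc) {
    return (uint32_t)(pc >> 2) & ((1u << LOG2_N) - 1);
}

static int lo_predict(uint64_t pc) {
    return last_outcome[lo_index(pc)];      /* do what you did last time */
}

static void lo_update(uint64_t pc, int taken) {
    last_outcome[lo_index(pc)] = (uint8_t)(taken != 0);
}
```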
FSM for Last-Outcome Prediction: two states, 0 (predict NT) and 1 (predict T); each outcome moves the FSM to the matching state.
FSM for 2bC (2-bit Counter), a saturating two-bit counter: states 0 and 1 predict NT, states 2 and 3 predict T; a transition on a T outcome moves toward state 3, a transition on an NT outcome moves toward state 0.
Initial Training/Warm-up
Example (outcome stream T T T T T … T N T T T …):
1bC state: 0 1 1 1 1 … 1 1 0 1 1 … (mispredicts the N and the T right after it)
2bC state: 0 1 2 3 3 … 3 3 2 3 3 … (mispredicts only the N)
Only 1 mispredict per N branches now!
DC08: 99.999%   DC44: 99.0%
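A minimal C sketch of the 2bC FSM above, predicting from the counter's MSB and saturating at 0 and 3:

```c
/* Saturating two-bit counter: states 0-1 predict NT, 2-3 predict T. */
#include <stdint.h>

static int twobc_predict(uint8_t ctr) {
    return ctr >= 2;                               /* prediction = counter MSB */
}

static uint8_t twobc_update(uint8_t ctr, int taken) {
    if (taken)
        return (uint8_t)(ctr < 3 ? ctr + 1 : 3);   /* saturate at 3 */
    else
        return (uint8_t)(ctr > 0 ? ctr - 1 : 0);   /* saturate at 0 */
}
```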
Importance of Branches
• 98% → 99%: Whoop-Dee-Do!
• Actually, it's a 2% misprediction rate → 1%
• That's a halving of the number of mispredictions
• So what?
• If the misprediction rate equals 50%, and 1 in 5 insts is a branch, then the number of useful instructions that we can fetch is: 5×(1 + ½ + (½)^2 + (½)^3 + … ) = 10
• If we halve the miss rate down to 25%: 5×(1 + ¾ + (¾)^2 + (¾)^3 + … ) = 20
• Halving the miss rate doubles the number of useful instructions that we can try to extract ILP from
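The two sums are geometric series, so the arithmetic checks out as:

$$5\sum_{k=0}^{\infty}\left(\tfrac{1}{2}\right)^{k} = \frac{5}{1-\tfrac{1}{2}} = 10
\qquad\qquad
5\sum_{k=0}^{\infty}\left(\tfrac{3}{4}\right)^{k} = \frac{5}{1-\tfrac{3}{4}} = 20$$

Each factor of (1 − misprediction rate) is the probability that the fetch stream survives one more branch, with 5 instructions fetched per surviving block.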
… back to predictors
Typical Organization of a 2bC Predictor
The PC (32 or 64 bits) is hashed down to log₂ n bits to index a table of n entries/counters; the selected counter supplies the prediction, and FSM update logic updates the table entry when the actual outcome is known.
Typical Hash
• Just take the log₂ n least significant bits of the PC
• May need to ignore a few bits
• In a 32-bit RISC ISA, all instructions are 4 bytes wide, and all instruction addresses are 4-byte aligned → the two least-significant bits of the PC are always zero, so they are not included
  • equivalent to right-shifting the PC by two positions before hashing
• In a variable-length CISC ISA (ex. x86), instructions may start on arbitrary byte boundaries
  • probably don't want to shift
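A minimal sketch of this hash for the fixed-width RISC case (the 1024-entry table size is an illustrative assumption):

```c
/* Index hash for a fixed-width (4-byte) RISC ISA: drop the two
 * always-zero low bits, then keep log2(n) bits. */
#include <stdint.h>

#define LOG2_N 10   /* n = 1024 counters */

static uint32_t pht_index(uint64_t pc) {
    return (uint32_t)(pc >> 2) & ((1u << LOG2_N) - 1);
}
/* For a variable-length CISC ISA, omit the ">> 2". */
```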
How about the Branch at 0xDC50?
• 1bC and 2bC don't do too well (50% at best)
• But it's still obviously predictable
• Why? It has a repeating pattern: (NT)*
• How about other patterns, like (TTNTN)*?
• Use branch correlation: the outcome of a branch is often related to previous outcome(s)
Idea: Track the History of a Branch
Each predictor entry holds the branch's previous outcome plus two counters: one used when prev=0 and one used when prev=1. Stepping through the alternating (NT)* branch, the counter for prev=0 trains up to 3 (prediction = T when prev = 0) while the counter for prev=1 trains down to 0 (prediction = N when prev = 1), so after warm-up every prediction is correct.
Deeper History Covers More Patterns
• What pattern has this branch predictor entry learned?
With the last 3 outcomes as history, the entry holds one counter per history value (counter if prev=000, counter if prev=001, …, counter if prev=111). Reading off the trained counters: 001 → 1; 011 → 0; 110 → 0; 100 → 1, i.e. the entry has learned 00110011001… = (0011)*
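A minimal sketch of one such entry, with an h-bit per-branch history selecting among 2^h two-bit counters (h = 3 as on the slide):

```c
/* One local-history entry: the history bits pick the counter. */
#include <stdint.h>

#define H 3

typedef struct {
    uint8_t history;          /* last H outcomes, newest in bit 0 */
    uint8_t ctr[1 << H];      /* one 2-bit counter per pattern    */
} local_entry;

static int le_predict(const local_entry *e) {
    return e->ctr[e->history] >= 2;
}

static void le_update(local_entry *e, int taken) {
    uint8_t *c = &e->ctr[e->history];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    e->history = (uint8_t)(((e->history << 1) | (taken ? 1 : 0)) & ((1 << H) - 1));
}
```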
Predictor Organizations
Three ways to combine the PC hash, the history, and the counters:
• a different set of patterns (counters) for each branch
• one shared set of patterns for all branches
• a mix of both
Example (1)
• 1024 counters (2^10)
• 32 sets (2^5): a 5-bit PC hash chooses a set
• Each set has 32 counters: 32 × 32 = 1024
• History length of 5 (log₂ 32 = 5)
• Branch collisions: 1000s of branches collapsed into only 32 sets
Example (2)
• 1024 counters (2^10)
• 128 sets (2^7): a 7-bit PC hash chooses a set
• Each set has 8 counters: 128 × 8 = 1024
• History length of 3 (log₂ 8 = 3)
• Limited patterns/correlation: can now only handle a history length of three
Two-Level Predictor Organization
• Branch History Table (BHT): 2^a entries, h-bit history per entry (indexed by an a-bit PC hash)
• Pattern History Table (PHT): 2^b sets, 2^h counters per set; each entry is a 2-bit counter
• Total size in bits: h·2^a + 2·2^(b+h)
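A quick worked instance of the size formula, with hypothetical parameters a = 10, h = 4, b = 6 (not values from the lecture):

$$h \cdot 2^{a} + 2 \cdot 2^{b+h} = 4 \cdot 2^{10} + 2 \cdot 2^{6+4} = 4096 + 2048 = 6144\ \text{bits} = 768\ \text{bytes}$$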
Classes of Two-Level Predictors
• h = 0 or a = 0 (degenerate case): regular table of 2bCs (b = log₂ of the number of counters)
• h > 0, a > 1: "Local History" 2-level predictor
• h > 0, a = 1: "Global History" 2-level predictor
Global vs. Local Branch History
• Local behavior: what is the predicted direction of Branch A given the outcomes of previous instances of Branch A?
• Global behavior: what is the predicted direction of Branch Z given the outcomes of all* previous branches A, B, …, X and Y?
* the number of previous branches tracked is limited by the history length
Why Global Correlations Exist
• Example: related branch conditions
p = findNode(foo);
A: if ( p is parent )
     do something;
   do other stuff; /* may contain more branches */
B: if ( p is a child )
     do something else;
The outcome of the second branch (B) is always the opposite of the first branch (A).
Other Global Correlations
• Testing same/similar conditions
  • code might test for NULL before a function call, and the function might test for NULL again
  • in some cases it may be faster to recompute a condition than to save a previous computation in memory and re-load it
• Partial correlations: one branch could test for cond1, and another branch could test for cond1 && cond2 (if cond1 is false, then the second branch can be predicted as false)
• Multiple correlations: one branch tests cond1, a second tests cond2, and a third tests a combination of cond1 and cond2 (which can always be predicted if the first two branches are known)
A Global-History Predictor
A single global branch history register (BHR) of h bits is shared by all branches. The PC is hashed down to b bits and combined with the h-bit BHR to form a (b+h)-bit PHT index.
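A minimal C sketch of the BHR update and the concatenated (b+h)-bit index (the widths b = 6 and h = 4 are illustrative assumptions):

```c
/* Global-history indexing: b PC bits concatenated with the BHR. */
#include <stdint.h>

#define B 6
#define H 4

static uint32_t bhr;    /* single global branch history register */

static uint32_t pht_index(uint64_t pc) {
    uint32_t pc_bits = (uint32_t)(pc >> 2) & ((1u << B) - 1);
    uint32_t hist    = bhr & ((1u << H) - 1);
    return (pc_bits << H) | hist;           /* b+h index bits */
}

static void bhr_update(int taken) {
    bhr = (bhr << 1) | (taken ? 1u : 0u);   /* shift in newest outcome */
}
```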
Similar Tradeoff Between b and h
• For a fixed number of counters:
• Larger h → smaller b
  • longer history → able to capture more patterns, but longer warm-up/training time
  • smaller b → more branches map to the same set of counters → more interference
• Larger b → smaller h: just the opposite…
Motivation for Combined Indexing
• Not all 2^h "states" are used
  • (TTNN)* only uses half of the states for a history length of 3, and only ¼ of the states for a history length of 4
  • (TN)* only uses two states no matter how long the history length is
• Not all bits of the PC are uniformly distributed
• Not all bits of the history are uniformly likely to be correlated
  • more recent history → more likely to be strongly correlated
Combined Index Example: gshare
• S. McFarling (DEC-WRL TR, 1993)
• XOR a k-bit PC hash with k bits of global history to index the PHT, where k = log₂ of the number of counters
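A minimal gshare sketch in C (the 12-bit index, i.e. a 4096-counter PHT, is an illustrative assumption):

```c
/* gshare: XOR PC bits with global-history bits, index one table. */
#include <stdint.h>

#define K 12
static uint8_t  pht[1 << K];    /* 2-bit saturating counters */
static uint32_t ghist;          /* global history register   */

static uint32_t gshare_index(uint64_t pc) {
    return ((uint32_t)(pc >> 2) ^ ghist) & ((1u << K) - 1);
}

static int gshare_predict(uint64_t pc) {
    return pht[gshare_index(pc)] >= 2;      /* taken if MSB set */
}

static void gshare_update(uint64_t pc, int taken) {
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghist = (ghist << 1) | (taken ? 1u : 0u);
}
```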
Gshare example: insufficient history leads to a conflict.
Some Interference May Be Tolerable
• Branch A: always not-taken
• Branch B: always taken
• Branch C: TNTNTN…
• Branch D: TTNNTTNN…
Shared counters indexed by 3 bits of history:
000 → 0, 001 → 3, 010 → 3, 011 → 0, 100 → 3, 101 → 0, 110 → 0, 111 → 3
Even though the branches share entries, no two of them disagree on any history value they have in common.
And Then It Might Not
• Branch X: TTTNTTTN…
• Branch Y: TNTNTN…
• Branch Z: TTTT…
Now the shared counters conflict: histories 010, 011, and 110 still train to 3, but 101 is pulled both ways (X says T, Y says N) and 111 is pulled both ways (X says N, Z says T), so those entries thrash (shown as "?").
Interference-Reducing Predictors
• There are patterns and asymmetries in branches: not all patterns occur with the same frequency, and branches have biases
• This lecture:
  • Bi-Mode (Lee et al., MICRO '97)
  • gskewed (Michaud et al., ISCA '97)
• These are global-history predictors, but the ideas can be applied to other types of predictors
Gskewed Idea
• Interference occurs because two (or more) branches hash to the same index
• A different hash function can prevent this collision, but may cause other collisions
• Use multiple hash functions such that a collision can only occur in a few cases, and use a majority vote to make the final decision
Gskewed Organization
The PC and global history feed three different hash functions (hash1, hash2, hash3), each indexing its own table (PHT1, PHT2, PHT3); a majority vote (maj) over the three counters produces the prediction. The hashes are skewed so that if hash1(x) = hash1(y), then hash2(x) ≠ hash2(y) and hash3(x) ≠ hash3(y).
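A minimal sketch of the majority-vote organization; the rotation-based per-bank hashes here are illustrative stand-ins, not the actual skewing functions from Michaud et al.:

```c
/* gskewed: three banks, three hash functions, majority vote. */
#include <stdint.h>

#define K 10
static uint8_t  bank[3][1 << K];    /* three banks of 2-bit counters */
static uint32_t ghist;

static uint32_t skew_hash(int b, uint64_t pc) {
    uint32_t x = (uint32_t)(pc >> 2) ^ ghist;
    uint32_t r = (uint32_t)(5 * b + 3);     /* different rotation per bank */
    x = (x >> r) | (x << (32 - r));
    return x & ((1u << K) - 1);
}

static int gskewed_predict(uint64_t pc) {
    int votes = 0;
    for (int b = 0; b < 3; b++)
        votes += bank[b][skew_hash(b, pc)] >= 2;
    return votes >= 2;                      /* majority of three */
}
```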
Gskewed example: branches A and B collide in one PHT, but they map to different entries in the other two, so the majority vote masks the interference.
Combining Predictors
• Some branches exhibit local-history correlations (ex. loop branches)
• While others exhibit global-history correlations ("spaghetti logic", ex. if-elsif-elsif-elsif-else branches)
• Using a global-history predictor prevents accurate prediction of branches exhibiting local-history correlations, and vice versa
Tournament Hybrid Predictors
Two component predictors (Pred0 and Pred1) run in parallel, and a meta-predictor (a table of 2-/3-bit counters) selects between them: if the meta-counter MSB = 0, use Pred0's prediction, else use Pred1's.
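A minimal sketch of the selection logic plus the usual meta-counter training rule (train only when the components disagree); the table size is an illustrative assumption:

```c
/* Tournament selection: 2-bit meta-counters pick p0 or p1. */
#include <stdint.h>

#define M 10
static uint8_t meta[1 << M];

static uint32_t meta_index(uint64_t pc) {
    return (uint32_t)(pc >> 2) & ((1u << M) - 1);
}

static int choose(uint64_t pc, int p0, int p1) {
    return meta[meta_index(pc)] >= 2 ? p1 : p0;     /* MSB selects */
}

static void meta_update(uint64_t pc, int p0, int p1, int taken) {
    uint8_t *c = &meta[meta_index(pc)];
    if (p0 == p1) return;               /* nothing to learn if they agree */
    if (p1 == taken && *c < 3) (*c)++;  /* move toward the correct one */
    if (p0 == taken && *c > 0) (*c)--;
}
```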
Common Combinations
• Global history + local history
• "Easy" branches + global history (ex. 2bC and gshare)
• Short history + long history
• Many types of behavior, many combinations
Multi-Hybrids
• Why only combine two predictors?
Two organizations: a tree of tournaments, where M01 chooses between P0 and P1, M23 chooses between P2 and P3, and MM chooses between the two winners; or a single meta-predictor M that directly selects among P0–P3.
• Tradeoff between making good individual predictions (P's) vs. making good meta-predictions (M's): for a fixed hardware budget, improving one may hurt the other
Prediction Fusion
• Selection discards the information from n−1 predictors
• Fusion attempts to synthesize the information from all n predictors (P0–P3 all feed the final predictor M)
  • more info to work with
  • possibly more junk to sort through
Using Long Branch Histories
• Long global history provides more context for branch prediction/pattern matching: more potential sources of correlation
• Costs:
  • For a PHT-based approach, HW cost increases exponentially: O(2^h) counters
  • Training time increases, which may decrease overall accuracy
Predictor Training Time
• Ex: prediction equals the opposite of the 2nd-most-recent outcome
• Hist Len = 2 → 4 states to train:
  NN → T,  NT → T,  TN → N,  TT → N
• Hist Len = 3 → 8 states to train:
  NNN → T,  NNT → T,  NTN → N,  NTT → N,  TNN → T,  …
Neural Branch Prediction
• Uses the perceptron from classical machine-learning theory: the simplest form of a neural net (single layer, single node)
• Inputs are past branch outcomes
• Compute a weighted sum of the inputs: the output is a linear function of the inputs, and the sign of the output is used for the final prediction
Perceptron Predictor
(Figure) Each history bit is remapped from {0, 1} to an input xᵢ ∈ {−1, +1} and multiplied by its weight wᵢ; x₀ is a constant 1, so w₀ acts as a "bias" weight. An adder sums all the products, and the sign of the sum gives the prediction.
Perceptron Predictor (2)
• The magnitude of weight wᵢ determines how correlated branch i is to the current branch
• The sign of the weight determines positive or negative correlation
• Ex. the outcome is usually the opposite of the 5th-oldest branch: w₅ has a large magnitude (L), but is negative
  • if x₅ is taken, then w₅·x₅ = −L·1 = −L, which pushes the sum toward a negative value (a NT prediction)
  • if x₅ is not taken, then w₅·x₅ = −L·(−1) = L
Perceptron Predictor (3)
• When the actual branch outcome is known:
  • if xᵢ = outcome, then increment wᵢ (positive correlation)
  • if xᵢ ≠ outcome, then decrement wᵢ (negative correlation)
  • for x₀, increment if the branch is taken, decrement if NT
• "Done with training": if |Σ wᵢxᵢ| > θ, then don't update the weights unless mispredicted
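Putting the three slides together, a minimal perceptron predictor sketch (the table size, history length, and θ value are illustrative assumptions; θ ≈ 1.93·h + 14 is the rule of thumb from Jiménez & Lin's perceptron paper):

```c
/* Perceptron predictor: weighted sum of +/-1 history bits plus a
 * bias weight; the sign of the sum gives the prediction. */
#include <stdint.h>

#define H     16                   /* history length           */
#define N     256                  /* perceptrons in the table */
#define THETA 45                   /* training threshold       */

static int8_t w[N][H + 1];         /* w[i][0] is the bias weight */
static int    x[H + 1] = { 1 };    /* x[0] = 1; rest become +/-1 */

static int y_out(int i) {
    int y = 0;
    for (int j = 0; j <= H; j++) y += w[i][j] * x[j];
    return y;
}

static int p_predict(uint64_t pc) {
    return y_out((int)((uint32_t)(pc >> 2) % N)) >= 0;
}

static void p_train(uint64_t pc, int taken) {
    int i = (int)((uint32_t)(pc >> 2) % N);
    int y = y_out(i), t = taken ? 1 : -1;
    int mispred = (y >= 0) != (taken != 0);
    if (mispred || (y < 0 ? -y : y) <= THETA) {
        for (int j = 0; j <= H; j++) {
            int v = w[i][j] + t * x[j];  /* +/-1 step per weight */
            if (v >  127) v =  127;      /* saturate the weight  */
            if (v < -127) v = -127;
            w[i][j] = (int8_t)v;
        }
    }
    for (int j = H; j > 1; j--) x[j] = x[j - 1];   /* shift history */
    x[1] = t;
}
```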
Perceptron Trains Quickly
• If no correlation exists with branch i, then wᵢ just gets incremented and decremented back and forth: wᵢ → 0
• If correlation exists with branch j, then wⱼ is consistently incremented (or decremented) until it has a large influence on the overall sum
Linearly Inseparable Functions
• The perceptron computes a linear combination of its inputs, so it can only learn linearly separable functions
• (Figures, over two history bits xᵢ, xⱼ ∈ {−1, +1}) For a separable outcome pattern, weights exist that classify every corner, ex. f() = −3·xᵢ − 4·xⱼ − 5
• For the inseparable pattern shown, no values of wᵢ, wⱼ, w₀ exist to satisfy the outputs: no straight line separates the T's from the N's
Overall Hardware Organization
The PC hash selects one set of weights from a table of weights; the BHR supplies the inputs; multipliers form the products, and an adder sums them: prediction = sign(sum).
Size = (h+1)·k·n + h bits, plus Area(multipliers) + Area(adder)
(h = history length, k = counter width, n = number of perceptrons in the table)
GEHL
• GEometric History Length predictor: uses very long branch history
• Several tables of K-bit weights are indexed by the PC and histories of lengths L1, L2, L3, L4, …, where L(i) = a^(i−1) · L(1): the history lengths form a geometric progression (ex. with L(1) = 2 and a = 2, the lengths would be 2, 4, 8, 16)
• An adder sums the selected weights: prediction = sign(sum)
PPM Predictors
• PPM = Partial Pattern Matching (used in data compression)
• Idea: use the longest history necessary, but no longer
(Figure) A base 2bC table is indexed by the PC alone; additional 2bC tables with partial tags are indexed by the PC plus progressively longer histories h1 < h2 < h3 < h4 (most recent to oldest). Each tag comparison (=) steers a mux, so the final prediction comes from the longest-history table whose partial tag matches.
TAGE Predictor
• Similar to PPM, but uses geometric history lengths
• Currently the most accurate type of branch prediction algorithm
• References (www.jilp.org):
  • PPM: Michaud (CBP-1)
  • O-GEHL: Seznec (CBP-1)
  • TAGE: Seznec & Michaud (JILP)
  • L-TAGE: Seznec (CBP-2)