
Advanced Microarchitecture


Presentation Transcript


  1. Advanced Microarchitecture Lecture 4: Branch Predictors

  2. Direction vs. Target • Direction: 0 or 1 • Target: 32- or 64-bit value • Turns out targets are generally easier to predict • Don’t need to predict NT target • T target doesn’t usually change • or has “nice” pattern like subroutine returns Lecture 4: Correlated Branch Predictors

  3. Branches Have Locality • If a branch was previously taken, there’s a good chance it’ll be taken again in the future for(i=0; i < 100000; i++) { /* do stuff */ } This branch will be taken 99,999 times in a row. Lecture 4: Correlated Branch Predictors

  4. Simple Predictor • Always predict NT • no fetch bubbles (always just fetch the next line) • does horribly on previous for-loop example • Always predict T • does pretty well on previous example • but what if you have other control besides loops? p = calloc(num,sizeof(*p)); if(p == NULL) error_handler( ); This branch is practically never taken Lecture 4: Correlated Branch Predictors

  5. Last Outcome Predictor
  • Do what you did last time
  0xDC08: for(i=0; i < 100000; i++)
          {
  0xDC44:   if( (i % 100) == 0 )
              tick( );
  0xDC50:   if( (i & 1) == 1 )
              odd( );
          }
  Lecture 4: Correlated Branch Predictors

  6. Misprediction Rates for Last-Outcome Prediction
  • How often is the branch outcome != the previous outcome?
  • DC08: TTTTTTTTTTT ... TTTTTTTTTTNTTTTTTTTT ... (100,000 iterations): only the T-to-N and N-to-T transitions mispredict, 2 / 100,000, a 99.998% prediction rate
  • DC44: TTTTT ... TNTTTTT ... TNTTTTT ...: 2 mispredictions per 100, so 98.0%
  • DC50: TNTNTNTNTNTNTNTNTNTNTNTNTNTNT ...: every outcome differs from the previous one, 2 / 2, so 0.0%
  Lecture 4: Correlated Branch Predictors

  7. Prediction FSMs
  • FSM for Last-Outcome Prediction: two states, 0 (predict NT) and 1 (predict T); transition to state 1 on a T outcome and to state 0 on an NT outcome
  • FSM for 2bC (2-bit Counter): a saturating two-bit counter with states 0 through 3; states 0 and 1 predict NT, states 2 and 3 predict T; count up (saturating at 3) on a T outcome, count down (saturating at 0) on an NT outcome
  Lecture 4: Correlated Branch Predictors

  8. Initial Training/Warm-up
  • Outcomes:  T T T T T T N T T T ...
  • 1bC state: 0 1 1 1 1 1 1 0 1 1 ... (mispredicts the first T, then twice around every isolated N)
  • 2bC state: 0 1 2 3 3 3 3 2 3 3 ... (once warmed up, only 1 mispredict per N outcome)
  • Only 1 mispredict per N outcome now!  DC08: 99.999%  DC44: 99.0%
  Lecture 4: Correlated Branch Predictors
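
  A minimal C sketch of the 2bC machine from slides 7-8, run on the outcome stream above. The state encoding and the demo stream come from the slides; the function names and the stand-alone main() are illustrative.

  #include <stdio.h>

  /* 2-bit saturating counter: states 0,1 predict NT; states 2,3 predict T */
  static int predict_2bc(unsigned c) { return c >= 2; }
  static unsigned update_2bc(unsigned c, int taken)
  {
      if (taken)  return (c < 3) ? c + 1 : 3;   /* count up, saturate at 3   */
      else        return (c > 0) ? c - 1 : 0;   /* count down, saturate at 0 */
  }

  int main(void)
  {
      /* Outcome stream from slide 8: T T T T T T N T T T */
      int outcomes[10] = {1,1,1,1,1,1,0,1,1,1};
      unsigned c = 0;
      int miss = 0;
      for (int i = 0; i < 10; i++) {
          if (predict_2bc(c) != outcomes[i]) miss++;
          c = update_2bc(c, outcomes[i]);
      }
      printf("2bC mispredictions: %d of 10\n", miss);  /* two warm-up misses plus one per N */
      return 0;
  }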

  9. Importance of Branches
  • 98% → 99%
  • Whoop-Dee-Do!
  • Actually, it’s a 2% misprediction rate → 1%
  • That’s a halving of the number of mispredictions
  • So what?
  • If the misprediction rate is 50%, and 1 in 5 insts is a branch, then the number of useful instructions that we can fetch is: 5·(1 + ½ + (½)² + (½)³ + … ) = 10
  • If we halve the miss rate down to 25%: 5·(1 + ¾ + (¾)² + (¾)³ + … ) = 20
  • Halving the miss rate doubles the number of useful instructions that we can try to extract ILP from
  Lecture 4: Correlated Branch Predictors
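
  A worked form of the sums above, with p the per-branch probability of a correct prediction and 5 instructions fetched per branch:

  \text{useful instructions} \;\approx\; 5\sum_{k=0}^{\infty} p^{k} \;=\; \frac{5}{1-p};
  \qquad p=\tfrac{1}{2}:\ \frac{5}{1-\tfrac{1}{2}} = 10;
  \qquad p=\tfrac{3}{4}:\ \frac{5}{1-\tfrac{3}{4}} = 20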

  10. Typical Organization of a 2bC Predictor
  • … back to predictors
  • The PC (32 or 64 bits) is hashed down to log2(n) bits and used to index a table of n entries/counters
  • The selected counter supplies the prediction
  • FSM update logic updates the counter in the table with the actual outcome
  Lecture 4: Correlated Branch Predictors

  11. Typical Hash
  • Just take the log2(n) least-significant bits of the PC
  • May need to ignore a few bits
  • In a 32-bit RISC ISA, all instructions are 4 bytes wide and all instruction addresses are 4-byte aligned → the two least-significant bits of the PC are always zero, so they are not included
  • equivalent to right-shifting the PC by two positions before hashing
  • In a variable-length CISC ISA (ex. x86), instructions may start on arbitrary byte boundaries
  • probably don’t want to shift
  Lecture 4: Correlated Branch Predictors
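
  A small C sketch of the indexing just described; TABLE_BITS and the function names are illustrative, not from the slide.

  #include <stdint.h>

  #define TABLE_BITS 12   /* log2(n), for a table of n = 4096 counters */

  /* Fixed-width 4-byte RISC ISA: the two low PC bits are always zero,
   * so right-shift by two before taking the low TABLE_BITS bits.      */
  static uint32_t pht_index_risc(uint64_t pc)
  {
      return (uint32_t)((pc >> 2) & ((1u << TABLE_BITS) - 1));
  }

  /* Variable-length CISC ISA (e.g., x86): don't shift, just mask. */
  static uint32_t pht_index_cisc(uint64_t pc)
  {
      return (uint32_t)(pc & ((1u << TABLE_BITS) - 1));
  }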

  12. How about the Branch at 0xDC50? • 1bc and 2bc don’t do too well (50% at best) • But it’s still obviously predictable • Why? • It has a repeating pattern: (NT)* • How about other patterns? (TTNTN)* • Use branch correlation • The outcome of a branch is often related to previous outcome(s) Lecture 4: Correlated Branch Predictors

  13. Idea: Track the History of a Branch
  • Each PC-indexed entry stores the Previous Outcome of the branch plus two 2-bit counters: one used when prev = 0 and one used when prev = 1
  • The counter selected by the stored previous outcome makes the prediction; the counter and the previous-outcome bit are then updated with the actual result
  • Example walkthrough for an alternating branch: with prev = 0 the selected counter saturates high (prediction = T) and with prev = 1 the other counter saturates low (prediction = N), so the (TN)* pattern is predicted correctly
  Lecture 4: Correlated Branch Predictors

  14. Deeper History Covers More Patterns
  • Each entry now stores the Last 3 Outcomes plus one counter per history value (counter if prev=000, counter if prev=001, …, counter if prev=111)
  • What pattern has this branch predictor entry learned?
  • 001 → 1; 011 → 0; 110 → 0; 100 → 1
  • 00110011001… = (0011)*
  Lecture 4: Correlated Branch Predictors

  15. Predictor Organizations
  • Different pattern history for each branch PC (the PC hash selects both the entry and its private counters)
  • Shared set of patterns (all branches index one common set of counters)
  • Mix of both (the PC hash selects a set of counters; the history selects the counter within the set)
  Lecture 4: Correlated Branch Predictors

  16. Example (1)
  • 1024 counters (2^10)
  • 32 sets (2^5)
  • 5-bit PC hash chooses a set
  • Each set has 32 counters
  • 32 x 32 = 1024
  • History length of 5 (log2(32) = 5)
  • Branch collisions
  • 1000’s of branches collapsed into only 32 sets
  Lecture 4: Correlated Branch Predictors

  17. Example (2)
  • 1024 counters (2^10)
  • 128 sets (2^7)
  • 7-bit PC hash chooses a set
  • Each set has 8 counters
  • 128 x 8 = 1024
  • History length of 3 (log2(8) = 3)
  • Limited Patterns/Correlation
  • Can now only handle a history length of three
  Lecture 4: Correlated Branch Predictors

  18. Two-Level Predictor Organization
  • Branch History Table (BHT)
  • 2^a entries
  • h-bit history per entry
  • Pattern History Table (PHT)
  • 2^b sets
  • 2^h counters per set (each entry is a 2-bit counter)
  • Total size in bits: h·2^a + 2·2^(b+h)
  • The PC hash provides a bits to index the BHT and b bits to select a PHT set; the h history bits select the counter within the set
  Lecture 4: Correlated Branch Predictors
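
  A compact C sketch of this organization. The parameters a, h, b follow the slide; the simple mask-based hashes and the update policy are assumptions of the sketch.

  #include <stdint.h>

  #define A 10   /* BHT: 2^A entries, each holding an H-bit local history */
  #define H 8    /* history length                                        */
  #define B 4    /* PHT: 2^B sets, each with 2^H 2-bit counters           */

  static uint16_t bht[1 << A];            /* per-branch history registers */
  static uint8_t  pht[1 << B][1 << H];    /* 2-bit counters               */

  static int twolevel_predict(uint64_t pc)
  {
      uint32_t b = (pc >> 2) & ((1u << A) - 1);           /* BHT index        */
      uint32_t s = (pc >> 2) & ((1u << B) - 1);           /* PHT set          */
      uint32_t h = bht[b] & ((1u << H) - 1);              /* local history    */
      return pht[s][h] >= 2;                              /* MSB => predict T */
  }

  static void twolevel_update(uint64_t pc, int taken)
  {
      uint32_t b = (pc >> 2) & ((1u << A) - 1);
      uint32_t s = (pc >> 2) & ((1u << B) - 1);
      uint32_t h = bht[b] & ((1u << H) - 1);
      uint8_t *c = &pht[s][h];
      if (taken  && *c < 3) (*c)++;                       /* saturating 2bC   */
      if (!taken && *c > 0) (*c)--;
      bht[b] = (uint16_t)((h << 1) | (taken ? 1 : 0));    /* shift in outcome */
  }

  With these parameters the storage is h·2^a + 2·2^(b+h) = 8·1024 + 2·4096 = 16,384 bits, matching the size formula on the slide.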

  19. Classes of Two-Level Predictors • h = 0 or a = 0 (Degenerate Case) • Regular table of 2bC’s (b = log2(#counters)) • h > 0, a > 1 • “Local History” 2-level predictor • h > 0, a = 1 • “Global History” 2-level predictor Lecture 4: Correlated Branch Predictors

  20. Global vs. Local Branch History • Local Behavior • What is the predicted direction of Branch A given the outcomes of previous instances of Branch A? • Global Behavior • What is the predicted direction of Branch Z given the outcomes of all* previous branches A, B, …, X and Y? * number of previous branches tracked limited by the history length Lecture 4: Correlated Branch Predictors

  21. Why Global Correlations Exist
  • Example: related branch conditions
  p = findNode(foo);
  A: if ( p is parent ) do something;
  do other stuff; /* may contain more branches */
  B: if ( p is a child ) do something else;
  • Outcome of the second branch (B) is always the opposite of the first branch (A)
  Lecture 4: Correlated Branch Predictors

  22. Other Global Correlations
  • Testing same/similar conditions
  • code might test for NULL before a function call, and the function might test for NULL again
  • in some cases it may be faster to recompute a condition rather than save a previous computation in memory and re-load it
  • partial correlations: one branch could test for cond1, and another branch could test for cond1 && cond2 (if cond1 is false, then the second branch can be predicted as false)
  • multiple correlations: one branch tests cond1, a second tests cond2, and a third tests a combination of cond1 and cond2 (which can always be predicted if the first two branches are known)
  Lecture 4: Correlated Branch Predictors

  23. A Global-History Predictor
  • Single global branch history register (BHR) of h bits, shared by all branches
  • The PC is hashed down to b bits; together with the h BHR bits this forms a (b+h)-bit index into the table of 2-bit counters
  Lecture 4: Correlated Branch Predictors

  24. Similar Tradeoff Between b and h
  • For a fixed number of counters
  • Larger h → Smaller b
  • Larger h → longer history
  • able to capture more patterns
  • longer warm-up/training time
  • Smaller b → more branches map to the same set of counters
  • more interference
  • Larger b → Smaller h
  • just the opposite…
  Lecture 4: Correlated Branch Predictors

  25. Motivation for Combined Indexing
  • Not all 2^h “states” are used
  • (TTNN)* only uses half of the states for a history length of 3, and only ¼ of the states for a history length of 4
  • (TN)* only uses two states no matter how long the history length is
  • Not all bits of the PC are uniformly distributed
  • Not all bits of the history are uniformly likely to be correlated
  • more recent history is more likely to be strongly correlated
  Lecture 4: Correlated Branch Predictors

  26. Combined Index Example: gshare
  • S. McFarling (DEC-WRL TR, 1993)
  • A k-bit hash of the PC is XORed with the k most recent global history bits to form the PHT index, where k = log2(#counters)
  Lecture 4: Correlated Branch Predictors
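
  A minimal gshare sketch in C. The XOR of the k-bit PC hash with the k-bit global history is the part the slide specifies; the counter width and update details are standard 2bC boilerplate assumed for the sketch.

  #include <stdint.h>

  #define K 14                              /* k = log2(#counters)   */

  static uint8_t  pht[1 << K];              /* 2-bit counters        */
  static uint32_t ghr;                      /* global branch history */

  static uint32_t gshare_index(uint64_t pc)
  {
      return (uint32_t)((pc >> 2) ^ ghr) & ((1u << K) - 1);   /* PC xor history */
  }

  static int gshare_predict(uint64_t pc)
  {
      return pht[gshare_index(pc)] >= 2;
  }

  static void gshare_update(uint64_t pc, int taken)
  {
      uint8_t *c = &pht[gshare_index(pc)];  /* GHR not yet shifted: same index as at prediction */
      if (taken  && *c < 3) (*c)++;
      if (!taken && *c > 0) (*c)--;
      ghr = (ghr << 1) | (taken ? 1u : 0u); /* shift the outcome into the GHR */
  }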

  27. Gshare example Insufficient history leads to a conflict Lecture 4: Correlated Branch Predictors

  28. Some Interference May Be Tolerable
  • Branch A: always not-taken
  • Branch B: always taken
  • Branch C: TNTNTN…
  • Branch D: TTNNTTNN…
  • Resulting PHT contents (index → counter): 000 → 0, 001 → 3, 010 → 3, 011 → 0, 100 → 3, 101 → 0, 110 → 0, 111 → 3
  Lecture 4: Correlated Branch Predictors

  29. And Then It Might Not
  • Branch X: TTTNTTTN…
  • Branch Y: TNTNTN…
  • Branch Z: TTTT…
  • Now some PHT entries receive conflicting updates from different branches, so their counters (shown as “?” on the slide) cannot settle on a single correct prediction
  Lecture 4: Correlated Branch Predictors

  30. Interference Reducing Predictors • There are patterns and asymmetries in branches • Not all patterns occur with same frequency • Branches have biases • This lecture: • Bi-Mode (Lee et al., MICRO 97) • gskewed (Michaud et al., ISCA 97) • These are global history predictors, but the ideas can be applied to other types of predictors Lecture 4: Correlated Branch Predictors

  31. Gskewed idea • Interference occurs because two (or more) branches hash to the same index • A different hash function can prevent this collision • but may cause other collisions • Use multiple hash functions such that a collision can only occur in a few cases • use a majority vote to make final decision Lecture 4: Correlated Branch Predictors

  32. Gskewed organization
  • The PC and global history are hashed by three different functions (hash1, hash2, hash3), each indexing its own PHT (PHT1, PHT2, PHT3)
  • The hash functions are chosen so that if hash1(x) = hash1(y), then hash2(x) ≠ hash2(y) and hash3(x) ≠ hash3(y)
  • A majority vote (maj) of the three counters gives the final prediction
  Lecture 4: Correlated Branch Predictors
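
  A C sketch of the prediction side. The three hash functions here are simple placeholders, not the skewing functions from the Michaud et al. paper, which are constructed so that a collision in one bank implies no collision in the other two.

  #include <stdint.h>

  #define K 12
  static uint8_t  pht1[1 << K], pht2[1 << K], pht3[1 << K];   /* three banks of 2bCs */
  static uint32_t ghr;                                        /* global history      */

  /* Placeholder hashes over (PC, history); real gskewed uses dedicated
   * skewing functions with the pairwise anti-collision property.       */
  static uint32_t h1(uint64_t pc) { return (uint32_t)((pc >> 2) ^  ghr      ) & ((1u << K) - 1); }
  static uint32_t h2(uint64_t pc) { return (uint32_t)((pc >> 3) ^ (ghr << 1)) & ((1u << K) - 1); }
  static uint32_t h3(uint64_t pc) { return (uint32_t)((pc >> 4) ^ (ghr >> 1)) & ((1u << K) - 1); }

  static int gskewed_predict(uint64_t pc)
  {
      int p1 = pht1[h1(pc)] >= 2;
      int p2 = pht2[h2(pc)] >= 2;
      int p3 = pht3[h3(pc)] >= 2;
      return (p1 + p2 + p3) >= 2;     /* majority vote of the three banks */
  }

  The update (not shown) would apply the branch outcome to the three indexed counters and shift it into the global history.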

  33. Gskewed example
  • [figure] Branches A and B indexing the three PHTs; the majority vote (maj) produces the final prediction
  Lecture 4: Correlated Branch Predictors

  34. Combining Predictors • Some branches exhibit local history correlations • ex. loop branches • While others exhibit global history correlations • “spaghetti logic”, ex. if-elsif-elsif-elsif-else branches • Using a global history predictor prevents accurate prediction of branches exhibiting local history correlations • And vice versa Lecture 4: Correlated Branch Predictors

  35. Tournament Hybrid Predictors
  • Two component predictors, Pred0 and Pred1, plus a Meta-Predictor: a table of 2-/3-bit counters
  • Final Prediction: if the meta-counter MSB = 0, use Pred0, else use Pred1
  Lecture 4: Correlated Branch Predictors
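
  A small C sketch of the selection logic. The meta-table indexing and the usual "train the meta-counter only when the components disagree" rule are assumptions of the sketch rather than details given on the slide.

  #include <stdint.h>

  #define M 12
  static uint8_t meta[1 << M];                     /* 2-bit meta-counters */

  /* p0, p1: predictions already produced by the two component predictors. */
  static int tournament_select(uint64_t pc, int p0, int p1)
  {
      uint32_t i = (pc >> 2) & ((1u << M) - 1);
      return (meta[i] >= 2) ? p1 : p0;             /* MSB = 0 -> use pred0, else pred1 */
  }

  static void tournament_update(uint64_t pc, int p0, int p1, int taken)
  {
      uint32_t i = (pc >> 2) & ((1u << M) - 1);
      if (p0 == p1) return;                        /* nothing to learn if they agree  */
      if (p1 == taken && meta[i] < 3) meta[i]++;   /* pred1 was right: lean toward it */
      else if (p0 == taken && meta[i] > 0) meta[i]--;  /* pred0 was right             */
  }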

  36. Common Combinations • Global history + Local history • “easy” branches + global history • 2bC and gshare • short history + long history • Many types of behavior, many combinations Lecture 4: Correlated Branch Predictors

  37. Multi-Hybrids
  • Why only combine two predictors?
  • [figure] e.g., a tree of meta-predictors (M01 chooses between P0 and P1, M23 between P2 and P3, and a final M between the two winners) vs. a single meta-predictor M choosing among P0–P3 directly
  • Tradeoff between making good individual predictions (P’s) vs. making good meta-predictions (M’s)
  • for a fixed hardware budget, improving one may hurt the other
  Lecture 4: Correlated Branch Predictors

  38. Prediction Fusion
  • Selection discards information from n-1 predictors
  • Fusion feeds all component predictions (P0–P3) into the final predictor and attempts to synthesize all of the information
  • more info to work with
  • possibly more junk to sort through
  Lecture 4: Correlated Branch Predictors

  39. Using Long Branch Histories • Long global history provides more context for branch prediction/pattern matching • more potential sources of correlation • Costs • For a PHT-based approach, HW cost increases exponentially: O(2^h) counters • Training time increases, which may decrease overall accuracy Lecture 4: Correlated Branch Predictors

  40. Predictor Training Time
  • Ex: prediction equals the opposite of the 2nd most recent outcome
  • Hist Len = 2
  • 4 states to train: NN → T, NT → T, TN → N, TT → N
  • Hist Len = 3
  • 8 states to train: NNN → T, NNT → T, NTN → N, NTT → N, TNN → T, …
  Lecture 4: Correlated Branch Predictors

  41. Neural Branch Prediction • Uses “Perceptron” from classical machine learning theory • simplest form of a neural-net (single-layer, single-node) • Inputs are past branch outcomes • Compute weighted sum of inputs • output is linear function of inputs • sign of output is used for the final prediction Lecture 4: Correlated Branch Predictors

  42. Perceptron Predictor
  • [figure] The branch history bits (1 = taken, 0 = not-taken) are mapped to inputs x0, x1, …, xn with values +1/−1; x0 is a constant 1 (“bias”) input
  • Each input xi is multiplied by its weight wi, an adder sums the products, and the branch is predicted taken if the sum ≥ 0
  Lecture 4: Correlated Branch Predictors

  43. Perceptron Predictor (2)
  • Magnitude of weight wi determines how correlated branch i is to the current branch
  • Sign of the weight determines positive or negative correlation
  • Ex. outcome is usually the opposite of the 5th oldest branch
  • w5 has a large magnitude (L), but is negative
  • if x5 is taken, then w5·x5 = −L·1 = −L
  • tends to make the sum more negative (toward a NT prediction)
  • if x5 is not taken, then w5·x5 = −L·−1 = L
  Lecture 4: Correlated Branch Predictors

  44. Perceptron Predictor (3)
  • When the actual branch outcome is known:
  • if xi = outcome, then increment wi (positive correlation)
  • if xi ≠ outcome, then decrement wi (negative correlation)
  • for x0, increment if the branch is taken, decrement if NT
  • “Done with training”
  • if |Σ wi·xi| > θ, then don’t update the weights unless the branch was mispredicted
  Lecture 4: Correlated Branch Predictors
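
  A C sketch of the perceptron predictor from slides 42-44. The history length, table size, weight width, and the threshold value THETA are illustrative; the update rule follows the slide (adjust weights on a mispredict, or while |Σ wi·xi| ≤ θ).

  #include <stdint.h>

  #define H     16        /* history length                     */
  #define N     512       /* number of perceptrons in the table */
  #define THETA 37        /* training threshold (illustrative)  */

  static int8_t w[N][H + 1];        /* w[][0] is the bias weight             */
  static int    x[H + 1] = { 1 };   /* x[0] = 1 always; x[i] = +1 T, -1 NT   */

  static int perceptron_sum(uint64_t pc)
  {
      int8_t *wp = w[(pc >> 2) % N];
      int sum = 0;
      for (int i = 0; i <= H; i++)
          sum += wp[i] * x[i];      /* weighted sum of past outcomes */
      return sum;
  }

  static int perceptron_predict(uint64_t pc)
  {
      return perceptron_sum(pc) >= 0;   /* sign of the sum is the prediction */
  }

  static void perceptron_train(uint64_t pc, int taken)
  {
      int8_t *wp  = w[(pc >> 2) % N];
      int     sum = perceptron_sum(pc);
      int     t   = taken ? 1 : -1;
      int     mispredicted = (sum >= 0) != taken;

      /* Update on a mispredict, or while the magnitude is still below THETA. */
      if (mispredicted || (sum < 0 ? -sum : sum) <= THETA) {
          for (int i = 0; i <= H; i++) {
              if (x[i] == 0) continue;                        /* not yet warmed up  */
              if (x[i] == t) { if (wp[i] <  127) wp[i]++; }   /* agree: strengthen  */
              else           { if (wp[i] > -128) wp[i]--; }   /* disagree: weaken   */
          }
      }
      /* Shift the new outcome into the history; x[0] stays the constant bias. */
      for (int i = H; i > 1; i--) x[i] = x[i - 1];
      x[1] = t;
  }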

  45. Perceptron Trains Quickly • If no correlation exists with branch i, then wi will just get incremented and decremented back and forth, so wi stays near 0 • If correlation exists with branch j, then wj will be consistently incremented (or decremented) to have a large influence on the overall sum Lecture 4: Correlated Branch Predictors

  46. Linearly Inseparable Functions
  • Perceptron computes a linear combination of its inputs
  • Can only learn linearly separable functions
  • [figure, left] A linearly separable example over xi, xj ∈ {−1, +1}: the T and N outcomes can be split by a straight line, realized here by weights giving f() = −3·xi − 4·xj − 5
  • [figure, right] An inseparable example: no values of wi, wj, w0 exist to satisfy these outputs; no straight line separates the T’s from the N’s
  Lecture 4: Correlated Branch Predictors

  47. Overall Hardware Organization
  • The PC hash selects one set of weights from a table of weights
  • The BHR bits feed the multipliers; an adder sums the products; prediction = sign(sum)
  • Size = (h+1)·k·n + h bits + Area(mult) + Area(adder)
  • h = history length, k = counter (weight) width, n = number of perceptrons in the table
  Lecture 4: Correlated Branch Predictors

  48. GEHL
  • GEometric History Length predictor
  • Several tables of k-bit weights, each indexed by the PC hashed with a different slice (h1, h2, h3, h4, …) of a very long global branch history (lengths L1, L2, L3, L4, …)
  • History lengths form a geometric progression: L(i) = a^(i−1) · L(1)
  • An adder sums the selected weights; prediction = sign(sum)
  Lecture 4: Correlated Branch Predictors
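
  A tiny C check of the geometric-progression formula L(i) = a^(i−1)·L(1); the ratio a and the base length L1 here are made-up values just to show the growth.

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      const double a  = 2.0;   /* common ratio (illustrative)         */
      const double L1 = 2.0;   /* shortest history length (illustr.)  */

      for (int i = 1; i <= 8; i++)                 /* L(i) = a^(i-1) * L(1) */
          printf("L(%d) = %.0f\n", i, pow(a, i - 1) * L1);
      return 0;
  }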

  49. PPM Predictors
  • PPM = Partial Pattern Matching
  • Used in data compression
  • Idea: Use the longest history necessary, but no longer
  • A base table of 2bc counters indexed by the PC, plus tagged tables of 2bc counters indexed by the PC hashed with progressively longer history slices h1, h2, h3, h4 (most recent to oldest)
  • Partial tags are compared (=) at each table; a chain of muxes selects the prediction from the longest-history table whose tag matches, falling back toward the base 2bc
  Lecture 4: Correlated Branch Predictors
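
  A C sketch of the "longest matching history wins" lookup. The table sizes, tag width, and the index/tag hashes are illustrative placeholders; real PPM/TAGE hash functions are constructed more carefully.

  #include <stdint.h>

  #define NTAB  4                                   /* number of tagged tables   */
  #define TBITS 10                                  /* 2^TBITS entries per table */
  static const int hlen[NTAB] = { 4, 8, 16, 32 };   /* history length per table  */

  struct entry { uint8_t ctr; uint16_t tag; };      /* 2bc counter + partial tag */

  static struct entry tab[NTAB][1 << TBITS];
  static uint8_t      base2bc[1 << TBITS];          /* PC-indexed fallback 2bc   */
  static uint64_t     ghr;                          /* global history            */

  static uint32_t idx(int t, uint64_t pc)           /* illustrative index hash   */
  {
      uint64_t h = ghr & ((1ull << hlen[t]) - 1);
      return (uint32_t)((pc >> 2) ^ h ^ (h >> TBITS)) & ((1u << TBITS) - 1);
  }

  static uint16_t ptag(int t, uint64_t pc)          /* illustrative partial tag  */
  {
      return (uint16_t)(((pc >> 2) ^ (ghr >> hlen[t])) & 0x3ff);
  }

  static int ppm_predict(uint64_t pc)
  {
      /* Use the longest history whose partial tag matches, but no longer. */
      for (int t = NTAB - 1; t >= 0; t--) {
          struct entry *e = &tab[t][idx(t, pc)];
          if (e->tag == ptag(t, pc))
              return e->ctr >= 2;
      }
      return base2bc[(pc >> 2) & ((1u << TBITS) - 1)] >= 2;   /* no match: base 2bc */
  }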

  50. TAGE Predictor • Similar to PPM, but uses geometric history lengths • Currently the most accurate type of branch prediction algorithm • References (www.jilp.org): • PPM: Michaud (CBP-1) • O-GEHL: Seznec (CBP-1) • TAGE: Seznec & Michaud (JILP) • L-TAGE: Seznec (CBP-2) Lecture 4: Correlated Branch Predictors
