490 likes | 651 Views
Neural Methods for Dynamic Branch Prediction. Daniel A. Jiménez Calvin Lin Dept. of Computer Science Dept. of Computer Science Rutgers University Univ. of Texas Austin Presented by: Rohit Mittal. Overview.
E N D
Neural Methods for Dynamic Branch Prediction Daniel A. Jiménez Calvin Lin Dept. of Computer Science Dept. of Computer Science Rutgers University Univ. of Texas Austin Presented by: Rohit Mittal
Overview • Branch prediction background • Applying machine learning to branch prediction • Results and analysis • Future work and conclusions
Outline • What are branches? • Reducing branch penalties • Branch prediction • Why is branch prediction necessary? • Branch prediction basics • Issues which affect accurate branch prediction • Examples of real predictors
Branches • Instructions which can alter the flow of instruction execution in a program
The Context • How can we exploit program behavior to make it go faster? • Remove control dependences • Increase instruction-level parallelism
An Example • The inner loop of this code executes two statements each time through the loop. int foo (int w[], bool v[], int n) { int sum = 0; for (int i=0; i<n; i++) { if (v[i]) sum += w[i]; else sum += ~w[i]; } return sum; }
An Example continued • This C++ code computes the same thing with three statements in the loop. • This version is 55% faster on a Pentium 4. • Previous version had many mispredicted branch instructions. int foo2 (int w[], bool v[], int n) { int sum = 0; for (int i=0; i<n; i++) { int a = w[i]; int b = - (int) v[i]; sum += ~(a ^ b); } return sum; }
Branch Prediction • To speed up the process, pipelining overlaps execution of multiple instructions, exploiting parallelism between instructions. • Conditional branches create a problem for pipelining: the next instruction can't be fetched until the branch has executed, several stages later. • A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path. Branch predictors must be highly accurate to avoid mispredictions!
Why good Branch Prediction is necessary.. • Branches are frequent - 15-25% • Today’s pipelines are deeper and wider • Higher performance penalty for stalling • High Misprediction Penalty • A lot of cycles can be wasted!!!
Branch Predictors Must Improve • The cost of a misprediction is proportional to pipeline depth • As pipelines deepen, we need more accurate branch predictors • Pentium 4 pipeline has 20 stages • Future pipelines will have > 32 stages • Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage • Decreasing misprediction rate from 9% to 4% results in 31% speedup for 32 stage pipeline Simulations with SimpleScalar/Alpha
Branch Prediction • Predicting the outcome of a branch • Direction: • Taken / Not Taken • Direction predictors • Target Address • PC+offset (Taken)/ PC+4 (Not Taken) • Target address predictors • Branch Target Address Cache (BTAC) or Branch Target Buffer (BTB)
Why do we need branch prediction? • Branch prediction • Increases the number of instructions available for the scheduler to issue. Increases instruction level parallelism (ILP) • Allows useful work to be completed while waiting for the branch to resolve
Branch Prediction Strategies • Static • Decided before runtime • Examples: • Always-Not Taken • Always-Taken • Backwards Taken, Forward Not Taken (BTFNT) • Profile-driven prediction • Dynamic • Prediction decisions may change during the execution of the program
Dynamic Branch Prediction • Performance = ƒ(accuracy, cost of misprediction) • Branch History Table (BHT) is simplest • Also called a branch-prediction buffer • Lower bits of branch address index table of 1-bit values • Says whether or not branch taken last time • If branch was taken last time, then take again • Initially, bits are set to predict that all branches are taken
1-bit Branch History Table Problems : Two branches can have the same low-order bits. In a loop, 1-bit BHT will cause two mispredictions:End of loop case, when it exits instead of looping as beforeFirst time through loop on next time through code, when it predicts exit instead of looping LOOP: LOAD R1, 100(R2) MUL R6, R6, R1 SUBI R2, R2, #4 BNEZ R2, LOOP
2-bit Predictor Solution : 2-bit predictor scheme where change prediction only if mispredict twice in a row T NT Predict Taken PredictTaken T T NT NT Predict Not Taken Predict Not Taken T NT • This idea can be extended to n-bit saturating counters • Increment counter when branch is taken • Decrement counter when branch is not taken • If counter <= 2n-1, then predict the branch is taken; else not taken.
Correlating Branches • Often the behavior of one branch is correlated with the behavior of other branches. • Example C code if (aa == 2) B1 aa = 0; if (bb == 2) B2 bb = 0; if (aa != bb) B3 cc = 4; • If the first two branches are not taken, the third one will be. • B3 can be predicted with 100% accuracy based on the outcomes of B1 and B2
Correlating Branches – contd. • Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch • Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table • In general, (m,n) predictor means record last m branches to select between 2m history tables each with n-bit counters • Old 2-bit BHT is then a (0,2) predictor
Branch PC Predicted PC Need Address at same time as Prediction • Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken) • Note: must check for branch match now, since can’t use wrong branch address • Return instruction addresses predicted with stack PC of instruction FETCH =? Predict taken or not taken
Branch Target Buffer • A branch-target buffer or branch-target cache stores the predicted address of branches that are predicted to be taken. • Values not in the buffer are predicted to be not taken. • The branch-target buffer is accessed during the IF stage, based on the k low order bits of the branch address. • If the branch-target is in the buffer and is predicted correctly, the one cycle stall is eliminated.
Branch Predictor Accuracy • Larger tables and smarter organizations yield better accuracy • Longer histories provide more context for finding correlations • Table size is exponential in history length • The cost is increased access delay and chip area
Alpha 21264 • 8-stage pipeline, mispredict penalty 7 cycle • 64 KB, 2-way instruction cache with line and way prediction bits (Fetch) • Each 4-instruction fetch block contains a prediction for the next fetch block • Hybrid predictor (Fetch) • 12-bit GAg (4K-entry PHT, 2 bit counters) • 10-bit PAg (1K-entry BHT, 1K-entry PHT, 3-bit counters)
Ultra Sparc III • 14-stage pipeline, branch prediction accessed in instruction fetch stages 2-3 • 16K-entry 2-bit counter Gshare predictor • Bimodal predictor which XOR’s PC bits with global history register (except 3 lower order bits) to reduce aliasing • Miss queue • Halves mispredict penalty by providing instructions for immediate use
Pentium III • Dynamic branch prediction • 512-entry BTB predicts direction and target, 4-bit history used with PC to derive direction • Static branch predictor for BTB misses • Branch Penalties: • Not Taken: no penalty • Correctly predicted taken: 1 cycle • Mispredicted: at least 9 cycles, as many as 26, average 10-15 cycles
AMD Athlon K7 • 10-stage integer, 15-stage fp pipeline, predictor accessed in fetch • 2K-entry bimodal predictor, 2K-entry BTB • Branch Penalties: • Correct Predict Taken: 1 cycle • Mispredict penalty: at least 10 cycles
Branch Prediction is a Machine Learning Problem • So why not apply a machine learning algorithm? • Replace 2-bit counters with a more accurate predictor • Tight constraints on prediction mechanism • Must be fast and small enough to work as a component of a microprocessor • Artificial neural networks • Simple model of neural networks in brain cells • Learn to recognize and classify patterns • Most neural nets are slow and complex relative to tables • For branch prediction, we need a small and fast neural method
A Neural Method for Branch Prediction • Several neural methods were investigated • Most were too slow, too big, or not accurate enough • The perceptron[Rosenblatt `62, Block `62] • Very high accuracy for branch prediction • Prediction and update are quick, relative to other neural methods • Sound theoretical foundation; perceptron convergence theorem • Proven to work well for many classification problems
Branch-Predicting Perceptron • Inputs (x’s) are from branch history register • Weights (w’s) are small integers learned by on-line training • Output (y)gives prediction; dot product of x’s and w’s • Training finds correlations between history and outcome • w0 – bias, independent of the history
Training Perceptrons • W’ – i.e. new weights vector, might be a worse set of weights for any other training example. It is not evident that this is a useful algorithm. • Perception Convergence Theorem: If any set of weights exist that correctly classify a finite set of training examples, then perceptron learning will come up with a (possibly different) set of weights that also correctly classifies all examples after a finite number of change steps, for a finite separable set of training examples.
Linear Separability • A limitation of perceptrons is that they are only capable of learning linearly separable functions • A boolean function over variables xi..n is linearly separable iff there exist values for wi..n such that all the true instances can be separated from all the false instances by a hyperplane defined by the solution of: n w0 + ∑ xi wi = 0 i=1 • i.e. If n = 2, the hyperplane is a line.
Linear Separability – contd. • Example: a perceptron can learn the logical AND for two inputs but not the XOR. • A perceptron can still give good predictions for inseparable functions but will not achieve 100% accuracy. In contrast a two level PHT (pattern history table) scheme like gshare can learn any boolean function if given enough time.
Putting it all together – perceptron based predictor • The Branch address is hashed into the table of perceptrons • The ith perceptron is fetched, into a vector register, P1..n of weights. • The value of y is computed as the dot product of P and the global history register • The branch is predicted not taken if y is negative, or taken otherwise • Once this branch is resolved, the outcome is used by the training algorithm to update P • P is written back to the ith entry in the table
Organization of the Perceptron Predictor • Keeps a table of perceptrons, indexed by branch address • Inputs are from branch history register • Predict taken if output 0, otherwise predict not taken • Key intuition: table size isn't exponential in history length, so we can consider much longer histories
Results: Predictor Accuracy • Perceptron outperforms competitive hybrid predictor by 36% at ~4KB; 1.71% vs. 2.66%
Results: Large Hardware Budgets • Multi-component hybrid was the most accurate fully dynamic predictor known in the literature [Evers 2000] • Perceptron predictor is even more accurate
Results: IPC with high clock rate • Pentium 4-like: 20 cycle misprediction penalty, 1.76 GHz • 15.8% higher IPC than gshare, 5.7% higher than hybrid
Analysis: History Length • The fixed-length path branch predictor can also use long histories [Stark, Evers & Patt `98]
Analysis: Training Times • Perceptron “warms up’’ faster
Future Work with Perceptron Predictor • Let's make the best predictor even better • Better representation • Better training algorithm • Latency is a problem • How can we eliminate the latency of the perceptron predictor?
Future Work with Perceptron Predictor • Value prediction • Predict which set of values is likely to be the result of a load operation to mitigate memory latency • Indirect branch prediction • Virtual dispatch • Switch statements in C
Future Work Characterizing Predictability • Branch predictability, value predictability • How can we characterize algorithms in terms of their predictability? • Given an algorithm, how can we transform it so that its branches and values are easier to predict? • How much predictability is inherent in the algorithm, and how much is an artifact of the program structure? • How can we compare different algorithms' predictability?
Conclusions • Neural predictors can improve performance for deeply pipelined microprocessors • Perceptron learning is well-suited for microarchitectural implementation • There is still a lot of work left to be done on the perceptron predictor in particular and microarchitectural prediction in general