Branch Prediction Techniques
15-740 Computer Architecture
Vahe Poladian & Stefan Niculescu
October 14, 2002
Papers surveyed • A Comparative Analysis of Schemes for Correlated Branch Prediction by Cliff Young, Nicolas Gloy, and Michael D. Smith. • Improving Branch Predictors by Correlating on Data Values by Timothy Heil, Zak Smith, and James E. Smith. • A Language for Describing Predictors and its Application to Automatic Synthesis by Joel Emer and Nikolas Gloy.
A Comparative Analysis of Schemes for Correlated Branch Prediction
Framework • Branch execution = (b, d), where b is the branch PC and d ∈ {0, 1} is the outcome • All prediction schemes fit this model: a divider splits the execution stream, e.g. (b5,1), (b3,1), (b4,0), (b5,1), into substreams, and each substream feeds its own predictor
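The divider/substream framework can be sketched in a few lines of Python; everything here (the names, the four-execution history window) is our own illustration, not code from the paper:

```python
from collections import defaultdict

class TwoBit:
    """2-bit saturating counter; predicts taken when the counter >= 2."""
    def __init__(self):
        self.state = 1
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

def run(stream, divider):
    """stream: iterable of (pc, taken); divider: (pc, history) -> substream id."""
    counters = defaultdict(TwoBit)
    history = ()                 # recent (pc, taken) executions
    hits = total = 0
    for pc, taken in stream:
        c = counters[divider(pc, history)]   # each substream gets a predictor
        hits += (c.predict() == taken)
        total += 1
        c.update(taken)
        history = (history + ((pc, taken),))[-4:]   # keep the last 4
    return hits / total
```

A per-branch scheme is `divider = lambda pc, h: pc`; a pattern-history scheme keys on the PC plus the recent direction bits, `lambda pc, h: (pc,) + tuple(d for _, d in h)`; a path-history scheme would key on the recent (pc, d) pairs themselves.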
Differences among prediction schemes • Path History vs Pattern History • Path: (b1,d1), … , (bn,dn), pattern: (d1, … , dn) • Aliasing extent • Multiple streams using the same predictor • Extent of cross-procedure correlation • Adaptivity • Static vs dynamic
Path History vs. Pattern History • Path is potentially more accurate • Compared to a baseline 2-bit-per-branch predictor, path history improves only slightly over pattern history • Path history requires significant storage • The result holds for both static and dynamic predictors
Aliasing vs Non-Aliasing • Aliasing can be constructive, destructive, or harmless • Completely removing aliasing slightly improves accuracy over GAs and Gshare with 4096 2-bit counters • Should we spend effort on techniques that reduce aliasing? • Unaliased path history is slightly better than unaliased pattern history • Under an aliasing constraint this distinction might be insignificant, so designers should be careful • Further, under an equal table-space constraint, path history might even be worse
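A toy experiment (our own construction, not from the paper) makes destructive aliasing concrete: two oppositely biased branches that share one counter destroy each other's state, while separate counters predict both almost perfectly.

```python
def simulate(stream, table_size):
    """Predict with a table of 2-bit counters indexed by pc % table_size."""
    table = [1] * table_size              # start weakly not-taken
    hits = 0
    for pc, taken in stream:
        i = pc % table_size               # a small table forces aliasing
        hits += ((table[i] >= 2) == taken)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
    return hits / len(stream)

# Branch 0x10 is always taken, branch 0x11 never taken, interleaved:
stream = [(0x10, 1), (0x11, 0)] * 100
aliased, unaliased = simulate(stream, 1), simulate(stream, 2)
```

With one shared counter the two branches ping-pong its state and nearly every prediction misses; with two counters both branches are almost perfectly predicted.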
Cross-procedure Correlation • Branches just after procedure entry or just after procedure return are often mispredicted • A static predictor with cross-procedure correlation support performs significantly better than one without • Strong per-stream bias is increased • This result is somewhat moot for hardware predictors, which do not suffer from this problem
Static vs Dynamic • The number of distinct streams for which the static predictor is better is higher, but • The number of branches executed in streams for which the dynamic predictor is better is significantly higher • Is it possible to combine static and dynamic predictors? How? • Assign low-bias streams to the dynamic predictor
Summary - lessons learnt • Path history performs slightly better than pattern history • Removing the effects of aliasing decreases misprediction but increases predictor size • Exploiting cross-procedure correlation improves prediction accuracy • The percentage of adaptive streams is small, but the dynamic branches they execute are significant • Use hybrid schemes to improve accuracy
Learning Predictors Using Genetic Programming
Genetic Algorithms • Optimization technique based on simulating the process of natural selection • High probability that the global optimum is among the results • Principles: • The stronger individuals survive • The offspring of stronger parents tend to combine the strengths of both • Mutations may appear as a result of the evolution process
An Abstract Example [Figure: distribution of individuals in generation 0 vs. distribution of individuals in generation N]
Prediction using GAs • Find Branch Predictors that yield low misprediction rates • Find Indirect Jump predictors with low misprediction rates • Find other good predictors (not addressed in the paper, but potential for a research project)
Prediction using GAs Algorithm • Find an efficient encoding of predictors • Start with a set of 400 random predictors (“generation 0”) • Given generation i (20-30 generations overall): • Rank predictors according to a fitness function • Choose the best to form generation i+1 via: • Copy • Crossover • Mutation
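The loop above, in sketch form. The genome encoding and the operators are deliberately abstract here (the paper evolves predictor expression trees), so this is a generic skeleton rather than the authors' implementation:

```python
import random

def evolve(random_genome, fitness, crossover, mutate,
           pop_size=400, generations=30, keep=40):
    """Generic elitist GA loop: rank, copy the elite, breed the rest."""
    pop = [random_genome() for _ in range(pop_size)]     # generation 0
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)              # rank by fitness
        best = pop[:keep]                                # copy
        children = []
        while len(best) + len(children) < pop_size:
            a, b = random.sample(best, 2)
            children.append(mutate(crossover(a, b)))     # crossover + mutation
        pop = best + children
    return max(pop, key=fitness)
```

Plugging in bit-string genomes and a count-the-ones fitness (the classic OneMax toy problem) shows the loop converging; for the paper, the genome is a predictor expression and fitness is simulated prediction accuracy.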
Primitive predictor • P[w,d](Index; Update): the basic memory unit • Depth d: number of entries • Width w: number of bits per entry [Figure: a d-entry table indexed by Index, written by Update, producing a w-bit Result]
Algebraic notation – BP expressions • Onebit[d](PC;T) = P[1,d](PC;T) • Counter[n,d](I;T) = P[n,d](I; if T then P+1 else P-1) • Twobit[d](PC;T) = MSB(Counter[2,d](PC;T))
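One way to read the P[w,d](Index; Update) primitive in ordinary code (our interpretation; for brevity the prediction read and the outcome-driven update are collapsed into a single access):

```python
class P:
    """Primitive predictor P[w,d]: a table of d entries, w bits each."""
    def __init__(self, w, d):
        self.w, self.table = w, [0] * d
    def access(self, index, update):
        """Return the current value, then store update(old), masked to w bits."""
        i = index % len(self.table)
        old = self.table[i]
        self.table[i] = update(old) & ((1 << self.w) - 1)
        return old

def counter(p, index, taken):
    """Counter[n,d]: P[n,d](I; if T then P+1 else P-1), saturating."""
    top = (1 << p.w) - 1
    return p.access(index, lambda v: min(top, v + 1) if taken else max(0, v - 1))

def twobit_predict(p, pc, taken):
    """Twobit[d]: MSB of a 2-bit counter; 1 means predict taken."""
    return (counter(p, pc, taken) >> 1) & 1
```

For an always-taken branch, a fresh two-bit entry predicts not-taken twice while the counter climbs, then predicts taken from there on.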
Two Bit predictor • Predictor Tree – an example [Figure: expression tree for the two-bit predictor: MSB at the root over a width-2 predictor node P indexed by PC, whose update is IF T then SADD(SELF, 1) else SSUB(SELF, 1)] • Question: how to do crossover and mutation?
Constraints • Validity of expressions • E.g. of an invalid BP expression: in crossover, the terminal T may become the index of another predictor • If not valid, try to modify the individual into a valid BP expression (e.g. T=1) • Encapsulation • Storage size limited to 512 Kbits • When bigger, reduce size by randomly decreasing the size of a predictor node by one
Fitness function • Intuitively, the higher the accuracy, the better the predictor: fitness(P) = accuracy(P) • To compute fitness: • Parse the expression • Create subroutines to simulate the predictor • Run a simulator over benchmarks (SPECint92, SPECint95, IBS, compiled for DEC Alpha) to compute the predictor's accuracy • Not efficient ... Why? Suggestions?
Results – branch prediction • The 6 best predictors kept – 30 generations
Results – Indirect jumps • Best handcrafted predictors: 47% miss • Best learnt predictor: 15% miss • Very complicated structure • Simple learnt predictor with 33.4% miss
Summary • A powerful algebraic notation for encoding multiple types of predictors • Genetic Algorithms can be successfully applied to obtain very good predictors • The best learnt branch predictors are comparable with GShare • The best learnt indirect jump predictors outperform existing ones • In general, the best learnt predictors are too complex to implement • However, subexpressions of these predictors might be useful for creating simpler, more accurate predictors
References: • Genetic Algorithms: A Tutorial* by Wendy Williams • Automatic Generation of Branch Predictors via Genetic Programming by Ziv Bar-Yossef and Kris Hildrum * Note: we reused some slides with author’s consent
Improving Branch Predictors by Correlating on Data Values
The Problem • Despite improvements in prediction techniques, such as • Adding global path info • Refining prediction techniques • Reducing branch table interference • … branch misprediction is still a big problem • Goals of this work: • Understand why • Remedy the problem
Mispredicted Branches • Loops that iterate more times than the history is long • The exit branch is almost always mispredicted, since the (global or local) history is not long enough • A large switch statement close to a branch • Gets the predictors confused • Common in applications such as a compiler • Insight: for PC: CondJmpEq Ra, Rb, Target • Use the data values in Ra and Rb
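A deliberately trivial illustration of the insight (our own construction): any short-history scheme must miss the loop-exit branch every trip, while the operand value resolves the branch exactly.

```python
def loop_stream(n, trips):
    """(operand i, outcome of `branch if i < n`) for `trips` loop runs."""
    return [(i, i < n) for _ in range(trips) for i in range(n + 1)]

def twobit_accuracy(stream):
    """One 2-bit counter on the loop branch: misses the exit on every trip."""
    s, hits = 3, 0                       # start strongly taken
    for _, taken in stream:
        hits += ((s >= 2) == taken)
        s = min(3, s + 1) if taken else max(0, s - 1)
    return hits / len(stream)

def value_accuracy(stream, n):
    """Predicting from the operand value i is exact by construction."""
    return sum((i < n) == taken for i, taken in stream) / len(stream)
```

The counter predicts "taken" throughout, so the one not-taken exit per trip is always a miss; a predictor that sees the comparison operands has nothing left to guess.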
Using Data Values Directly [Figure: the branch PC, global history, and data value history together index the branch predictor] • Challenges: • Large number of data values (typically two values involved per branch) • Out-of-order execution delays the update of the values needed
Intricacies – Too Many Values • Store differences of source registers • Store value patterns, not values • Handle only the exceptional cases • A special predictor, called REP (rare event predictor), is the primary predictor when the value pattern is already in it • If the pattern is not yet in REP, i.e. a non-exceptional case, let the backup (gselect) handle it • If the backup mispredicts, insert the pattern into REP • REP provides data correlation and reduces interference in the backup • The replacement policy of REP is critical
Intricacies – Guessing values • The value is not available at prediction time • Using committed data is not accurate • Employing data prediction is expensive • Idea: use the last-known good value plus a dynamic counter of outstanding instances (fetched but not committed) of that same branch
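A sketch of the bookkeeping this implies (names are ours): track the last committed value per branch plus a counter of fetched-but-not-committed instances, and "age" the prediction index by that count.

```python
class ValueTracker:
    """Last committed value per branch, plus a count of in-flight instances."""
    def __init__(self):
        self.committed = {}     # pc -> last committed value (e.g. Ra - Rb)
        self.inflight = {}      # pc -> fetched-but-not-committed count
    def fetch(self, pc):
        """At fetch: return (last committed value, age) to index the predictor."""
        n = self.inflight.get(pc, 0)
        self.inflight[pc] = n + 1
        return self.committed.get(pc), n
    def commit(self, pc, value):
        """At commit: retire one instance and record its actual value."""
        self.inflight[pc] -= 1
        self.committed[pc] = value
```

Two back-to-back fetches of the same branch see ages 0 and 1, signalling that the committed value is increasingly stale; once both commit, the newest value becomes visible at age 0 again.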
Optimal Configuration Design • The design space of the BDP is very large – how to come up with a good (optimal) configuration? • Use the results of extensive experiments to determine the various configuration parameters • No claim of optimality, but pretty good • Optimal configuration: • REP: indexed by GBH + PC, 6 KB table, 2048 x 3-byte entries; 10 bits for the “pattern” tag, 8 for branch prediction, 6 for the replacement policy • VHT: 2 separate tables, the data cache and the branch count table, both indexed by PC
Conclusions / Discussion • Adding data value information useful to branch prediction • Rare event predictor useful way to handle large number of data values and reduce interference in the traditional predictor • Can be used with other kinds of predictors
Pattern-History vs Path-History • Example: A: if A==0; B: if A==2; M: if …; Y: if A>0 • Path A,M,Y with pattern “11” gives (Y,0); path B,M,Y with the same pattern “11” gives (Y,1) • Using pattern history greatly improves accuracy over a per-branch static predictor • Using path history: little improvement over pattern history
Algebraic notation – BP expressions • Onebit[d](PC;T) = P[1,d](PC;T) • Counter[n,d](I;T) = P[n,d](I; if T then P+1 else P-1) • Twobit[d](PC;T) = MSB(Counter[2,d](PC;T)) • Hist[w,d](I;V) = P[w,d](I; P||V) • Gshare[m](PC;T) = Twobit[2^m](PC ⊕ Hist[m,1](0;T); T)
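The Gshare[m] expression transcribes almost line for line into code (our sketch; Hist[m,1](0;T) becomes a single m-bit global shift register at a fixed index):

```python
class Gshare:
    """Gshare[m]: Twobit[2^m] indexed by PC XOR an m-bit global history."""
    def __init__(self, m=12):
        self.m = m
        self.hist = 0                        # Hist[m,1](0; T)
        self.counters = [1] * (1 << m)       # Twobit[2^m] table
    def _index(self, pc):
        return (pc ^ self.hist) & ((1 << self.m) - 1)
    def predict(self, pc):
        return (self.counters[self._index(pc)] >> 1) & 1   # MSB of the counter
    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        # P||V: shift the outcome into the history register
        self.hist = ((self.hist << 1) | taken) & ((1 << self.m) - 1)
```

For a branch that is always taken, the history register fills with ones, the index stabilizes, and the counter saturates to predict taken.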
Tree Representation • Three types of nodes: • Predictors • Primitive predictor + width + height • Has two descendants: • Left: index expression • Right: update expression • Functions … not an exhaustive list • XOR, CAT, MASKHI/MASKLO, IF, SATUR,MSB • Terminals … not an exhaustive list • PC, Result of the branch (T), SELF(value P)
Results – Indirect jumps • Existing jump predictors’ performance: [table not shown]
Crossover • Randomly choose a node in each of the parents and interchange the corresponding subtrees • What bad things could happen?
Mutation • Applied to children generated by crossover • Node mutation: • Replace a function with another function • Replace a terminal with another terminal • Modify the width/height of a predictor • Tree mutation: • Randomly pick a node N • Replace Subtree(N) with a random subtree of the same height
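Crossover and tree mutation over expression trees can be sketched with tuples as interior nodes and strings as terminals (our representation, not the paper's encoding):

```python
import random

def nodes(t, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, t
    if isinstance(t, tuple):
        for i, child in enumerate(t[1:], start=1):
            yield from nodes(child, path + (i,))

def replace(t, path, sub):
    """Return a copy of t with the subtree at `path` replaced by `sub`."""
    if not path:
        return sub
    i = path[0]
    return t[:i] + (replace(t[i], path[1:], sub),) + t[i + 1:]

def crossover(a, b):
    """Graft a random subtree of b onto a random node of a."""
    pa, _ = random.choice(list(nodes(a)))
    _, sb = random.choice(list(nodes(b)))
    return replace(a, pa, sb)

def tree_mutate(t, random_subtree):
    """Tree mutation: replace a randomly picked subtree with a fresh one."""
    p, _ = random.choice(list(nodes(t)))
    return replace(t, p, random_subtree())
```

Note that nothing here checks BP-expression validity: a graft can, for instance, move the terminal T into an index position, which is exactly the invalid-offspring problem the constraints slide addresses.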
Using Data Values [Figure: a chooser selects between a branch predictor, indexed by branch PC and global history, and a data value predictor, indexed by data value history; branch execution updates both] • What are some of the problems with this approach?
Using Data Values: Problems • Uses either branch history or data values, but not both • The latency of prediction is too high • The data value predictor requires one or two serial table accesses • Plus execution of the branch instruction
Experimentation - initial • Use interference-free tables and a fully populated REP, with an entry for each PC, global history, value, and count combination • Values artificially “aged” by throwing away the n most recent values, making the branch count n+1 • Compare with gselect • Run with 5 of the less predictable SPECint95 apps: compress, gcc, go, ijpeg, li • Vary the number of difference values stored, from 1 to 3
Results - initial • BDP outperforms gselect • The best gain comes from using a single branch difference; adding the second and third gives little improvement • The older the branch difference, the worse the prediction, but the degradation is slow • The effect on individual branches varies, but on average BDP does better, with very few exceptions