Merging Path, Global and Local Indexing in Perceptron Branch Prediction. David Tarjan. Published in: An Ahead Pipelined Alloyed Perceptron with Single Cycle Access Time, D. Tarjan, K. Skadron and M. Stan, Workshop on Complexity Effective Design (WCED), June 2004
Merging Path, Global and Local Indexing in Perceptron Branch Prediction
David Tarjan
Published in:
• An Ahead Pipelined Alloyed Perceptron with Single Cycle Access Time. D. Tarjan, K. Skadron and M. Stan. Workshop on Complexity Effective Design (WCED), June 2004
• Merging path and gshare indexing in perceptron branch prediction. D. Tarjan and K. Skadron. ACM Transactions on Architecture and Code Optimization, 2(3), Sep. 2005
Why Yet Another Branch Predictor?
• Single-thread performance growth stalling
• Pipeline length still slowly increasing
• Buffer sizes also increasing
• No more “free” clock scaling
• Power budget goes to more cores
-> If we want more single-thread performance, we have to go for efficiency!
Outline
• What is a perceptron?
• Ahead-pipelining
• Precomputing local sums
• Results for ahead pipelined alloyed perceptron
• Hashed Indexing
• Results of a hashed perceptron
• Conclusion
Main Contributions
• Reduced latency of perceptron predictors to one cycle
• Showed how to reduce the number of weights/adders by a factor of N (N = 6 to 12) for a given history length
• Reduced mispredictions by up to 27.2% over the path-based perceptron
Main Problems in Branch Prediction:
• Accuracy (larger tables, more logic)
• Latency (smaller tables, less logic)
• Multiple Branch/Trace/Stream/etc. per cycle
-> Accuracy and latency are a tradeoff! This work addresses these two points.
Ahead pipelined perceptron
p_addr(x) = addr(x) + direction(x)
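The slide gives only the formula, so the following is a minimal sketch of one literal reading of it, not the authors' implementation. It assumes that direction(x) is 1 for taken and 0 for not taken, that "+" is plain integer addition, and that the resulting pseudo-address is folded into a weight table by a modulo. The names `weight_row` and `perceptron_output` are hypothetical helpers; the sum itself is the standard perceptron output (bias plus signed weights).

```python
from typing import Sequence


def pseudo_address(addr: int, taken: bool) -> int:
    """Literal reading of the slide: p_addr(x) = addr(x) + direction(x).

    Assumption: direction is 1 for taken, 0 for not taken.
    """
    return addr + (1 if taken else 0)


def weight_row(p_addr: int, rows: int) -> int:
    """Hypothetical table index: fold the pseudo-address into `rows` entries."""
    return p_addr % rows


def perceptron_output(bias: int, weights: Sequence[int], history: Sequence[bool]) -> int:
    """Standard perceptron sum: bias + sum(w_i * x_i), x_i = +1 (taken) or -1."""
    return bias + sum(w if h else -w for w, h in zip(weights, history))


def predict_taken(output: int) -> bool:
    """Predict taken when the sum is non-negative."""
    return output >= 0
```

In this reading, the same branch address maps to two neighbouring pseudo-addresses depending on its outcome, so the outcome is already encoded in which weight row is read.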
Hashed Perceptron: Motivation
• Want longer history for accuracy
• But that means more adders
• Also need more bits per weight for very long history
• With ahead-pipelining, we effectively have two bits of history for each weight…
• But more ahead-pipelining means fewer address bits to select a weight…
We have been here before!
Benefits?
• We can reduce the number of tables and adders by a factor of n, where n is the number of history bits per table
• We can accurately predict linearly inseparable branches (e.g. two branches with an XOR pattern)
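Both benefits can be illustrated with a small sketch. It is a toy model under stated assumptions, not the predictor from the papers: the branch address is left out of the index, the "hash" is just the two history bits concatenated, weights are unbounded integers, and training happens only on a misprediction (real perceptron predictors also train when the sum is below a threshold). All function names are hypothetical.

```python
from itertools import product
from math import ceil

# Outcome is the XOR of two history bits: not linearly separable.
SAMPLES = [((a, b), bool(a ^ b)) for a, b in product((0, 1), repeat=2)]


def train_per_bit(samples, epochs=20):
    """Classic perceptron: one weight per history bit, plus a bias weight."""
    w = [0, 0, 0]
    for _ in range(epochs):
        for (a, b), taken in samples:
            x = [1, 1 if a else -1, 1 if b else -1]
            y = sum(wi * xi for wi, xi in zip(w, x))
            if (y >= 0) != taken:  # simplified: train only on a misprediction
                t = 1 if taken else -1
                w = [wi + t * xi for wi, xi in zip(w, x)]
    return w


def correct_per_bit(w, samples):
    """Count how many of the samples the per-bit perceptron predicts correctly."""
    hits = 0
    for (a, b), taken in samples:
        y = w[0] + w[1] * (1 if a else -1) + w[2] * (1 if b else -1)
        hits += (y >= 0) == taken
    return hits


def train_hashed(samples, epochs=20):
    """One weight per 2-bit history pattern, selected by index instead of summed."""
    table = [0] * 4
    for _ in range(epochs):
        for (a, b), taken in samples:
            idx = (a << 1) | b  # assumption: the "hash" is plain concatenation
            if (table[idx] >= 0) != taken:
                table[idx] += 1 if taken else -1
    return table


def correct_hashed(table, samples):
    """Count how many of the samples the indexed table predicts correctly."""
    return sum((table[(a << 1) | b] >= 0) == taken for (a, b), taken in samples)


def tables_needed(history_len, bits_per_table):
    """Tables (and adders) needed when each table covers several history bits."""
    return ceil(history_len / bits_per_table)
```

The indexed table gets all four XOR cases right, while no setting of the three per-bit weights can, because the two taken cases and the two not-taken cases impose contradictory constraints on the bias. `tables_needed(64, 8)` gives 8 tables where a per-bit design would need 64 weights and adders for the same history length.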
Related Work
• O-GEHL: Optimized GEometric History Length Predictor [Seznec2004]
• gDAC: global Divide And Conquer [Loh2005]
• PWLB: Piecewise Linear Branch Predictor [Jimenez2004]
• TAGE: TAgged GEometric History Branch Predictor [Seznec&Michaud2006]
Conclusion
• A perceptron predictor can be built with single-cycle latency
• Assigning multiple history bits to a single weight helps both accuracy and power
• More accuracy is only good with low latency
Q & A
Why Yet Another Branch Predictor? (ca. 2003)
[Figure: pipeline stage diagrams]
• P5 Microarchitecture: PREF, DEC, DEC, EXEC, WB
• P6 Microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
• NetBurst Microarchitecture (Willamette): TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive
• NetBurst Microarchitecture (Prescott): ~30 stages ???
Graphics from Prof. Hsien-Hsin Sean Lee's presentation on the Pentium Pro/Pentium 4 microarchitecture