Improving Branch Prediction by Dynamic Dataflow-based Identification of Correlated Branches from a Large Global History
Renju Thomas, Manoj Franklin, Chris Wilkerson, and Jared Stark
Presenter: Xiaoxiao Wang
Agenda
• Motivation and Related Work
• Identifying Affector Branches at Run-time
• Building Predictors Using Affector Information
• Experimental Results
• Conclusion
Motivation and Related Work
• Processor pipelines keep growing deeper, so the branch misprediction penalty will become very high [18].
• Increasing predictor size greatly improves the accuracy of small predictors, but yields little gain for predictors that are already large [1][5][12][13][16][19].
• Larger predictors increase prediction delay [2][8][16].
• Future transistor budgets permit a larger area for branch predictors [4][16].
How to Improve Prediction Rate?
• Not all branches in a long history are correlated with the branch under prediction [11][20][21], so the history should be used more selectively.
• Two primary reasons why branches are correlated [6]:
  1) Affector: the preceding branch's outcome affects the computation that determines the outcome of the succeeding branch.
  2) Forerunner: the computations that determine the two outcomes are (fully or partially) based on the same (or related) information.
• Approach: identify the correlated branches within a large global history.
Identifying Affector Branches at Run-time
• B8 is the branch to be predicted.
• Latest 5 branches: {B0 B2 B3 B5 B7} (TNTTN)
• Affector blocks for B8: {BB2 BB3 BB7} => affector branches for B8: {B0, B2, B5}
• Affector Branch Bitmap for B8 is 11010.
• Track the run-time dataflow to determine the affector branches for the latest update of each architectural register.
[Figure: control-flow graph with basic blocks BB0, BB2, BB3, BB5, BB6, BB7, BB8 and branches B0, B2, B3, B5, B7, B8; edges are labeled T (taken) and N (not taken). Instructions shown: BB0: R1=R2; BB2: R1=R2+4; BB3: R2=R1+R2; BB7: R3=R4+4; B8: if R2==R3.]
Affector Register File (ARF) Structure
• Keep a separate record of affector information for each architectural register, stored as one entry in the ARF.
[Figure: ARF with entries 0, 1, 2, ..., 30, 31, each holding an affector bitmap; example entry: 1 0 0 0 0 0 1 1 0 1 0]
Affector Branch Bitmap (ABB) Generation Algorithm
• Principle 1: When the processor encounters a conditional branch, all entries in the ARF are shifted left by one bit, and a 0 is filled into the LSB.
• Principle 2: When the processor encounters a register-writing instruction, the ARF entries of its source registers are read, ORed together, and written to the ARF entry of the destination register with the LSB set to 1.
• Principle 3: When the processor encounters a conditional branch instruction, the ARF entries of its source registers are read and ORed together to generate the ABB.
Affector Branch Bitmap (ABB) Generation Algorithm
Instruction sequence: I0: R1=R2 (BB0), B0; I2: R1=R2+4 (BB2), B2; I3: R2=R1+R2 (BB3), B3; B5; I7: R3=R4+4 (BB7), B7; B8: if R2==R3 (BB8).
ARF contents (X = unspecified bit):
Register   After I2    After B2 (Princ1)   After I3 (Princ2)   After B7
R0         X X X X 0   X X X 0 0           X X X 0 0           0 0 0 0 0
R1         X X X X 1   X X X 1 0           X X X 1 0           1 0 0 0 0
R2         X X X X 0   X X X 0 0           X X X 1 1           1 1 0 0 0
R3         X X X X 0   X X X 0 0           X X X 0 0           0 0 0 1 0
ABB for B8 (Princ3): R2 OR R3 = 1 1 0 1 0
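The three principles can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware design: the class and method names are hypothetical, the 5-bit width is chosen only to match the example, and unspecified (X) bits are modeled as 0. Replaying the instruction sequence of the example reproduces the bitmap 11010 for B8.

```python
WIDTH = 5                  # history bits per ARF entry (assumed, matches the example)
MASK = (1 << WIDTH) - 1


class AffectorRegisterFile:
    """Toy model of the ARF: one affector bitmap per architectural register."""

    def __init__(self, num_regs=32):
        self.entries = [0] * num_regs   # X bits are modeled as 0 (assumption)

    def on_branch(self):
        # Principle 1: shift every entry left by one bit, filling a 0.
        self.entries = [(e << 1) & MASK for e in self.entries]

    def on_write(self, dest, sources):
        # Principle 2: OR the source entries and set the LSB, which marks
        # the most recent branch as an affector of the new value.
        bitmap = 1
        for s in sources:
            bitmap |= self.entries[s]
        self.entries[dest] = bitmap & MASK

    def abb(self, sources):
        # Principle 3: OR the entries of the branch's source registers.
        bitmap = 0
        for s in sources:
            bitmap |= self.entries[s]
        return bitmap


arf = AffectorRegisterFile()
arf.on_write(1, [2])       # I0: R1 = R2
arf.on_branch()            # B0
arf.on_write(1, [2])       # I2: R1 = R2 + 4
arf.on_branch()            # B2
arf.on_write(2, [1, 2])    # I3: R2 = R1 + R2
arf.on_branch()            # B3
arf.on_branch()            # B5
arf.on_write(3, [4])       # I7: R3 = R4 + 4
arf.on_branch()            # B7
result = format(arf.abb([2, 3]), "05b")   # B8: if R2 == R3
print(result)              # 11010
```

Reading the result from left to right against the latest five branches {B0 B2 B3 B5 B7} marks B0, B2, and B5 as affectors, in agreement with the example.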
Misprediction Recovery
• Principle 4: When a branch misprediction is detected, the speculative updates made to the ARF after the mispredicted branch should be shifted out.
[Figure: Princ4 example, shift right by 4 bits: the entry 1 0 0 0 0 0 1 1 0 1 0, with the mispredicted branch marked, becomes X X X X 1 0 0 0 0 0 1]
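A minimal sketch of the shift-out step, assuming the number of speculative bits to discard is known at recovery time. The function name and its parameters are hypothetical, and the vacated high bits (X in the figure) are filled with 0 here. The sketch only undoes the shifts; it does not model how entries overwritten by wrong-path register writes would be restored, which the slide does not describe.

```python
def recover(entries, speculative_bits, width=11):
    """Discard the speculative low-order bits of every ARF entry."""
    mask = (1 << width) - 1
    # Each speculative branch shifted every entry left once, so shifting
    # right by the same count drops those bits. The vacated high bits are
    # unknown in the figure; this sketch fills them with 0.
    return [(e >> speculative_bits) & mask for e in entries]


entry = 0b10000011010                  # entry from the figure
restored = recover([entry], 4)[0]      # shift right by 4 bits
print(format(restored, "011b"))        # 00001000001
```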
Building Predictors Using Affector Information: Zeroing Scheme
• Turning off non-affector bits and hashing: the Zeroing scheme.
• All non-affector bits in the long global history are masked to zero by ANDing the long global history with the branch's ABB.
• The result is hashed down to the required number of bits using a fold-and-XOR hash.
• The identified affectors are retained in their respective positions.
Global History:   1 0 1 1 0 1 1 1 0 1 1 … 1 0 1
Affector Bitmap:  0 0 0 0 1 1 0 1 1 0 0 … 0 0 1
Mask (AND):       0 0 0 0 0 1 0 1 0 0 0 … 0 0 1
Fold, XOR => Predictor Look Up Index
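The Zeroing scheme can be sketched as a mask followed by a fold. The exact hash is not specified on the slide, so the fold below (XOR of successive fixed-width chunks) is an assumption, as are the function names and the 4-bit index width. The inputs are only the 11 leading bits shown in the figure, not the full history.

```python
def fold_xor(value, index_bits):
    """Hash an integer down to index_bits by XORing successive chunks."""
    mask = (1 << index_bits) - 1
    index = 0
    while value:
        index ^= value & mask      # fold in the next chunk
        value >>= index_bits
    return index


def zeroing_index(global_history, abb, index_bits):
    # Zero the non-affector bits; affector bits keep their positions.
    return fold_xor(global_history & abb, index_bits)


history = 0b10110111011            # leading bits of the figure's global history
abb = 0b00001101100                # leading bits of the figure's affector bitmap
masked = history & abb
print(format(masked, "011b"))      # 00000101000
idx = zeroing_index(history, abb, 4)
```

Because the affectors stay in place, two branches with the same affector outcomes at different history depths fold to different indices.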
Building Predictors Using Affector Information: Packing Scheme
• Removing non-affector bits, packing, and hashing: the Packing scheme.
• The non-affector bits are removed altogether.
• The result is hashed down to the required number of bits using a fold-and-XOR hash.
• The identified affectors are not retained in their respective positions.
Global History:   1 0 1 1 0 1 1 1 0 1 1 … 1 0 1
Affector Bitmap:  0 0 0 0 1 1 0 1 1 0 0 … 0 0 1
Mask (AND):       0 0 0 0 0 1 0 1 0 0 0 … 0 0 1
Pack:             0 1 1 0 … 1
Fold, XOR => Predictor Look Up Index
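A minimal sketch of the Packing scheme, under the same assumptions as before: the function names are hypothetical, the fold is an assumed XOR of fixed-width chunks, and only the 11 leading bits from the figure are used. Packing gathers the history bits at the affector positions into adjacent bits, so their original positions are lost.

```python
def pack_affectors(global_history, abb, width):
    """Collect the history bits at affector positions into adjacent bits."""
    packed = 0
    for pos in range(width - 1, -1, -1):       # scan from the MSB down
        if (abb >> pos) & 1:                   # affector position
            packed = (packed << 1) | ((global_history >> pos) & 1)
    return packed


def fold_xor(value, index_bits):
    """Hash an integer down to index_bits by XORing successive chunks."""
    mask = (1 << index_bits) - 1
    index = 0
    while value:
        index ^= value & mask
        value >>= index_bits
    return index


history = 0b10110111011            # leading bits of the figure's global history
abb = 0b00001101100                # leading bits of the figure's affector bitmap
packed = pack_affectors(history, abb, 11)
print(format(packed, "04b"))       # 0110
idx = fold_xor(packed, 4)
```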
Proposed Predictor Organization
[Figure: four-stage prediction pipeline (Stage 1 to Stage 4). Labeled elements: Instruction, Line Predictor (One Cycle Prediction), Primary Global Predictor (Perceptron or YAGS) producing the Primary Prediction, Global History, Read ARF, Hash, Read Corrector Predictor (Rare Event Prediction), Compare Tag, Hit, Corrector Prediction.]
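The "Compare Tag" and "Hit" labels suggest a tagged corrector table whose prediction is used when the tag matches. The selection step might look like the sketch below. The override-on-hit policy, the dictionary standing in for the table, and all names and values are assumptions made for illustration; the slide does not spell them out.

```python
def choose_prediction(primary_taken, corrector_table, index, tag):
    """Use the corrector's prediction on a tag hit, else the primary's."""
    entry = corrector_table.get(index)      # dict stands in for the table
    if entry is not None and entry[0] == tag:
        return entry[1]                     # corrector hit
    return primary_taken                    # corrector miss


# Hypothetical table with one entry: index 10 -> (tag 0x3A, predict not taken)
table = {10: (0x3A, False)}
hit = choose_prediction(True, table, 10, 0x3A)    # tag matches: corrector wins
miss = choose_prediction(True, table, 10, 0x11)   # tag differs: primary is used
```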
Experiment Setup
• SimpleScalar v3.0 using the Alpha ISA
• 12 benchmarks from the SPEC95 and SPEC2000 integer benchmark suites
Experimental Evaluation
Figure 1. Misprediction results for the Zeroing and Packing techniques for our corrector predictor along with (i) a perceptron primary predictor; (ii) a YAGS primary predictor.
Experimental Evaluation
Figure 2. (i) Performance of a modeled superscalar for various branch corrector predictor schemes. (ii) Per-benchmark misprediction rates for the corresponding corrector predictors.
Conclusion
• The hard-to-predict branches of a primary global predictor are predicted by a very accurate corrector predictor, at the cost of one or two cycles of additional latency.
• A technique is proposed that lets this corrector predictor use a long global history by identifying the correlated branches in that history from run-time dataflow information.
• Two prediction schemes, Zeroing and Packing, are proposed.
• Adding an 8KB affector-history-based corrector predictor to a 16KB perceptron primary predictor decreases the average misprediction rate over 12 benchmarks from 6.3% to 5.7%.