Revisiting the perceptron predictor André Seznec IRISA/ INRIA
Perceptron-based branch prediction, Jimenez and Lin, HPCA 2001 • A radically new approach to branch prediction • Associate a set of 8-bit counters, or weights, with a branch address • Use the global history vector as the input vector (+1, -1) • Multiply each weight by its input, accumulate, and use the sign of the sum as the prediction • Selective update: • increment/decrement the weights on a misprediction • or when |Sum| is lower than a threshold
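The scheme above can be sketched in a few lines. This is my own minimal illustration, not the paper's code: the table size and history length are arbitrary, and the threshold formula is one published empirical tuning.

```python
# Minimal sketch of a Jimenez/Lin-style perceptron predictor.
# HIST_LEN, N_ENTRIES are illustrative choices, not the paper's values.
HIST_LEN = 16                        # global history length (assumption)
N_ENTRIES = 256                      # perceptron table size (assumption)
THRESHOLD = 1.93 * HIST_LEN + 14     # an often-cited empirical tuning

def clamp(x, lo=-128, hi=127):       # 8-bit saturating weights
    return max(lo, min(hi, x))

# table[i] is a weight vector; w[0] is the bias weight
table = [[0] * (HIST_LEN + 1) for _ in range(N_ENTRIES)]

def predict(pc, ghr):
    """ghr: list of past outcomes encoded +1 (taken) / -1 (not taken)."""
    w = table[pc % N_ENTRIES]
    s = w[0] + sum(wi * hi for wi, hi in zip(w[1:], ghr))
    return s >= 0, s                 # sign of the sum = prediction

def train(pc, ghr, taken, s):
    t = 1 if taken else -1
    # selective update: only on a misprediction or a weak sum
    if (s >= 0) != taken or abs(s) <= THRESHOLD:
        w = table[pc % N_ENTRIES]
        w[0] = clamp(w[0] + t)
        for i, hi in enumerate(ghr):
            w[i + 1] = clamp(w[i + 1] + t * hi)
```

Note that the loop in `train` is exactly the 256-wide counter-update problem the later slides attack.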
Perceptron predictor [figure: history inputs multiplied (X) by the weights, accumulated (∑); Sign = prediction]
Perceptron prediction works • + Complexity linear in the history length: • can capture correlation over a very long history • - But: • long latency: the multiply-accumulate tree! • inherently unable to discriminate between two histories that are not « linearly separable »: • with 2 weights and 2 history bits, h0 ⊕ h1 cannot be learned! Can we do better?
Use a redundant history • Insert several bits per branch in the history to enhance linear separability: h0, h0 ⊕ h1, h0 ⊕ h2, h0 ⊕ addr, …
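One possible way to build such an expanded input vector (my own construction for illustration; the talk does not specify the exact combination rule):

```python
# Sketch: expand raw history bits with XOR combinations so that
# functions like h0 XOR h1 become linearly separable for a perceptron.
def redundant_bits(h, addr_bit):
    """h: list of 0/1 history bits. Returns the expanded bit vector."""
    out = []
    for i, hi in enumerate(h):
        out.append(hi)                    # the raw bit itself
        if i + 1 < len(h):
            out.append(hi ^ h[i + 1])     # h_i xor h_{i+1}
        if i + 2 < len(h):
            out.append(hi ^ h[i + 2])     # h_i xor h_{i+2}
        out.append(hi ^ addr_bit)         # h_i xor an address bit
    return out
```

The cost is obvious: a 64-bit raw history blows up to hundreds of input bits, which motivates the next slide.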
Redundant history perceptron • + Significant misprediction reduction: • > 30 % for 12 out of 20 benchmarks • - But 256 weights: • a 256-input multiply-add tree, 2048 bits wide!! • 256 counter updates!! • Latency? • Power consumption? • Logic complexity?
4 weights for 2 history bits = a single counter read • Inputs (0, h0, h1, h0 ⊕ h1) encoded as ±1, weights W0, W1, W2, W3 • Possible contributions to the branch prediction: • h=0 (0,0,0,0): C0 = -W0 - W1 - W2 - W3 • h=1 (0,1,0,1): C1 = -W0 + W1 - W2 + W3 • h=2 (0,0,1,1): C2 = -W0 - W1 + W2 + W3 • h=3 (0,1,1,0): C3 = -W0 + W1 + W2 - W3 • Update for h = 2 and Out = 1 (taken): • C2 += 4 • C0, C1 and C3 unchanged Let us store the multiply-accumulate contributions instead of the weights!!
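The key property, checkable in a few lines (my own verification sketch): because the four ±1 input vectors are mutually orthogonal, a perceptron update for one history value moves only that history's contribution, by exactly 4, and leaves the other three contributions untouched.

```python
# Inputs (x0, x1, x2, x3) for each 2-bit history h, encoded +1/-1
inputs = {
    0: (-1, -1, -1, -1),
    1: (-1, +1, -1, +1),
    2: (-1, -1, +1, +1),
    3: (-1, +1, +1, -1),
}

w = [3, -2, 5, 1]                    # arbitrary example weights
C = {h: sum(wi * xi for wi, xi in zip(w, x)) for h, x in inputs.items()}

# Perceptron update for history h=2, outcome t=+1: w_i += t * x_i
h, t = 2, +1
w2 = [wi + t * xi for wi, xi in zip(w, inputs[h])]
C2 = {k: sum(wi * xi for wi, xi in zip(w2, x)) for k, x in inputs.items()}

assert C2[h] == C[h] + 4 * t                  # accessed contribution moves by 4
assert all(C2[k] == C[k] for k in (0, 1, 3))  # orthogonality: others unchanged
```

So one counter read and one counter update per 2-bit block replace four weight reads and four weight updates.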
MAC contribution: 4-way redundant history • Represent each block of 4 history bits with its group of 16 weights • there are only 16 possible multiply-accumulate contributions associated with these 16 weights Store the multiply-accumulate contributions instead of the weights!!
Redundant history perceptron predictor with MAC contributions [figure: 4N history bits drive N 16×1 MUXes that select the stored contributions feeding the adder tree ∑; Sign = prediction]
Redundant history and MAC representation • Replace a 16-input multiply-add tree by a 16-to-1 MUX • Use saturated arithmetic: • the counter width can be reduced to 6 bits A 256-input 8-bit multiply-accumulate tree replaced by a 16-input 6-bit adder tree
Redundant history perceptron vs optimized 2bcgskew • Optimized 2bcgskew: 1 Mbit, 72-36-9-9 history lengths + lots of tricks • 768-Kbit redundant history perceptron • 20 benchmarks: SPEC 2000 + SPEC 95: fifty/fifty!! Perceptron and 2bcgskew do not capture exactly the same kind of correlation!!
Towards the best of both worlds ! Redundant history skewed perceptron predictor
Self-aliasing on a perceptron predictor 1. Consider two histories H and H' for a branch B differing only in their recent bits. If both behaviors are dictated by the same coinciding « old » history segment (e.g. bits 20-23), then the two histories alias on the same counter!! 2. Most of the correlation is captured by the recent history: most counters associated with « old » history are « wasted » 3. Let us enable the use of the whole spectrum of counters through multiple tables with different indices: « SKEWING »
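The skewing idea can be illustrated with toy index functions (my own sketch; this is not the actual e-gskew/RHSP hash family): each table mixes address and history differently, so two (address, history) pairs that collide in one table are unlikely to collide in the others.

```python
# Illustrative skewed indexing: a different history rotation per table
# before folding it onto the branch address. Parameters are assumptions.
def index(table_id, pc, hist, bits=12):
    mask = (1 << bits) - 1
    hmask = (1 << (2 * bits)) - 1
    h = hist & hmask
    # rotate the history by a table-specific amount
    rot = ((h >> table_id) | (h << (2 * bits - table_id))) & hmask
    # fold the rotated history onto the address
    return (pc ^ rot ^ (rot >> bits)) & mask
```

A pair aliasing in one table then still gets three un-aliased votes from the other tables.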
Redundant history skewed perceptron predictor [figure: 4 tables accessed with different indices feed the adder tree ∑]
Further leveraging long history • Some applications benefit from history lengths up to 128 bits; many do not!! • We don't want a wider adder tree • For a fixed history length, the number of paths that lead to a single branch varies considerably • less information in some history sections than in others: • repeating patterns « waste » space in the history Use a compressed form of history!
Further leveraging long history (2) • Replace repeating patterns (periods up to 5 bits) by narrower chains • 1.5-3× compression ratio on our benchmark set • Use half uncompressed history and half compressed history • Significant benefit (> 25 %) on several benchmarks; harmless for the others • Essentially captures all the correlation associated with local history
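One plausible reading of "replace repeating patterns by narrower chains" is run-collapsing, sketched below. This is my own reconstruction for illustration; the talk does not spell out the exact encoding.

```python
# Sketch: collapse back-to-back repetitions of short patterns
# (period up to max_period bits) down to a single copy, so a loop
# contributing "10101010..." to the history keeps only "10".
def compress(history, max_period=5):
    """history: string of '0'/'1' bits, oldest first."""
    out, i = [], 0
    while i < len(history):
        emitted = False
        for p in range(1, max_period + 1):
            pat = history[i:i + p]
            reps = 1
            # count how many times the pattern repeats back-to-back
            while history[i + reps * p : i + (reps + 1) * p] == pat:
                reps += 1
            if reps >= 2:
                out.append(pat)        # keep a single copy of the pattern
                i += reps * p
                emitted = True
                break
        if not emitted:
            out.append(history[i])
            i += 1
    return "".join(out)
```

On periodic histories this easily reaches the 1.5-3× range quoted above, while aperiodic histories pass through almost unchanged.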
Addressing the predictor latency Ahead pipelined redundant history perceptron predictor
The latency issue! • Single-cycle prediction would be needed, but: • 2-4 cycles for the table read • 2-4 cycles for the adder tree • Ahead-pipelined 2bcgskew, Seznec and Fraboulet, ISCA 2003: • on-the-fly information insertion into table indices • resolves mispredictions at execution time • Path-based perceptron, Jiménez, MICRO 2003: • « systolic-like » ahead-pipelined perceptron prediction • does not address the table read delay • resolves mispredictions at commit time, not at execution time
Ahead pipelining the RHSP: the challenges • Use X-block-ahead information to initiate the branch prediction: • the X-block-ahead address and global history • Use intermediate path information to ensure prediction accuracy • But in-flight insertion into table indices is not sufficient!?! • Need to checkpoint all the information required to recompute, on the fly, any possible prediction for the X-1 intermediate blocks • while avoiding a checkpoint volume explosion
Ahead-pipelined redundant history skewed perceptron predictor [figure: RHSP tables read with X-block-ahead information, sum over 14 counters; 5 one-block-ahead history bits select among 32 counters for the intermediate paths; ∑; Sign = prediction]
Ahead-pipelined redundant history skewed perceptron predictor • Partial sum using only X-block-ahead information • Discriminate only 32 possible intermediate paths: • the 32 associated counters are read • compute the 32 possible sums • select the prediction in the last cycle • Checkpoint the 32 possible predictions
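The split between the slow ahead-computed part and the fast last-cycle selection can be sketched as follows (my own simplification; counter widths and the 14/32 counter split of the real design are abstracted away):

```python
# Sketch of ahead-pipelined prediction: the long-latency work (table
# reads + most of the adder tree) starts X blocks early and produces
# one candidate sum per possible intermediate path; the late path bits
# only drive a 32:1 selection.
def start_prediction(base_counters, path_counters):
    """base_counters: counters indexed with X-block-ahead info only.
    path_counters: one extra counter per intermediate path (32 entries)."""
    base = sum(base_counters)                  # long-latency partial sum
    return [base + c for c in path_counters]   # 32 candidate sums

def finish_prediction(candidates, path_bits):
    """path_bits: 5 bits identifying the actual intermediate path,
    known only on the last cycle; selection is just a MUX."""
    return candidates[path_bits] >= 0          # sign = prediction
```

The checkpoint cost is then the 32 candidate predictions, not the full predictor state.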
Ahead-pipelined RHSP • Very limited loss of accuracy for 6-block-ahead prediction: • 5 one-block-ahead history bits are sufficient to discriminate among all the intermediate paths • The loss of accuracy increases with the ahead distance: • we no longer discriminate between all the paths • the number of paths originating from the same X-block-ahead block explodes: • fewer and fewer predictions are performed by the low-order counters
Summary • Perceptron-based prediction improved: • prediction accuracy: • use of redundant history • introduction of skewing • introduction of history compression • MAC representation: • a 16-input 6-bit adder tree instead of a 256-input 8-bit multiply-accumulate tree • X-block-ahead RHSP: • on-time prediction without sacrificing accuracy • misprediction resolution at the execution stage
Wide possible design space • To meet implementation constraints, the designer can play with: • the number of tables • the history widths • the compressed/uncompressed ratio • the threshold and counter width: • half threshold / 5-bit counters is not so bad • other MAC representations: • 8 counters for 3 bits, 32 counters for 5 bits • …
Bonus: an « objective » comparison of RHSP and 2bcgskew by their (common) inventor
2bc-gskew: logical view [figure: e-gskew-style structure]
Optimized 2bcgskew • All optimizations of the EV8 predictor: • different history lengths for all tables • different hysteresis and prediction table sizes • + a few other tricks: • sharing prediction and hysteresis tables through banking • randomly enforcing the flipping of counters on mispredictions to avoid ping-pong phenomena • No « guru »-designed hash functions: just good functions • 2**(N+11)-bit predictor; (N, N, 4N, 8N) history lengths • (4, 4, 16, 32) for 32 Kbits • (9, 9, 36, 72) for 1 Mbit
2bcgskew vs RHSP (1) Efficiency of the prediction scheme: • Both can use a very long history: • extra local-history prediction brings very little benefit • not aware of any other predictor handling such long histories • RHSP better tolerates/accommodates compressed history • RHSP captures some extra correlation Efficiency of the storage usage (small predictors, e.g. 32 Kbits): • 2bcgskew more efficient on a few demanding benchmarks: go, gcc95 • RHSP surprisingly efficient on most benchmarks
2bcgskew vs RHSP (2) Accesses to the predictor: • Up to three accesses per branch on RHSP • but not that many accesses on correct predictions • A single prediction access and a single hysteresis access on correct predictions for 2bcgskew
2bcgskew vs RHSP (3) • Hardware logic cost: • adder tree + counter updates for RHSP • hashing functions + small logic for 2bcgskew • Latency: • table read + adder tree for RHSP • table read + a few gates for 2bcgskew