Revisiting the perceptron predictor André Seznec IRISA/ INRIA
Perceptron-based branch prediction, Jimenez and Lin, HPCA 2001 • A radically new approach to branch prediction • Associate a set of 8-bit counters, or weights, with a branch address • Use the global history vector as the input vector (+1, -1) • Multiply each weight by its input, accumulate, and use the sign of the sum as the prediction • Selective update: • increment/decrement the weights on a misprediction • or when |Sum| is lower than a threshold
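The scheme above can be sketched in a few lines. This is my own minimal illustration, not the paper's code: the table size and history length are arbitrary, and the threshold formula is one published empirical tuning.

```python
# Minimal sketch of a Jimenez/Lin-style perceptron predictor.
# HIST_LEN, N_ENTRIES are illustrative choices, not the paper's values.
HIST_LEN = 16                        # global history length (assumption)
N_ENTRIES = 256                      # perceptron table size (assumption)
THRESHOLD = 1.93 * HIST_LEN + 14     # an often-cited empirical tuning

def clamp(x, lo=-128, hi=127):       # 8-bit saturating weights
    return max(lo, min(hi, x))

# table[i] is a weight vector; w[0] is the bias weight
table = [[0] * (HIST_LEN + 1) for _ in range(N_ENTRIES)]

def predict(pc, ghr):
    """ghr: list of past outcomes encoded +1 (taken) / -1 (not taken)."""
    w = table[pc % N_ENTRIES]
    s = w[0] + sum(wi * hi for wi, hi in zip(w[1:], ghr))
    return s >= 0, s                 # sign of the sum = prediction

def train(pc, ghr, taken, s):
    t = 1 if taken else -1
    # selective update: only on a misprediction or a weak sum
    if (s >= 0) != taken or abs(s) <= THRESHOLD:
        w = table[pc % N_ENTRIES]
        w[0] = clamp(w[0] + t)
        for i, hi in enumerate(ghr):
            w[i + 1] = clamp(w[i + 1] + t * hi)
```

Note that the loop in `train` is exactly the 256-wide counter-update problem the later slides attack.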
Perceptron predictor [figure: history inputs multiplied (X) by the weights, accumulated (∑); Sign = prediction]
Perceptron prediction works • + Complexity linear in the history length: • can capture correlation over a very long history • - But: • long latency: the multiply-accumulate tree! • inherently unable to discriminate between two histories that are not « linearly separable »: • with 2 weights and 2 history bits, h0 ⊕ h1 cannot be learned! Can we do better?
Use a redundant history • Insert several bits per branch in the history to enhance linear separability: h0, h0 ⊕ h1, h0 ⊕ h2, h0 ⊕ addr, …
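One possible way to build such an expanded input vector (my own construction for illustration; the talk does not specify the exact combination rule):

```python
# Sketch: expand raw history bits with XOR combinations so that
# functions like h0 XOR h1 become linearly separable for a perceptron.
def redundant_bits(h, addr_bit):
    """h: list of 0/1 history bits. Returns the expanded bit vector."""
    out = []
    for i, hi in enumerate(h):
        out.append(hi)                    # the raw bit itself
        if i + 1 < len(h):
            out.append(hi ^ h[i + 1])     # h_i xor h_{i+1}
        if i + 2 < len(h):
            out.append(hi ^ h[i + 2])     # h_i xor h_{i+2}
        out.append(hi ^ addr_bit)         # h_i xor an address bit
    return out
```

The cost is obvious: a 64-bit raw history blows up to hundreds of input bits, which motivates the next slide.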
Redundant history perceptron • + Significant misprediction reduction: • > 30 % for 12 out of 20 benchmarks • - But 256 weights: • a 256-input multiply-add tree, 2048 bits wide!! • 256 counter updates!! • Latency? • Power consumption? • Logic complexity?
4 weights for 2 history bits = a single counter read • Inputs (0, h0, h1, h0 ⊕ h1) encoded as ±1, weights W0, W1, W2, W3 • Possible contributions to the branch prediction: • h=0 (0,0,0,0): C0 = -W0 - W1 - W2 - W3 • h=1 (0,1,0,1): C1 = -W0 + W1 - W2 + W3 • h=2 (0,0,1,1): C2 = -W0 - W1 + W2 + W3 • h=3 (0,1,1,0): C3 = -W0 + W1 + W2 - W3 • Update for h = 2 and Out = 1 (taken): • C2 += 4 • C0, C1 and C3 unchanged Let us store the multiply-accumulate contributions instead of the weights!!
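The key property, checkable in a few lines (my own verification sketch): because the four ±1 input vectors are mutually orthogonal, a perceptron update for one history value moves only that history's contribution, by exactly 4, and leaves the other three contributions untouched.

```python
# Inputs (x0, x1, x2, x3) for each 2-bit history h, encoded +1/-1
inputs = {
    0: (-1, -1, -1, -1),
    1: (-1, +1, -1, +1),
    2: (-1, -1, +1, +1),
    3: (-1, +1, +1, -1),
}

w = [3, -2, 5, 1]                    # arbitrary example weights
C = {h: sum(wi * xi for wi, xi in zip(w, x)) for h, x in inputs.items()}

# Perceptron update for history h=2, outcome t=+1: w_i += t * x_i
h, t = 2, +1
w2 = [wi + t * xi for wi, xi in zip(w, inputs[h])]
C2 = {k: sum(wi * xi for wi, xi in zip(w2, x)) for k, x in inputs.items()}

assert C2[h] == C[h] + 4 * t                  # accessed contribution moves by 4
assert all(C2[k] == C[k] for k in (0, 1, 3))  # orthogonality: others unchanged
```

So one counter read and one counter update per 2-bit block replace four weight reads and four weight updates.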
MAC contribution: 4-way redundant history • Represent each block of 4 history bits with its group of 16 weights • there are only 16 possible multiply-accumulate contributions associated with these 16 weights Store the multiply-accumulate contributions instead of the weights!!
Redundant history perceptron predictor with MAC contributions [figure: 4N history bits drive N 16×1 MUXes that select the stored contributions feeding the adder tree ∑; Sign = prediction]
Redundant history and MAC representation • Replace a 16-input multiply-add tree by a 16-to-1 MUX • Use saturated arithmetic: • the counter width can be reduced to 6 bits A 256-input 8-bit multiply-accumulate tree replaced by a 16-input 6-bit adder tree
Redundant history perceptron vs optimized 2bcgskew • Optimized 2bcgskew: 1 Mbit, 72-36-9-9 history lengths + lots of tricks • 768-Kbit redundant history perceptron • 20 benchmarks: SPEC 2000 + SPEC 95: fifty/fifty!! Perceptron and 2bcgskew do not capture exactly the same kind of correlation!!
Towards the best of both worlds ! Redundant history skewed perceptron predictor
Self-aliasing on a perceptron predictor 1. Consider two histories H and H' for a branch B differing only in their recent bits. If both behaviors are dictated by the same coinciding « old » history segment (e.g. bits 20-23), then the two histories alias on the same counter!! 2. Most of the correlation is captured by the recent history: most counters associated with « old » history are « wasted » 3. Let us enable the use of the whole spectrum of counters through multiple tables with different indices: « SKEWING »
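The skewing idea can be illustrated with toy index functions (my own sketch; this is not the actual e-gskew/RHSP hash family): each table mixes address and history differently, so two (address, history) pairs that collide in one table are unlikely to collide in the others.

```python
# Illustrative skewed indexing: a different history rotation per table
# before folding it onto the branch address. Parameters are assumptions.
def index(table_id, pc, hist, bits=12):
    mask = (1 << bits) - 1
    hmask = (1 << (2 * bits)) - 1
    h = hist & hmask
    # rotate the history by a table-specific amount
    rot = ((h >> table_id) | (h << (2 * bits - table_id))) & hmask
    # fold the rotated history onto the address
    return (pc ^ rot ^ (rot >> bits)) & mask
```

A pair aliasing in one table then still gets three un-aliased votes from the other tables.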
Redundant history skewed perceptron predictor [figure: 4 tables accessed with different indices feed the adder tree ∑]
Further leveraging long history • Some applications benefit from history lengths up to 128 bits; many do not!! • We don't want a wider adder tree • For a fixed history length, the number of paths that lead to a single branch varies considerably • less information in some history sections than in others: • repeating patterns « waste » space in the history Use a compressed form of history!
Further leveraging long history (2) • Replace repeating patterns (periods up to 5 bits) by narrower chains • 1.5-3× compression ratio on our benchmark set • Use half uncompressed history and half compressed history • Significant benefit (> 25 %) on several benchmarks; harmless for the others • Essentially captures all the correlation associated with local history
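One plausible reading of "replace repeating patterns by narrower chains" is run-collapsing, sketched below. This is my own reconstruction for illustration; the talk does not spell out the exact encoding.

```python
# Sketch: collapse back-to-back repetitions of short patterns
# (period up to max_period bits) down to a single copy, so a loop
# contributing "10101010..." to the history keeps only "10".
def compress(history, max_period=5):
    """history: string of '0'/'1' bits, oldest first."""
    out, i = [], 0
    while i < len(history):
        emitted = False
        for p in range(1, max_period + 1):
            pat = history[i:i + p]
            reps = 1
            # count how many times the pattern repeats back-to-back
            while history[i + reps * p : i + (reps + 1) * p] == pat:
                reps += 1
            if reps >= 2:
                out.append(pat)        # keep a single copy of the pattern
                i += reps * p
                emitted = True
                break
        if not emitted:
            out.append(history[i])
            i += 1
    return "".join(out)
```

On periodic histories this easily reaches the 1.5-3× range quoted above, while aperiodic histories pass through almost unchanged.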
Addressing the predictor latency Ahead pipelined redundant history perceptron predictor
The latency issue! • Single-cycle prediction would be needed, but: • 2-4 cycles for the table read • 2-4 cycles for the adder tree • Ahead-pipelined 2bcgskew, Seznec and Fraboulet, ISCA 2003: • on-the-fly information insertion into table indices • resolves mispredictions at execution time • Path-based perceptron, Jiménez, MICRO 2003: • « systolic-like » ahead-pipelined perceptron prediction • does not address the table read delay • resolves mispredictions at commit time, not at execution time
Ahead pipelining the RHSP: the challenges • Use X-block-ahead information to initiate the branch prediction: • the X-block-ahead address and global history • Use intermediate path information to ensure prediction accuracy • But in-flight insertion into table indices is not sufficient!?! • Need to checkpoint all the information required to recompute, on the fly, any possible prediction for the X-1 intermediate blocks • while avoiding a checkpoint volume explosion
Ahead-pipelined redundant history skewed perceptron predictor [figure: RHSP tables read with X-block-ahead information, sum over 14 counters; 5 one-block-ahead history bits select among 32 counters for the intermediate paths; ∑; Sign = prediction]
Ahead-pipelined redundant history skewed perceptron predictor • Partial sum using only X-block-ahead information • Discriminate only 32 possible intermediate paths: • the 32 associated counters are read • compute the 32 possible sums • select the prediction in the last cycle • Checkpoint the 32 possible predictions
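The split between the slow ahead-computed part and the fast last-cycle selection can be sketched as follows (my own simplification; counter widths and the 14/32 counter split of the real design are abstracted away):

```python
# Sketch of ahead-pipelined prediction: the long-latency work (table
# reads + most of the adder tree) starts X blocks early and produces
# one candidate sum per possible intermediate path; the late path bits
# only drive a 32:1 selection.
def start_prediction(base_counters, path_counters):
    """base_counters: counters indexed with X-block-ahead info only.
    path_counters: one extra counter per intermediate path (32 entries)."""
    base = sum(base_counters)                  # long-latency partial sum
    return [base + c for c in path_counters]   # 32 candidate sums

def finish_prediction(candidates, path_bits):
    """path_bits: 5 bits identifying the actual intermediate path,
    known only on the last cycle; selection is just a MUX."""
    return candidates[path_bits] >= 0          # sign = prediction
```

The checkpoint cost is then the 32 candidate predictions, not the full predictor state.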
Ahead-pipelined RHSP • Very limited loss of accuracy for 6-block-ahead prediction: • 5 one-block-ahead history bits are sufficient to discriminate among all the intermediate paths • The loss of accuracy increases with the ahead distance: • we no longer discriminate between all the paths • the number of paths originating from the same X-block-ahead block explodes: • fewer and fewer predictions are performed by the low-order counters
Summary • Perceptron-based prediction improved: • prediction accuracy: • use of redundant history • introduction of skewing • introduction of history compression • MAC representation: • a 16-input 6-bit adder tree instead of a 256-input 8-bit multiply-accumulate tree • X-block-ahead RHSP: • on-time prediction without sacrificing accuracy • misprediction resolution at the execution stage
Wide possible design space • To meet implementation constraints, the designer can play with: • the number of tables • the history widths • the compressed/uncompressed ratio • the threshold and counter width: • half threshold / 5-bit counters is not so bad • other MAC representations: • 8 counters for 3 bits, 32 counters for 5 bits • …
Bonus: an « objective » comparison of RHSP and 2bcgskew by their (common) inventor
2bc-gskew: logical view [figure: e-gskew-style structure]
Optimized 2bcgskew • All optimizations of the EV8 predictor: • different history lengths for all tables • different hysteresis and prediction table sizes • + a few other tricks: • sharing prediction and hysteresis tables through banking • randomly enforcing the flipping of counters on mispredictions to avoid ping-pong phenomena • No « guru »-designed hash functions: just good functions • 2**(N+11)-bit predictor; (N, N, 4N, 8N) history lengths • (4, 4, 16, 32) for 32 Kbits • (9, 9, 36, 72) for 1 Mbit
2bcgskew vs RHSP (1) Efficiency of the prediction scheme: • Both can use a very long history: • extra local-history prediction brings very little benefit • not aware of any other predictor handling such long histories • RHSP better tolerates/accommodates compressed history • RHSP captures some extra correlation Efficiency of the storage usage (small predictors, e.g. 32 Kbits): • 2bcgskew more efficient on a few demanding benchmarks: go, gcc95 • RHSP surprisingly efficient on most benchmarks
2bcgskew vs RHSP (2) Accesses to the predictor: • Up to three accesses per branch on RHSP • but not that many accesses on correct predictions • A single prediction access and a single hysteresis access on correct predictions for 2bcgskew
2bcgskew vs RHSP (3) • Hardware logic cost: • adder tree + counter updates for RHSP • hashing functions + small logic for 2bcgskew • Latency: • table read + adder tree for RHSP • table read + a few gates for 2bcgskew