A 64 Kbytes ITTAGE indirect branch predictor. André Seznec INRIA/IRISA. Build on ITTAGE. ITTAGE: Introduced at the same time as TAGE (JILP 2006). Derived directly from the TAGE predictor: target prediction instead of direction prediction.
A 64 Kbytes ITTAGE indirect branch predictor André Seznec INRIA/IRISA
Build on ITTAGE • ITTAGE: • Introduced at the same time as TAGE (JILP 2006) • Derived directly from the TAGE predictor: • Target prediction instead of direction prediction
ITTAGE: multiple tables, global history predictor • The set of history lengths forms a geometric series • Capture correlation on very long histories • {0, 2, 4, 8, 16, 32, 64, 128} • Most of the storage for short history !! • What is important: L(i)-L(i-1) is drastically increasing
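A minimal sketch of the geometric series on this slide, assuming the usual TAGE-style rule L(i) = round(L(1) * ratio^(i-1)) with L(0) = 0 for the base predictor. The function name and its parameters are illustrative and do not come from the slides.

```python
def geometric_history_lengths(first, ratio, count):
    """Return [0, L(1), ..., L(count)] with L(i) = round(first * ratio**(i-1)).

    Assumption: L(0) = 0 stands for the tagless base predictor (no history).
    """
    lengths = [0]
    for i in range(1, count + 1):
        lengths.append(int(first * ratio ** (i - 1) + 0.5))
    return lengths


lengths = geometric_history_lengths(first=2, ratio=2.0, count=7)
# Gap between consecutive history lengths, i.e. L(i) - L(i-1).
gaps = [b - a for a, b in zip(lengths, lengths[1:])]

print(lengths)  # [0, 2, 4, 8, 16, 32, 64, 128]
print(gaps)     # [2, 2, 4, 8, 16, 32, 64]
```

With a ratio of 2 this reproduces the set shown on the slide, and the gaps grow along with the lengths, which is the property the slide calls important.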
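The ITTAGE predictor [Figure: a tagless base predictor indexed by pc, plus tagged tables indexed by pc and global history h[0:L1], h[0:L2], h[0:L3]; tag comparisons (=?) select the 32-bit target that becomes the prediction]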
Prediction computation • General case: • Longest matching component provides the prediction • Special case: • Many mispredictions on newly allocated entries: weak Ctr • Sometimes Altpred (slightly) more accurate than Pred • Property dynamically monitored through a single 4-bit counter • -2 % MPPKI
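The selection rule on this slide can be sketched as follows. This is an illustration under stated assumptions, not the competition code: a Ctr value of 0 is treated as "weak", the 4-bit monitor is modeled as an unsigned value in 0..15 with 8 as the switch-over threshold, and the names `Hit`, `choose_target` and `use_alt` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    target: int  # target stored in the matching entry
    ctr: int     # 2-bit hysteresis counter; 0 is treated as "weak" (assumption)


def choose_target(base_target: int, hits: List[Optional[Hit]], use_alt: int) -> int:
    """hits[i] is the tag-matching entry of tagged table i (None on a miss),
    ordered from shortest to longest history. use_alt is the 4-bit monitor."""
    matching = [h for h in hits if h is not None]
    if not matching:
        return base_target                      # only the base predictor answers
    pred = matching[-1]                         # longest matching component
    alt = matching[-2].target if len(matching) > 1 else base_target
    if pred.ctr == 0 and use_alt >= 8:          # weak entry and Altpred trusted more
        return alt
    return pred.target


hits = [Hit(0x1000, 2), None, Hit(0x2000, 0)]
print(hex(choose_target(0x500, hits, use_alt=3)))   # 0x2000
print(hex(choose_target(0x500, hits, use_alt=12)))  # 0x1000
```

The monitor itself would be trained on weak predictions where Pred and Altpred disagree, moving toward whichever of the two was correct.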
A tagged table entry • Ctr: 2-bit hysteresis counter • U: 1-bit useful counter • Was the entry recently useful ? • Tag: partial tag • Target: the target, 32 bits or some way to reconstruct it • Entry layout: Target | Tag | Ctr | U
Allocate entries on mispredictions • Allocate entries in longer history length tables • On tables with U unset • Set Ctr to Weak and U to 0 • HUGE STORAGE BUDGET: • Up to 3 entries allocated in different tables • Fast warming
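A minimal sketch of the allocation policy listed above, assuming tables are scanned in order of increasing history length starting just after the provider (the slides do not state the scan order). `Entry`, `allocate` and `MAX_ALLOC` are hypothetical names, and Ctr = 0 stands for "Weak".

```python
from dataclasses import dataclass


@dataclass
class Entry:
    tag: int = 0
    target: int = 0
    ctr: int = 0   # 0 is used as the "Weak" state in this sketch
    u: int = 0     # useful bit


MAX_ALLOC = 3  # "up to 3 entries allocated in different tables"


def allocate(tables, provider, indices, tags, target):
    """Allocate on a misprediction.

    tables[t] is a list of Entry; provider is the index of the table that
    gave the prediction (-1 for the base predictor); indices[t] and tags[t]
    are the index and partial tag computed for this branch in table t.
    Returns the list of tables in which an entry was allocated.
    """
    allocated = []
    for t in range(provider + 1, len(tables)):
        if len(allocated) == MAX_ALLOC:
            break
        if tables[t][indices[t]].u == 0:           # only steal entries with U unset
            tables[t][indices[t]] = Entry(tag=tags[t], target=target, ctr=0, u=0)
            allocated.append(t)
    return allocated


tables = [[Entry() for _ in range(4)] for _ in range(5)]
tables[2][1].u = 1                                  # this entry is protected
print(allocate(tables, provider=0, indices=[1] * 5, tags=[7] * 5, target=0x4000))
# [1, 3, 4]
```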
Managing the (U)seful bit • Set when the entry avoids a misprediction • (Pred = target) & (Alt ≠ target) • Global reset when « difficulties » to allocate • Dynamically monitor whether there are more failures than successes on allocations
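The two rules above can be sketched as below. The slides only say that failures and successes on allocations are monitored; the saturating counter, its `limit` parameter and the class name `AllocationMonitor` are assumptions made for illustration.

```python
def update_useful(u, pred_target, alt_target, actual):
    """Set U when the provider was right and the alternate prediction was wrong."""
    if pred_target == actual and alt_target != actual:
        return 1
    return u


class AllocationMonitor:
    """Hypothetical monitor: counts up on failed allocations, down on successes.

    When the counter reaches `limit`, the caller should clear every U bit
    (the global reset) so that allocation becomes possible again.
    """

    def __init__(self, limit=255):
        self.tick = 0
        self.limit = limit

    def record(self, successes, failures):
        self.tick = max(0, self.tick + failures - successes)
        if self.tick >= self.limit:
            self.tick = 0
            return True   # signal: reset all U bits now
        return False


monitor = AllocationMonitor(limit=4)
print(monitor.record(successes=0, failures=3))  # False
print(monitor.record(successes=2, failures=0))  # False
print(monitor.record(successes=0, failures=3))  # True
```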
Most of the storage space for targets • 32 bits per entry !! • More than 12K (PC,target) pairs on CLIENT05 • But only a maximum of 4038 different targets • Use 12 bit pointers + a 4K table
Let us be realistic: leverage target locality • All targets in at most 90 256KB regions • Use a 128-entry region table: • Fully associative, 240 bytes • Saves 7 bits per ITTAGE entry • Would have saved 39 bits on a 64-bit architecture !!
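The arithmetic behind the savings: a 256KB region needs an 18-bit offset and a 128-entry table needs a 7-bit pointer, so an entry stores 25 bits instead of 32 (or instead of 64). The sketch below illustrates the idea; the class name, the linear search standing in for a fully associative lookup, and the round-robin replacement are assumptions, not details from the slides.

```python
OFFSET_BITS = 18                       # 256 KB regions
OFFSET_MASK = (1 << OFFSET_BITS) - 1
TABLE_ENTRIES = 128                    # region table size
POINTER_BITS = 7                       # log2(128)


class RegionTable:
    def __init__(self, entries=TABLE_ENTRIES):
        self.slots = [None] * entries  # region numbers (target >> OFFSET_BITS)
        self.victim = 0                # round-robin replacement (assumption)

    def compress(self, target):
        """Split a target into (region pointer, region offset)."""
        region, offset = target >> OFFSET_BITS, target & OFFSET_MASK
        if region in self.slots:       # models the fully associative search
            pointer = self.slots.index(region)
        else:
            pointer = self.victim
            self.slots[pointer] = region
            self.victim = (self.victim + 1) % len(self.slots)
        return pointer, offset

    def expand(self, pointer, offset):
        """Rebuild the full target from what an ITTAGE entry stores."""
        return (self.slots[pointer] << OFFSET_BITS) | offset


stored_bits = POINTER_BITS + OFFSET_BITS
print(32 - stored_bits, 64 - stored_bits)  # 7 39

table = RegionTable()
pointer, offset = table.compress(0x08048A10)
print(hex(table.expand(pointer, offset)))  # 0x8048a10
```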
[Figure: entry layout Target | Tag | Ctr | U, with the Target field split into Region pointer and Region offset]
The global history • Conventional global branch history • 10 bits for indirect jumps, 5 bits for calls • mixing target and PC • -16 % MPPKI
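A minimal sketch of such a history update, assuming a simple XOR-and-fold hash to mix target and PC; the slides give the bit counts but not the hash. The names `mix`, `update_history` and the `MAX_HISTORY_BITS` cap are hypothetical.

```python
INDIRECT_BITS = 10        # bits inserted per indirect jump
CALL_BITS = 5             # bits inserted per call
MAX_HISTORY_BITS = 4096   # assumed cap on the history register


def mix(pc, target, nbits):
    """Hypothetical hash: XOR pc with target, then fold down to nbits."""
    value = pc ^ target
    folded = 0
    while value:
        folded ^= value & ((1 << nbits) - 1)
        value >>= nbits
    return folded


def update_history(history, kind, pc, target):
    """Shift mixed (PC, target) bits into the history for indirect jumps and calls."""
    if kind == "indirect":
        nbits = INDIRECT_BITS
    elif kind == "call":
        nbits = CALL_BITS
    else:
        return history    # other branch kinds are left out of this sketch
    history = (history << nbits) | mix(pc, target, nbits)
    return history & ((1 << MAX_HISTORY_BITS) - 1)


h = update_history(0, "indirect", pc=0x400, target=0x7FF)
h = update_history(h, "call", pc=0x10, target=0x3)
print(hex(h))  # 0x7ff3
```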
The global history (2) • Including all branches ? • Only indirect and calls: -2.5 % MPPKI • But no conclusion: • without 2 branches on INT05 and INT06, it is just the other way around
+ the other tricks (for TAGE) • Immediate Update Mimicker • Storage space interleaving • Picking the best set of history lengths • -1 % MPPKI
The Immediate Update Mimicker • Issue: • Some mispredictions due to late updates at retirement • Immediate Update Mimicker: • Try to catch these cases
The Immediate Update Mimicker [Figure: queue of in-flight branches starting at Fetch, each recorded as P(rediction), T(able), A(ddress in the table); some records are shown as E T A; a Misprediction is highlighted where a later branch uses the same table and the same entry]
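One way to picture the mechanism is the sketch below. It assumes that each in-flight branch records the table and entry that provided its prediction, and that a later prediction from the same table and entry reuses the target of the most recent such branch that has already executed. The class and method names are hypothetical and the slides do not give this level of detail.

```python
from collections import deque


class ImmediateUpdateMimicker:
    def __init__(self):
        self.inflight = deque()  # oldest first; one record per in-flight branch

    def predict(self, table, address, table_target):
        """Return (final target, record) for a branch predicted by (table, address)."""
        final = table_target
        for rec in reversed(self.inflight):          # most recent branch first
            if (rec["table"], rec["address"]) == (table, address) \
                    and rec["executed"] is not None:
                final = rec["executed"]              # mimic the pending update
                break
        rec = {"table": table, "address": address, "executed": None}
        self.inflight.append(rec)
        return final, rec

    def execute(self, rec, actual_target):
        """Record the resolved target once the branch has executed."""
        rec["executed"] = actual_target

    def retire(self):
        """Drop the oldest in-flight branch (the tables are updated at this point)."""
        return self.inflight.popleft()


ium = ImmediateUpdateMimicker()
t1, r1 = ium.predict(table=3, address=42, table_target=0x100)
ium.execute(r1, 0x200)             # resolved target differs: a misprediction
t2, r2 = ium.predict(table=3, address=42, table_target=0x100)
print(hex(t1), hex(t2))            # 0x100 0x200
```

The second lookup hits the same table and the same entry before the table has been updated at retirement, so the stale 0x100 is replaced by the executed target.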
For the competition: interleaving [Figure: interleaved table storage, with history inputs h[0,L1], two crossbars (Xbar), tag comparisons (=?), and the final prediction]
For the competition • Guided selection of the best set of history lengths: • 4K entries: 0, • 4K entries: 0, 10, • 4K entries: 16, 27, 44, 60, 96, 109, 219, 449, • 2K entries: 487, 714, 1313, 2146, 3881 • Remember: 10 bits per indirect, 5 per call
Where is the limit ? • Less than 3 % MPPKI • Why did you not use the « 12-bit pointer » trick ? • Just winning 0.5 % MPPKI
Summary • ITTAGE directly derived from TAGE • History should include (PC+target) for indirect and calls • Locality on targets can be leveraged • Marginal tricks not really worth it