A 256 Kbits L-TAGE branch predictor André Seznec IRISA/INRIA/HIPEAC
Directly derived from: A case for (partially) tagged branch predictors, A. Seznec and P. Michaud, JILP, Feb. 2006 + Tricks: loop predictor, kernel/user histories
TAGE: TAgged GEometric history length predictors The genesis
Back around 2003 • 2bcgskew was state-of-the-art: • but it was lagging behind neural-inspired predictors on a few benchmarks • We just wanted to get the best of both behaviors while maintaining: • Reasonable implementation cost: • Use only global history • Medium number of tables • In-time response
The basis: a multiple-length global history predictor • [Figure: tables T0–T4, each indexed with a different history length L(0) to L(4), with a selector combining their predictions]
GEometric History Length predictor • The set of history lengths forms a geometric series, e.g. {0, 2, 4, 8, 16, 32, 64, 128} • Captures correlation on very long histories, while most of the storage still serves short histories !! • What is important: L(i) - L(i-1) is drastically increasing
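As a rough illustration, a minimal sketch of how such a series can be generated, assuming the usual geometric formula L(i) = round(alpha^(i-1) * L(1)); the function name and the chosen parameters are illustrative:

#include <cmath>
#include <cstdio>

// Geometric series of history lengths: L(i) = round(alpha^(i-1) * L(1)), L(0) = 0.
// With L(1) = 2 and alpha = 2 this reproduces the series {0, 2, 4, 8, 16, 32, 64, 128}.
int history_length(int i, double alpha, int l1) {
    if (i == 0) return 0;                        // L(0): the tagless base table uses no history
    return (int)(std::pow(alpha, i - 1) * l1 + 0.5);
}

int main() {
    for (int i = 0; i < 8; i++)
        std::printf("L(%d) = %d\n", i, history_length(i, 2.0, 2));
    // Note how L(i) - L(i-1) keeps growing: most tables still see short histories.
    return 0;
}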
Combining multiple predictions ? • Classical solution: • Use a meta-predictor: “wasting” storage !?! choosing among 5 or 10 predictions ?? • Neural-inspired predictors, Jimenez and Lin 2001: • Use an adder tree instead of a meta-predictor • Partial matching: • Use tagged tables and the longest matching history, Chen et al. 96, Michaud 2005
CBP-1 (2004): OGEHL • Final computation through a sum: Prediction = Sign(Σ) • 12 components • 3.670 misp/KI • [Figure: tables T0–T4 indexed with history lengths L(0)–L(4) feeding an adder tree]
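A minimal sketch of that adder-tree combination, for illustration only; the counter width, the centering bias and the name gehl_predict are assumptions, not the submitted OGEHL configuration:

#include <cstdint>
#include <cstddef>
#include <vector>

// GEHL-style combination: each table supplies a signed saturating counter read
// with its own history length; the final prediction is the sign of their sum,
// so no meta-predictor is needed.
bool gehl_predict(const std::vector<std::vector<int8_t>>& tables,
                  const std::vector<std::size_t>& index) {   // one index per table
    int sum = (int)tables.size() / 2;            // small bias to center the threshold of asymmetric counters
    for (std::size_t t = 0; t < tables.size(); t++)
        sum += tables[t][index[t]];
    return sum >= 0;                             // Prediction = Sign of the sum
}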
TAGE: Geometric history length + PPM-like + optimized update policy • Tagless base predictor • [Figure: tagged tables indexed by hashes of pc with h[0:L1], h[0:L2], h[0:L3], ...; each entry holds tag, ctr and u fields; partial tag comparisons (=?) drive the final prediction]
[Figure: tag comparisons (=?) on each tagged table; the hit on the longest history provides Pred, the next hit provides Altpred, and a miss on all tags falls back to the base prediction]
Prediction computation • General case: • Longest matching component provides the prediction • Special case: • Many mispredictions on newly allocated entries (weak Ctr): on many applications, Altpred is more accurate than Pred • Property dynamically monitored through a single 4-bit counter
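A minimal sketch of this selection; the entry layout, the name use_alt_on_na and the 4-bit-counter threshold are illustrative, and the per-table indices and tags are assumed to be computed elsewhere from pc and the geometric histories:

#include <cstdint>
#include <cstddef>
#include <vector>

struct TageEntry { uint16_t tag; int8_t ctr; uint8_t u; };   // 3-bit signed ctr, 2-bit u

// The longest-history tag match provides Pred; the next match (or the tagless
// base predictor) provides Altpred.  A single 4-bit counter, called
// use_alt_on_na here, decides whether to trust Altpred when the providing
// entry looks newly allocated (weak counter, u == 0).
bool tage_predict(const std::vector<std::vector<TageEntry>>& tables,
                  const std::vector<std::size_t>& idx,
                  const std::vector<uint16_t>& tag,
                  bool base_pred, int use_alt_on_na) {
    int provider = -1, alt = -1;
    for (int t = (int)tables.size() - 1; t >= 0; t--)         // longest history first
        if (tables[t][idx[t]].tag == tag[t]) {
            if (provider < 0) provider = t;
            else { alt = t; break; }
        }
    if (provider < 0) return base_pred;                       // no tag hit at all
    const TageEntry& e = tables[provider][idx[provider]];
    bool pred    = e.ctr >= 0;
    bool altpred = (alt >= 0) ? (tables[alt][idx[alt]].ctr >= 0) : base_pred;
    bool newly_allocated = (e.u == 0) && (e.ctr == 0 || e.ctr == -1);   // weak Ctr
    return (newly_allocated && use_alt_on_na >= 8) ? altpred : pred;
}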
TAGE update policy • General principle: Minimize the footprint of the prediction. • Just update the longest history matching component and allocate at most one entry on mispredictions
A tagged table entry: U | Tag | Ctr • Ctr: 3-bit prediction counter • U: 2-bit useful counter • Was the entry recently useful ? • Tag: partial tag
Updating the U counter • If (Altpred ≠ Pred) then: • Pred correct (= branch outcome): U = U + 1 • Pred incorrect: U = U - 1 • Graceful aging: • Periodic aging of all U counters • implemented through the reset of a single bit at a time
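A minimal sketch of these two rules; the 2-bit counter width comes from the entry format above, while the alternating-bit reset and the function names are illustrative:

#include <cstdint>
#include <vector>

// U is touched only when Pred and Altpred disagree, so it measures whether the
// providing entry actually improved on the alternate prediction.
void update_useful(uint8_t& u, bool pred, bool altpred, bool outcome) {
    if (pred == altpred) return;
    if (pred == outcome) { if (u < 3) u++; }      // 2-bit saturating counter
    else                 { if (u > 0) u--; }
}

// Graceful aging: periodically clear one bit of every U counter, alternating
// between the upper and the lower bit, instead of resetting them all at once.
void age_useful(std::vector<uint8_t>& u_counters, bool clear_high_bit) {
    for (uint8_t& u : u_counters)
        u &= clear_high_bit ? 0x1 : 0x2;          // reset a single bit of each 2-bit counter
}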
Allocating a new entry on a misprediction • Find a single “useless” entry in a component with a longer history: • Privilege the smallest possible history • to minimize the footprint • But not too strictly • to avoid ping-pong phenomena • Initialize Ctr as weak and U as zero
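A minimal sketch of this allocation rule; the random skip probability used here to break ping-pong is an assumption, not the exact submitted heuristic:

#include <cstdint>
#include <cstdlib>
#include <cstddef>
#include <vector>

struct TageEntry { uint16_t tag; int8_t ctr; uint8_t u; };

// On a misprediction, steal at most one entry in a component with a longer
// history than the provider.  Prefer short histories (smaller footprint), but
// occasionally skip a candidate so two branches cannot keep evicting each other.
void allocate_on_mispredict(std::vector<std::vector<TageEntry>>& tables,
                            const std::vector<std::size_t>& idx,
                            const std::vector<uint16_t>& tag,
                            int provider, bool outcome) {
    for (std::size_t t = provider + 1; t < tables.size(); t++) {
        TageEntry& e = tables[t][idx[t]];
        if (e.u != 0) continue;                   // only "useless" entries may be stolen
        if (t + 1 < tables.size() && (std::rand() & 3) == 0)
            continue;                             // sometimes jump to a longer history: anti-ping-pong
        e.tag = tag[t];
        e.ctr = outcome ? 0 : -1;                 // initialize as weak taken / weak not-taken
        e.u   = 0;
        return;                                   // allocate at most one entry
    }
}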
Improving the global history • Address + conditional branch history: • path confusion on short histories • Address + path: • direct hashing still leads to path confusion • Hence: • Represent all branches in the branch history • Also use a path history (1 bit per branch, limited to 16 bits)
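A minimal sketch of the corresponding history update; the 640-bit history width matches the maximum length on the next slide, everything else is illustrative:

#include <bitset>
#include <cstdint>

// Global branch history: every branch, conditional or not, shifts in one bit.
// Path history: 1 low-order address bit per branch, limited to 16 bits.
struct Histories {
    std::bitset<640> ghist;      // long enough for the longest history length used
    uint16_t         phist = 0;  // 16-bit path history
};

void update_history(Histories& h, uint64_t branch_pc, bool taken) {
    h.ghist <<= 1;
    h.ghist[0] = taken;                                        // direction bit
    h.phist = (uint16_t)((h.phist << 1) | (branch_pc & 1));    // 1 address bit per branch
}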
Design tradeoff for CBP2 (1) • 13 components: • brings the best accuracy on distributed traces • 8 components not very far ! • History length: • Min = 4, Max = 640 • Could use any Min in [2,6] and any Max in [300, 2000]
Design tradeoff for CBP2 (2) • Tag width tradeoff: • (destructive) false match is better tolerated on shorter history • 7 bits on T1 to 15 bits on T12 • Tuning the number of table entries: • Smaller number for very long histories • Smaller number for very short histories
Adding a loop predictor • The loop predictor captures the number of iterations of a loop • When the same number of iterations has been successively encountered 4 times, the loop predictor provides the prediction • Advantages: • Very reliable • Small storage budget: 256 52-bit entries • Complexity ? • Might be difficult to manage speculative iteration numbers on deep pipelines
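A minimal sketch of the iteration-counting mechanism only; a real 52-bit entry also holds a tag and age bits, and the field names here are illustrative:

#include <cstdint>

// One loop-predictor entry: remembers the trip count of a loop branch and only
// takes over once the same count has been observed 4 times in a row.
struct LoopEntry {
    uint16_t trip_count   = 0;   // iterations seen on the last complete execution of the loop
    uint16_t current_iter = 0;   // iterations in the ongoing execution
    uint8_t  confidence   = 0;   // saturates at 4: the entry then provides the prediction
};

// Returns true when the entry provides a prediction; 'prediction' is then valid.
bool loop_predict(const LoopEntry& e, bool& prediction) {
    if (e.confidence < 4) return false;
    prediction = (e.current_iter < e.trip_count);   // taken until the known trip count, then exit
    return true;
}

void loop_update(LoopEntry& e, bool taken) {
    if (taken) { e.current_iter++; return; }
    // The loop exited: did it run for the same number of iterations as last time?
    if (e.current_iter == e.trip_count) { if (e.confidence < 4) e.confidence++; }
    else { e.trip_count = e.current_iter; e.confidence = 0; }
    e.current_iter = 0;
}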
Using a kernel history and a user history • Traces mix user and kernel activities: • Kernel activity after exception • Global history pollution • Solution: use two separate global histories • User history is updated only in user mode • Kernel history is updated in both modes
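A minimal sketch of this split; the privilege-mode flag is whatever the trace or simulator exposes, and the names are illustrative:

#include <cstdint>

// Two separate global histories: the user history is shielded from the bursts
// of kernel activity that follow exceptions.
struct DualHistory {
    uint64_t user_ghist   = 0;   // updated only in user mode
    uint64_t kernel_ghist = 0;   // updated in both modes
};

void update_dual_history(DualHistory& h, bool taken, bool in_kernel_mode) {
    h.kernel_ghist = (h.kernel_ghist << 1) | (taken ? 1 : 0);
    if (!in_kernel_mode)
        h.user_ghist = (h.user_ghist << 1) | (taken ? 1 : 0);
}

// The predictor tables are then indexed with whichever history matches the
// current privilege mode.
uint64_t active_history(const DualHistory& h, bool in_kernel_mode) {
    return in_kernel_mode ? h.kernel_ghist : h.user_ghist;
}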
L-TAGE submission accuracy (distributed traces) 3.314 misp/KI
Reducing L-TAGE complexity • The included 241.5 Kbits TAGE predictor alone: • 3.368 misp/KI • Loop predictor beneficial only on gzip: might not be worth the extra complexity
Using fewer tables • 8-component 256 Kbits TAGE predictor: • 3.446 misp/KI
TAGE prediction computation time ? • 3 successive steps: • Index computation • Table read • Partial match + multiplexor • Does not fit on a single cycle: • But can be ahead pipelined !
Ahead pipelining a global history branch predictor (principle) • Initiate branch prediction X+1 cycles in advance to provide the prediction in time • Use information available: • X-block ahead instruction address • X-block ahead history • To ensure accuracy: • Use intermediate path information
In practice • Ahead pipelined TAGE: 4 parallel prediction computations • [Figure: instruction blocks A, B, C with branch bc, tables indexed with the ahead history Ha and the address of block A]
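A minimal sketch of the final selection step only; the choice of 2 intermediate path bits (hence 4 candidates) matches the figure above, everything else is an assumption:

#include <array>

// Ahead pipelining: the slow table reads are started with the X-block-ahead
// address and history, producing one candidate prediction per possible
// intermediate path.  The late-arriving path bits only drive a small mux.
bool select_ahead_prediction(const std::array<bool, 4>& candidates,   // computed in parallel, ahead of time
                             unsigned intermediate_path_bits) {       // 2 bits resolved just in time
    return candidates[intermediate_path_bits & 3];
}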
3-branch ahead pipelined, 8-component 256 Kbits TAGE: 3.552 misp/KI
A final case for the Geometric History Length predictors • delivers state-of-the-art accuracy • uses only global information: • Very long history: 300+ bits !! • can be ahead pipelined • many effective design points • OGEHL or TAGE • Nb of tables, history lengths