Performance-Aware Speculation Control with Wrong Path Usefulness Prediction

Performance-Aware Speculation Control using Wrong Path Usefulness Prediction. Chang Joo Lee, Hyesoon Kim*, Onur Mutlu**, Yale N. Patt. HPS Research Group, University of Texas at Austin; *School of Computer Science, Georgia Institute of Technology; **Microsoft Research

Outline • Motivation • Mechanism • Experimental Evaluation • Conclusion

Fetch Gating (Pipeline Gating) • Proposed by Manne et al. [ISCA98] • Stops fetching instructions on wrong path to save energy. • Assumes wrong-path instructions do not contribute to performance and consume energy. • Various fetch gating mechanisms • Baniasadi and Moshovos [ISLPED01], Karkhanis et al. [ISLPED02], Aragon et al. [HPCA03], Buyuktosunoglu et al. [GLSVLSI03], Collins et al. [MICRO04]

Limitations of Previous Mechanisms • Hardware complexity • Branch confidence estimator, changes to critical/power-hungry structures. • Additional hardware can offset energy savings due to fetch gating. • Assumption • Wrong-path execution consumes energy but is useless for performance.

Is Wrong Path Execution Really Useless? • Perfect fetch gating • parser: energy consumption decreases by 28% but performance degrades by 5% • mcf: performance degrades by 30% and energy consumption increases by 15% • Performance of most benchmarks increases with perfect fetch gating.

Why Does Performance Degrade with Perfect Fetch Gating? [Figure: per-benchmark charts labeled MPKI: 36.6 and MPKI: 1.5] • parser: 37% used wrong-path fills, 14% unused wrong-path fills; 5% performance degradation with perfect fetch gating • mcf: almost all wrong-path L2 fills are used; memory intensive (MPKI: 36.6); 30% performance degradation with perfect fetch gating • Wrong-path execution can prefetch useful data: Butler [Thesis93], Pierce and Mudge [IPPS94, MICRO96], Mutlu et al. [IEEE TC05]

Why Can Wrong Path Execution Be Useful? • From mcf • Hammock structure within a frequently executed loop • BR in BB2 is frequently mispredicted • Since memory latency is large, wrong-path prefetching benefit can be significant • Taking wrong-path usefulness into account is important [Figure: control-flow graph of the hammock. BR in BB2 leads to BB3 (Load A, Load B, JMP) or BB4 (Load A, Load B), and both paths reach BB5 (Load C). Loads on the mispredicted path miss in the L2 cache; after misprediction recovery, the same loads hit in the cache.]

Our Solution: Performance-Aware Speculation Control • Hardware complexity: simple, low-cost fetch gating mechanism • Wrong-path usefulness: low-cost Wrong Path Usefulness Predictor (WPUP) • Fetch gate only when wrong-path execution is useless [Figure: block diagram of Performance-Aware Speculation Control with the Fetch Engine, Fetch Gating, and WPUP blocks; signals labeled Branch Count, Lookup, Useful, and Gate Enable.]

Our Fetch Gating Mechanism • Branch-count based mechanism • More branches → higher chance of misprediction. • Fetch gate if (# of Branches) > Threshold • Mispredictions show phase behavior. • Threshold is determined by branch prediction accuracy for a certain period. • Higher accuracy → Higher threshold • No need for complex logic (e.g. confidence estimator)
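The gating rule on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the hardware described in the talk: the slide does not say which branches are counted, so the sketch counts unresolved in-flight branches; the measurement interval, the starting threshold, and the accuracy-to-threshold table in `threshold_for` are all hypothetical values chosen for illustration. Squashing of younger branches on a misprediction is omitted.

```python
from dataclasses import dataclass


def threshold_for(accuracy: float) -> int:
    """Hypothetical mapping: higher prediction accuracy gives a higher threshold."""
    if accuracy >= 0.99:
        return 12
    if accuracy >= 0.95:
        return 8
    if accuracy >= 0.90:
        return 4
    return 2


@dataclass
class BranchCountGate:
    """Toy model of branch-count-based fetch gating (constants are assumptions)."""
    threshold: int = 4      # hypothetical starting threshold
    interval: int = 1000    # hypothetical: resolved branches per measurement period
    in_flight: int = 0      # assumption: count unresolved branches in the pipeline
    resolved: int = 0
    mispredicted: int = 0

    def on_branch_fetched(self) -> None:
        self.in_flight += 1

    def on_branch_resolved(self, was_mispredicted: bool) -> None:
        self.in_flight = max(0, self.in_flight - 1)
        self.resolved += 1
        self.mispredicted += int(was_mispredicted)
        if self.resolved == self.interval:
            # End of the period: retune the threshold from measured accuracy.
            accuracy = 1.0 - self.mispredicted / self.resolved
            self.threshold = threshold_for(accuracy)
            self.resolved = 0
            self.mispredicted = 0

    def should_gate(self) -> bool:
        # Fetch gate if (# of branches) > threshold.
        return self.in_flight > self.threshold
```

In this sketch the only per-branch state is a counter, which is the point the slide makes about avoiding a confidence estimator.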

Two WPUP Mechanisms • Branch PC-based WPUP (Fine grained) • Phase-based WPUP (Coarse grained) • Can be combined with other fetch gating mechanisms.

Branch PC-based WPUP • Basic idea • Identifies and records conditional branch PCs that lead to useful wrong-path memory references • If the fetched branch is recorded as useful, do not fetch gate

Branch PC-based WPUP • Implementation • Fetch Engine • Latest Branch PC Register (LBPC, 16bits) • LBPC value carried through pipeline • Miss Status Holding Registers (MSHR) • Branch ID field (BID, 10bits) • Already used for branch misprediction recovery • Branch PC field (BPC, 16bits) • Wrong Path field (WP, 1bit) • WPUP Cache • 4 way set-associative, No Data Store, LRU
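A tag-only, 4-way set-associative structure with LRU replacement, as listed for the WPUP cache, might be modeled as follows. This is a minimal sketch: the class name, the number of sets, and the indexing by the low bits of the 16-bit branch PC are assumptions, since the slide gives only the associativity, the replacement policy, and the absence of a data store.

```python
from collections import OrderedDict


class WPUPCache:
    """Tag-only set-associative cache of 'useful' branch PCs (sketch)."""

    def __init__(self, num_sets: int = 16, ways: int = 4, pc_bits: int = 16):
        self.num_sets = num_sets            # assumption: size is not given on the slide
        self.ways = ways                    # 4-way, per the slide
        self.pc_mask = (1 << pc_bits) - 1   # 16-bit branch PC field, per the slide
        # One OrderedDict per set; key order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _locate(self, branch_pc: int):
        tag = branch_pc & self.pc_mask
        return self.sets[tag % self.num_sets], tag  # assumption: index by low bits

    def record_useful(self, branch_pc: int) -> None:
        """Training: remember a branch whose wrong path proved useful."""
        entries, tag = self._locate(branch_pc)
        if tag in entries:
            entries.move_to_end(tag)        # refresh recency
            return
        if len(entries) >= self.ways:
            entries.popitem(last=False)     # evict the least recently used tag
        entries[tag] = None                 # no data store: presence is the payload

    def predicts_useful(self, branch_pc: int) -> bool:
        """Prediction: a hit means 'do not fetch gate' for this branch."""
        entries, tag = self._locate(branch_pc)
        if tag in entries:
            entries.move_to_end(tag)
            return True
        return False
```

Because the structure stores no data, a lookup answers a single yes/no question, which is all the fetch gate needs.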

Branch PC-Based WPUP (Training) [Figure: training walkthrough on the hammock example. BR 2 in BB2 has PC2 and BID 2, and LBPC holds PC2. Load A and Load B in BB3 and Load C in BB5 are issued with PC2 and BID 2 and allocate MSHR entries (A, B, C) on L2 cache misses. BR 2 is mispredicted and the branch unit supplies BID 2. After misprediction recovery, Load A in BB4 hits in the MSHR.] MSHR hit; wrong path was useful. BPC (PC2) is stored in the WPUP cache.
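The training flow can be approximated in software as below. This is a simplified sketch with several assumptions that go beyond the slide: MSHR entries are keyed by a block address, an entry is marked wrong-path when its BID equals the mispredicted branch's BID (real hardware would need to cover all younger branches as well), any later request that hits a wrong-path entry is treated as a correct-path use, and a plain Python set stands in for the WPUP cache. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MSHREntry:
    bid: int          # Branch ID field
    bpc: int          # Branch PC field, copied from LBPC
    wp: bool = False  # Wrong Path field


@dataclass
class TrainingModel:
    mshr: dict = field(default_factory=dict)              # block address -> entry
    useful_branch_pcs: set = field(default_factory=set)   # stand-in for WPUP cache

    def on_l2_miss(self, block_addr: int, bid: int, lbpc: int) -> None:
        entry = self.mshr.get(block_addr)
        if entry is None:
            # New outstanding miss: remember which branch it followed.
            self.mshr[block_addr] = MSHREntry(bid=bid, bpc=lbpc)
        elif entry.wp:
            # MSHR hit on a wrong-path entry: the wrong path was useful.
            self.useful_branch_pcs.add(entry.bpc)

    def on_misprediction(self, bid: int) -> None:
        # Simplification: match on equal BID only.
        for entry in self.mshr.values():
            if entry.bid == bid:
                entry.wp = True

    def on_fill(self, block_addr: int) -> None:
        # The miss has been serviced; free the entry.
        self.mshr.pop(block_addr, None)
```

Replaying the slide's example (loads A and B issued under BR 2, BR 2 mispredicted, then Load A requested again) leaves PC2 recorded as a useful branch in this model.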

Branch PC-Based WPUP (Prediction) [Figure: prediction walkthrough on the hammock example. When BR 2 (PC2) in BB2 is fetched, LBPC holds PC2 and the WPUP cache, which contains PC2, is looked up to decide whether to fetch gate during wrong-path execution.] Hit; do not fetch gate.

Phase-based WPUP • Basic idea • Predict if the current phase will provide useful wrong-path memory references • If so, do not fetch gate

Phase-based WPUP • Implementation • Wrong Path Usefulness Counter (WPUC, 5bits) • Incremented for each useful wrong-path memory reference • Reset periodically • Do not fetch gate if WPUC > threshold • BPC fields or WPUP cache not needed
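The phase-based predictor reduces to a single counter, which can be sketched as follows. The 5-bit width comes from the slide; the saturating behavior, the threshold value, and the reset period are assumptions, and the names are hypothetical.

```python
class PhaseWPUP:
    """Sketch of a phase-based wrong path usefulness predictor."""

    def __init__(self, bits: int = 5, threshold: int = 8, period: int = 100_000):
        self.max_value = (1 << bits) - 1  # 5-bit WPUC; saturation is an assumption
        self.threshold = threshold        # hypothetical value
        self.period = period              # hypothetical reset period, in cycles
        self.counter = 0
        self.elapsed = 0

    def on_useful_wrong_path_reference(self) -> None:
        # Incremented for each useful wrong-path memory reference.
        self.counter = min(self.counter + 1, self.max_value)

    def tick(self, cycles: int = 1) -> None:
        # Reset periodically so the counter reflects the current phase.
        self.elapsed += cycles
        if self.elapsed >= self.period:
            self.counter = 0
            self.elapsed = 0

    def allow_gating(self) -> bool:
        # Do not fetch gate if WPUC > threshold.
        return self.counter <= self.threshold
```

One consequence of resetting to zero in this sketch is that gating is permitted at the start of every period until the count builds up again; a design could instead latch the previous period's count, but the slide does not say which is used.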

Simulation Methodology • Alpha ISA execution driven simulator • Baseline processor configuration • 2GHz, 8-wide issue, out-of-order, 128-entry ROB • Hybrid branch predictor (64K-entry gshare and 64K-entry PAs) • 11 stages (minimum branch misprediction penalty) • 1MB, 8-way unified L2 cache • 32 L2 MSHRs, 300 cycle memory latency • Stream prefetcher • Wattch power model: 100 nm, 1.2V technology • Manne’s fetch gating • Gating threshold: 3 low confidence branches • JRS confidence estimator (4K-entry, 4bit-MDC) • Tuned for the best energy-delay product • Branch Count-based fetch gating

Branch-Count Based Fetch Gating • Manne’s and our fetch gating degrade performance of mcf and parser • Performance and energy savings are higher than Manne’s.

WPUP Mechanisms • Improve performance and energy savings compared to Manne’s • Improve performance of mcf and parser

Hardware Cost • Performance-Aware Speculation Control vs. Manne’s Fetch Gating

Comparison with Manne’s Fetch Gating • WPUPs improve the performance and energy efficiency of Manne’s fetch gating • 2.5% less performance degradation, 1.0% more energy savings

Energy-Delay Product • Improves energy-delay product (by 2.6% compared to Manne’s)

Conclusion • Performance-Aware Speculation Control • Branch count-based fetch gating • Simple and low cost. • Introduced Wrong Path Usefulness Prediction • Recovers performance loss due to fetch gating by executing useful wrong-path instructions. • Can be combined with other fetch gating mechanisms. • Reduces performance loss due to fetch gating and also saves energy.

Questions?
