Exploring Efficient SMT Branch Predictor Design
Matt Ramsay, Chris Feucht & Mikko H. Lipasti
University of Wisconsin-Madison PHARM Team
www.ece.wisc.edu/~pharm
Introduction & Motivation
• Two main performance limitations:
  • Memory stalls
  • Pipeline flushes due to incorrect speculation
• In SMTs:
  • Multiple threads help hide these problems
  • However, multiple threads make speculation harder because they interfere with shared prediction resources
  • This interference can cause more branch mispredictions and thus limit potential performance
Introduction & Motivation
• We study:
  • Providing each thread with its own pieces of the branch predictor to eliminate interference between threads
  • Applying these changes to different branch prediction schemes to evaluate their performance
• We hypothesize:
  • Eliminating thread interference in the branch predictor will improve prediction accuracy
  • Thread-level parallelism in an SMT makes branch prediction accuracy much less important than in a single-threaded processor
Talk Outline
• Introduction & Motivation
• SMT Overview
• Branch Prediction Overview
• Test Methodology
• Results
• Conclusions
SMT Overview
• Simultaneous Multithreading (SMT)
• Machines often have more resources than a single thread can use
• SMT exploits thread-level parallelism (TLP) along with instruction-level parallelism (ILP)
• [Figure: 4-wide issue example]
Tested Predictors
• Static predictors (in paper):
  • Always Taken
  • Backward-Taken, Forward-Not-Taken
• 2-Bit Predictor:
  • Branch History Table (BHT) indexed by the PC of the branch instruction
  • Allows significant aliasing between branches that share the low bits of their PC
  • Does not take advantage of global branch history information
• Gshare Predictor:
  • BHT indexed by the XOR of the branch PC and the global branch history
  • Hashing reduces aliasing
  • Correlates the prediction with global branch behavior
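A minimal C sketch of the 2-bit and gshare schemes described above. The 4096-entry table and 12-bit history match the configuration slide; the function names, saturating-counter update, and shift-in history update are illustrative assumptions rather than the paper's exact code.

#include <stdint.h>

#define BHT_ENTRIES 4096            /* matches the 4096-entry branch table in the paper */
#define HIST_BITS   12              /* 12-bit global branch history */

static uint8_t  bht[BHT_ENTRIES];   /* 2-bit saturating counters, 0..3 */
static uint32_t global_history;     /* shared global history register */

/* 2-bit scheme: index with the low PC bits only (no history). */
uint32_t index_2bit(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);
}

/* Gshare: XOR the PC with the global history to spread out aliases. */
uint32_t index_gshare(uint32_t pc) {
    return ((pc >> 2) ^ global_history) & (BHT_ENTRIES - 1);
}

int predict_taken(uint32_t idx) {
    return bht[idx] >= 2;           /* counter in the upper half => predict taken */
}

void update(uint32_t idx, int taken) {
    if (taken  && bht[idx] < 3) bht[idx]++;
    if (!taken && bht[idx] > 0) bht[idx]--;
    /* shift the outcome into the global history (used only by gshare indexing) */
    global_history = ((global_history << 1) | (taken & 1)) & ((1u << HIST_BITS) - 1);
}

/* Usage: int t = predict_taken(index_gshare(pc)); ... update(index_gshare(pc), actual_taken); */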
YAGS Predictor
• YAGS (Yet Another Global Scheme): a choice table indexed by the branch PC provides a per-branch bias, while small tagged taken/not-taken caches, indexed gshare-style, store only the exceptions to that bias
Indirect Branch Predictor
• Predicts the target of jump-register (JR) instructions
• Prediction table holds full target addresses
• Because each entry is a full address, entries are larger, so fewer fit and aliasing increases
• Indexed like the Gshare branch predictor
• Splitting the indirect predictor caused little change in branch prediction accuracy and overall performance (in paper)
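A sketch of an indirect (jump-register) target predictor indexed gshare-style, as described above. The 1024-entry table and 10-bit indirect history match the configuration slide; the always-train update and the rule for shifting the indirect history are assumptions.

#include <stdint.h>

#define IT_ENTRIES 1024             /* indirect target table entries */
#define IHIST_BITS 10               /* bits of indirect branch history */

static uint32_t target_table[IT_ENTRIES];   /* full predicted target addresses */
static uint32_t indirect_history;

uint32_t it_index(uint32_t pc) {
    return ((pc >> 2) ^ indirect_history) & (IT_ENTRIES - 1);
}

uint32_t predict_target(uint32_t pc) {
    return target_table[it_index(pc)];
}

void update_target(uint32_t pc, uint32_t actual_target) {
    target_table[it_index(pc)] = actual_target;        /* train with the resolved target */
    indirect_history = ((indirect_history << 1) | (actual_target & 1))
                       & ((1u << IHIST_BITS) - 1);     /* assumed history update rule */
}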
Talk Outline
• Introduction & Motivation
• SMT Overview
• Branch Prediction Overview
• Test Methodology
• Results
• Conclusions
Simulation Environment
• Multithreaded version of SimpleScalar developed by Craig Zilles at UW
• Machine configuration:
  • # of threads = 4; # of address spaces = 4
  • Machine width = 4; pipeline depth = 15; max issue window = 64; # of physical registers = 512
  • # bits in branch history = 12; # of branch table (BT) entries = 4096
  • # bits in indirect history = 10; # of indirect table (IT) entries = 1024
  • L1: 32 KB, direct-mapped, 64 B blocks, 1-cycle latency
  • L2: 1 MB, 4-way, 128 B blocks, 10-cycle latency
  • Memory latency = 200 cycles
  • # instructions simulated = ~40M
Benchmarks Tested
• From SPEC CPU2000:
  • INT: crafty, gcc
  • FP: ammp, equake
• Benchmark configurations:
  • Heterogeneous threads: each thread runs one of the listed benchmarks, simulating a multi-tasking environment
  • Homogeneous threads: each thread runs a separate copy of the same benchmark (crafty), simulating a multithreaded server environment
Shared Configuration
• Baseline: all threads share a single history register and prediction table
Split Branch Configuration
• [Figure: each thread (0-3) has its own private history register and predictor]
• Each predictor block retains its original size when duplicated
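A sketch of the split branch configuration: every hardware thread gets a private gshare predictor (history register plus full-size counter table), selected by the thread ID. Names are illustrative; the per-thread table keeps the original 4096 entries, as the slide notes.

#include <stdint.h>

#define NUM_THREADS 4
#define BHT_ENTRIES 4096

typedef struct {
    uint8_t  bht[BHT_ENTRIES];      /* private 2-bit counters */
    uint32_t history;               /* private global-history register */
} thread_predictor_t;

static thread_predictor_t pred[NUM_THREADS];

int split_branch_predict(int tid, uint32_t pc) {
    thread_predictor_t *p = &pred[tid];                  /* thread ID selects the predictor */
    uint32_t idx = ((pc >> 2) ^ p->history) & (BHT_ENTRIES - 1);
    return p->bht[idx] >= 2;
}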
Split Branch Table Configuration
• [Figure: the thread ID selects a per-thread branch table; the history register remains shared]
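A sketch of the split branch table configuration: one history register is shared by all threads, and the thread ID selects a private counter table. Whether each private table keeps the full 4096 entries or a partition of them is an assumption here; the selection-by-thread-ID structure is the point.

#include <stdint.h>

#define NUM_THREADS 4
#define BHT_ENTRIES 4096

static uint8_t  bht[NUM_THREADS][BHT_ENTRIES];   /* private table per thread */
static uint32_t shared_history;                  /* single shared history register */

int split_table_predict(int tid, uint32_t pc) {
    uint32_t idx = ((pc >> 2) ^ shared_history) & (BHT_ENTRIES - 1);
    return bht[tid][idx] >= 2;                   /* thread ID selects the table */
}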
Split History Configuration
• [Figure: the thread ID selects a per-thread history register; the branch table is shared]
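A sketch of the split history configuration: each thread keeps its own branch-history register, but all threads share one counter table. This is the cheap variant that the results show performing nearly as well as the fully split predictor; sizes match the configuration slide, and the code itself is illustrative.

#include <stdint.h>

#define NUM_THREADS 4
#define BHT_ENTRIES 4096
#define HIST_BITS   12

static uint8_t  shared_bht[BHT_ENTRIES];     /* one table shared by all threads */
static uint32_t history[NUM_THREADS];        /* private history per thread */

int split_history_predict(int tid, uint32_t pc) {
    uint32_t idx = ((pc >> 2) ^ history[tid]) & (BHT_ENTRIES - 1);
    return shared_bht[idx] >= 2;
}

void split_history_update(int tid, uint32_t pc, int taken) {
    uint32_t idx = ((pc >> 2) ^ history[tid]) & (BHT_ENTRIES - 1);
    if (taken  && shared_bht[idx] < 3) shared_bht[idx]++;
    if (!taken && shared_bht[idx] > 0) shared_bht[idx]--;
    history[tid] = ((history[tid] << 1) | (taken & 1)) & ((1u << HIST_BITS) - 1);
}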
Talk Outline
• Introduction & Motivation
• SMT Overview
• Branch Prediction Overview
• Test Methodology
• Results
• Conclusions
Split Branch Predictor Accuracy
• Full predictor split: each predictor behaves as expected, just as it would in a single-threaded environment
Shared Branch Predictor Accuracy
• Shared predictor: prediction accuracy suffers because of interference from other threads (especially Gshare)
Prediction Accuracy: Heterogeneous Threads
• YAGS & Gshare:
  • Sharing the history register performs very poorly
  • The split history configuration performs almost as well as the split branch configuration while using significantly fewer resources
• 2-Bit: splitting the predictor performs better; mispredictions drop from 9.52% to 8.35%
Prediction Accuracy: Homogeneous Threads
• YAGS & Gshare:
  • Configurations perform similarly to the heterogeneous-thread case
  • The split history configuration comes even closer to the split branch configuration because of positive aliasing in the shared BHT
• Surprisingly, splitting portions of the predictor still performs better even when every thread runs the same program
Per-Thread CPI: Heterogeneous Threads
• Sharing the history register with Gshare has a significant negative effect on performance (near 50% mispredictions)
• The split history configuration delivers almost the same performance as the split branch configuration while using significantly fewer resources
Per-Thread CPI: Homogeneous Threads
• Per-thread performance is worse in the homogeneous configuration because the crafty benchmark has the highest number of cache misses
Performance Across Predictors
• The branch prediction scheme has little effect on performance
• Only 2.75% and 5% CPI increases when the Gshare and 2-bit predictors are used instead of the much more expensive YAGS
• The corresponding increases are 6% and 11% on a single-threaded machine
• The heterogeneous thread configuration behaves similarly
Performance Across Predictors
• The split history configuration still lets performance hold up for the simpler schemes
• 4% and 6.25% CPI increases for the Gshare and 2-bit schemes compared to YAGS
• Simpler schemes allow for reduced cycle time and power consumption
• CPI numbers are only close estimates because the simulations are not deterministic
Talk Outline
• Introduction & Motivation
• SMT Overview
• Branch Prediction Overview
• Test Methodology
• Results
• Conclusions
Conclusions
• Multithreaded execution interferes with branch prediction accuracy
• Prediction accuracy trends are similar across the homogeneous and heterogeneous thread test cases
• Splitting only the branch history gives the best branch prediction accuracy and performance per resource
• Performance (CPI) is relatively stable, even when the branch prediction structure is simplified