
Exploring Efficient SMT Branch Predictor Design


Presentation Transcript


  1. Exploring Efficient SMT Branch Predictor Design
  Matt Ramsay, Chris Feucht & Mikko H. Lipasti
  University of Wisconsin-Madison PHARM Team
  www.ece.wisc.edu/~pharm

  2. Introduction & Motivation
  • Two main performance limitations:
    • Memory stalls
    • Pipeline flushes due to incorrect speculation
  • In SMTs:
    • Multiple threads help hide these stalls
    • However, multiple threads make speculation harder because they interfere in shared prediction resources
    • This interference can cause more branch mispredictions and thus limit potential performance

  3. Introduction & Motivation
  • We study:
    • Giving each thread its own pieces of the branch predictor to eliminate interference between threads
    • Applying these changes to different branch prediction schemes to evaluate their performance
  • We hypothesize:
    • Eliminating thread interference in the branch predictor will improve prediction accuracy
    • Thread-level parallelism in an SMT makes branch prediction accuracy much less important than in a single-threaded processor

  4. Talk Outline
  • Introduction & Motivation
  • SMT Overview
  • Branch Prediction Overview
  • Test Methodology
  • Results
  • Conclusions

  5. SMT Overview
  • Simultaneous Multithreading
  • Machines often have more resources than one thread can use
  • SMT allows TLP to be exploited alongside ILP
  • 4-wide example: [figure]

  6. Tested Predictors
  • Static predictors (in paper):
    • Always Taken
    • Backward-Taken-Forward-Not-Taken
  • 2-Bit Predictor:
    • Branch History Table (BHT) indexed by the PC of the branch instruction
    • Allows significant aliasing between branches that share low PC bits
    • Does not take advantage of global branch history information
  • Gshare Predictor:
    • BHT indexed by the XOR of the branch PC and the global branch history
    • Hashing reduces aliasing
    • Correlates predictions with global branch behavior

  7. YAGS Predictor

  8. Indirect Branch Predictor
  • Predicts the target of jump-register (JR) instructions
  • Prediction table holds target addresses
  • Larger table entries lead to more aliasing
  • Indexed like the Gshare branch predictor
  • A split indirect predictor caused little change in branch prediction accuracy and overall performance (in paper)
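The indirect predictor described above can be sketched as a direct-mapped table of full target addresses with a gshare-style index. The sizes match the configuration on slide 10 (10 history bits, 1024 entries); the function names are illustrative, not from the paper.

```c
#include <stdint.h>

#define IT_ENTRIES 1024               /* 1024-entry indirect table, 10 history bits (slide 10) */
#define IT_MASK    (IT_ENTRIES - 1)

/* Each entry caches the last observed target of a JR instruction; because an
   entry is a full address rather than a 2-bit counter, fewer entries fit in a
   given budget, which is why larger entries lead to more aliasing. */
static uint32_t indirect_table[IT_ENTRIES];

/* Lookup, indexed gshare-style: branch PC XORed with the indirect history. */
static uint32_t predict_jr(uint32_t pc, uint32_t ind_history) {
    return indirect_table[((pc >> 2) ^ ind_history) & IT_MASK];
}

/* On resolution, store the actual target back into the same entry. */
static void update_jr(uint32_t pc, uint32_t ind_history, uint32_t actual_target) {
    indirect_table[((pc >> 2) ^ ind_history) & IT_MASK] = actual_target;
}
```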

  9. Talk Outline
  • Introduction & Motivation
  • SMT Overview
  • Branch Prediction Overview
  • Test Methodology
  • Results
  • Conclusions

  10. Simulation Environment
  • Multithreaded version of SimpleScalar developed by Craig Zilles at UW
  • Machine configuration:
    • # of Threads = 4, # of Address Spaces = 4
    • # Bits in Branch History = 12, # of BT Entries = 4096
    • # Bits in Indirect History = 10, # of IT Entries = 1024
    • Machine Width = 4, Pipeline Depth = 15
    • Max Issue Window = 64, # of Physical Registers = 512
    • # Instructions Simulated = ~40M
    • L1: 32 KB, direct-mapped, 64 B blocks, 1-cycle latency
    • L2: 1 MB, 4-way, 128 B blocks, 10-cycle latency
    • Memory latency = 200 cycles

  11. Benchmarks Tested
  • From SPEC CPU2000:
    • INT: crafty, gcc
    • FP: ammp, equake
  • Benchmark configurations:
    • Heterogeneous threads: each thread runs one of the listed benchmarks, to simulate a multi-tasking environment
    • Homogeneous threads: each thread runs a different copy of the same benchmark (crafty), to simulate a multithreaded server environment

  12. Shared Configuration

  13. Split Branch Configuration
  [figure: threads 0–3 each with their own predictor and history register]
  • Each predictor block retains its original size when duplicated

  14. Split Branch Table Configuration
  [figure: indexing includes the thread ID]

  15. Split History Configuration
  [figure: indexing includes the thread ID]
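The split history configuration above can be sketched as a gshare predictor in which only the history register is replicated per thread while the counter table stays shared. This is a minimal sketch assuming the thread/history/table sizes from slide 10; the function names are illustrative, not from the paper's simulator.

```c
#include <stdint.h>

#define NUM_THREADS 4                   /* 4 hardware threads (slide 10) */
#define HIST_BITS   12                  /* 12 bits of branch history */
#define HIST_MASK   ((1u << HIST_BITS) - 1)
#define BHT_ENTRIES 4096
#define BHT_MASK    (BHT_ENTRIES - 1)

/* The table of 2-bit counters is shared by all threads; only the small
   global-history register is duplicated, one copy per thread. */
static uint8_t  bht[BHT_ENTRIES];       /* 2-bit saturating counters, 0..3 */
static uint32_t history[NUM_THREADS];   /* one history register per thread */

/* Gshare lookup hashing the branch PC with the requesting thread's own
   history, so threads no longer corrupt each other's history bits. */
static int predict(int tid, uint32_t pc) {
    uint32_t idx = ((pc >> 2) ^ history[tid]) & BHT_MASK;
    return bht[idx] >= 2;               /* counter in upper half: predict taken */
}

/* On resolution, update the shared counter and shift the outcome into the
   resolving thread's private history register. */
static void update(int tid, uint32_t pc, int taken) {
    uint32_t idx = ((pc >> 2) ^ history[tid]) & BHT_MASK;
    if (taken  && bht[idx] < 3) bht[idx]++;
    if (!taken && bht[idx] > 0) bht[idx]--;
    history[tid] = ((history[tid] << 1) | (taken ? 1u : 0u)) & HIST_MASK;
}
```

The split branch configuration would instead replicate `bht` as well (four full-size tables), which is why the split history variant delivers similar accuracy at a fraction of the storage cost.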

  16. Talk Outline
  • Introduction & Motivation
  • SMT Overview
  • Branch Prediction Overview
  • Test Methodology
  • Results
  • Conclusions

  17. Split Branch Predictor Accuracy
  • Full predictor split: predictors behave as expected, as they would in a single-threaded environment

  18. Shared Branch Predictor Accuracy
  • Shared predictor: accuracy suffers because of interference from other threads (especially Gshare)

  19. Prediction Accuracy: Heterogeneous Threads
  • YAGS & Gshare:
    • Sharing the history register performs very poorly
    • The split history configuration performs almost as well as the split branch configuration while using significantly fewer resources
  • 2-Bit: splitting the predictor performs better; mispredictions drop from 9.52% to 8.35%

  20. Prediction Accuracy: Homogeneous Threads
  • YAGS & Gshare:
    • Configurations perform similarly to the heterogeneous-thread case
    • The split history configuration comes even closer to the split branch configuration because of positive aliasing in the BHT
  • Surprisingly, splitting portions of the predictor still performs better even when each thread runs the same program

  21. Per-Thread CPI: Heterogeneous Threads
  • Sharing the history register with Gshare has a significant negative effect on performance (near 50% mispredictions)
  • The split history configuration produces almost the same performance as the split branch configuration while using significantly fewer resources

  22. Per-Thread CPI: Homogeneous Threads
  • Per-thread performance is worse in the homogeneous configuration because the crafty benchmark has the highest number of cache misses

  23. Performance Across Predictors
  • Branch prediction scheme has little effect on performance
  • CPI increases by only 2.75% and 5% when the Gshare and 2-bit predictors are used instead of the much more expensive YAGS
  • The increases are 6% and 11% in a single-threaded machine
  • The heterogeneous thread configuration performs similarly

  24. Performance Across Predictors
  • The split history configuration still lets performance hold up for simpler schemes
  • CPI increases by 4% and 6.25% for the Gshare and 2-bit schemes compared to YAGS
  • Simpler schemes allow reduced cycle time and power consumption
  • CPI numbers are only close estimates because the simulations are not deterministic

  25. Talk Outline
  • Introduction & Motivation
  • SMT Overview
  • Branch Prediction Overview
  • Test Methodology
  • Results
  • Conclusions

  26. Conclusions
  • Multithreaded execution interferes with branch prediction accuracy
  • Prediction accuracy trends are similar across homogeneous and heterogeneous thread test cases
  • Splitting only the branch history gives the best branch prediction accuracy and performance per resource
  • Performance (CPI) is relatively stable even when the branch prediction structure is simplified
