
A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Ayose Falcón, Alex Ramirez, Mateo Valero. HPCA-10, February 18, 2004.


Presentation Transcript


  1. A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors
     Ayose Falcón, Alex Ramirez, Mateo Valero
     HPCA-10, February 18, 2004

  2. Simultaneous Multithreading
     • SMT [Tullsen95] / Multistreaming [Yamamoto95]
         • Instructions from different threads coexist in each processor stage
         • Resources are shared among different threads
     • But…
         • Sharing implies competition
         • In caches, queues, FUs, …
     • Fetch policy decides!
     A Low-Complexity, High-Performance Fetch Unit for SMT Processors

  3. Motivation
     • SMT performance is limited by fetch performance
         • A superscalar fetch is not enough to feed an aggressive SMT core
         • SMT fetch is a bottleneck [Tullsen96] [Burns99]
     • Straightforward solution: fetch from several threads each cycle
         a) Multiple fetch units (1 per thread) → EXPENSIVE!
         b) Shared fetch + fetch policy [Tullsen96]
             • Multiple PCs
             • Multiple branch predictions per cycle
             • Multiple I-cache accesses per cycle
     • Does the performance of this fetch organization compensate for its complexity?

  4. Talk Outline
     • Motivation
     • Fetch Architectures for SMT
     • High-Performance Fetch Engines
     • Simulation Setup
     • Results
     • Summary & Conclusions

  5. Fetching from a Single Thread (1.X)
     (1.X: fetch from one thread per cycle, up to X instructions)
     • Fine-grained, non-simultaneous sharing
     • Simple → similar to a superscalar fetch unit
         • No additional HW needed
     • A fetch policy is needed
         • Decides fetch priority among threads
         • Several proposals in the literature
     [Figure labels: Instruction Cache, Branch Predictor, SHIFT&MASK]

  6. Fetching from a Single Thread (1.X)
     • But…a single thread is not enough to fill the fetch BW
         • Gshare / hybrid branch predictor + BTB limits fetch width to one basic block per cycle (6-8 instructions)
     • Fetch BW is heavily underused
         • Avg 40% wasted with 1.8
         • Avg 60% wasted with 1.16
     • The fetch BW is fully used in only
         • 31% of fetch cycles with 1.8
         • 6% of fetch cycles with 1.16
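The two metrics on this slide (bandwidth wasted, and cycles that fill the whole fetch width) can be illustrated with a small sketch. It assumes one fetch block per cycle, truncated to the fetch width; the function name `fetch_stats` and the block lengths are hypothetical and are not taken from the talk's simulator or results.

```python
def fetch_stats(block_lengths, width):
    """Return (fraction of fetch slots wasted, fraction of cycles using the full width).

    Assumption: a 1.X fetch unit delivers at most one fetch block per cycle,
    and a block longer than the fetch width is simply truncated.
    """
    fetched = [min(n, width) for n in block_lengths]
    wasted = 1 - sum(fetched) / (width * len(fetched))
    full = sum(1 for n in fetched if n == width) / len(fetched)
    return wasted, full


# Hypothetical basic-block lengths (in instructions), one per fetch cycle.
blocks = [4, 6, 8, 6, 8]

for width in (8, 16):
    wasted, full = fetch_stats(blocks, width)
    print(f"1.{width}: {wasted:.0%} of slots wasted, {full:.0%} of cycles full")
```

With blocks of 6-8 instructions, widening the fetch from 8 to 16 mostly adds empty slots, which is the effect the slide reports for 1.16.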

  7. Fetching from Multiple Threads (2.X)
     • Increases fetch throughput
         • More threads → more possibilities to fill the fetch BW
     • More fetch BW use than 1.X
     • The fetch BW is fully used in
         • 54% of cycles with 2.8
         • 16% of cycles with 2.16
     [Figure annotations: 33%, 28%]
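A minimal sketch of why a second thread helps fill the fetch bandwidth: the slots left over by the first thread's block go to the next thread in priority order. The function `fetch_cycle` and the block lengths are illustrative assumptions, not the simulator used in the talk.

```python
def fetch_cycle(candidate_blocks, threads_per_cycle, width):
    """Instructions fetched per thread in one cycle of a T.W fetch unit.

    candidate_blocks: length of the next fetch block of each thread,
    already sorted by fetch-policy priority (hypothetical input).
    """
    remaining = width
    fetched = []
    for block in candidate_blocks[:threads_per_cycle]:
        take = min(block, remaining)  # a thread cannot exceed the slots left
        fetched.append(take)
        remaining -= take
    return fetched


blocks = [6, 7]                       # two threads, one block each
one_thread = fetch_cycle(blocks, 1, 8)    # 1.8 leaves 2 slots empty
two_threads = fetch_cycle(blocks, 2, 8)   # 2.8 fills them from thread 2
print(sum(one_thread), sum(two_threads))
```

The sketch ignores the hardware needed to realign and merge the two cache lines, which is exactly the extra cost a 2.X design has to pay.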

  8. Fetching from Multiple Threads (2.X)
     • But…what is the additional HW cost of a 2.X fetch?
         • Instruction cache (BANK 1, BANK 2): multibanked + multiported instruction cache
         • Branch predictor: 2 predictions per cycle + 2 ports
         • SHIFT&MASK (×2): replication of SHIFT & MASK logic
         • MERGE: new HW to realign and merge cache lines

  9. Our Goal
     • Can we take the best of both worlds?
         • Low complexity of a 1.X fetch architecture +
         • High performance of a 2.X fetch architecture
     • That is…can a single thread provide sufficient instructions to fill the available fetch bandwidth?

  10. Talk Outline
      • Motivation
      • Fetch Architectures for SMT
      • High-Performance Fetch Engines
      • Simulation Setup
      • Results
      • Summary & Conclusions

  11. High Performance Fetch Engines (I)
      • Gshare / hybrid branch predictor + BTB
          • Low performance
          • Limits fetch BW to one basic block per cycle
          • 6-8 instructions
          • But we look for high performance
      • Trace cache, Branch Target Address Cache, Collapsing Buffer, etc.
          • Fetch multiple basic blocks per cycle
          • 12-16 instructions
          • High complexity
          • But we look for low complexity

  12. High Performance Fetch Engines (II)
      • Our alternatives
      • Gskew [Michaud97] + FTB [Reinman99]
          • FTB fetch blocks are larger than basic blocks
          • 5% speedup over gshare+BTB in superscalars
      • Stream Predictor [Ramirez02]
          • Streams are larger than FTB fetch blocks
          • 11% speedup over gskew+FTB in superscalars
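The size difference between fetch units can be sketched on a toy dynamic trace. The sketch assumes the usual definitions: a basic block ends at any branch, while a stream runs up to and including the next taken branch. The trace encoding and the helper `block_lengths` are hypothetical.

```python
def block_lengths(trace, ends_block):
    """Split a dynamic instruction trace into blocks and return their lengths."""
    lengths, current = [], 0
    for inst in trace:
        current += 1
        if ends_block(inst):
            lengths.append(current)
            current = 0
    if current:  # trailing instructions after the last terminator
        lengths.append(current)
    return lengths


# Toy trace: 'N' = non-branch, 'n' = not-taken branch, 'T' = taken branch.
trace = "NNNnNNTNNNNnNNNT"

basic_blocks = block_lengths(trace, lambda i: i in "nT")  # end at any branch
streams = block_lengths(trace, lambda i: i == "T")        # end at taken branches

print(basic_blocks, streams)
```

Because not-taken branches do not break a stream, streams are at least as long as basic blocks, which is what makes them attractive for a 16-wide fetch from a single thread.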

  13. Talk Outline
      • Motivation
      • Fetch Architectures for SMT
      • High-Performance Fetch Engines
      • Simulation Setup
      • Results
      • Summary & Conclusions

  14. Simulation Setup
      • Modified version of SMTSIM [Tullsen96]
          • Trace-driven, allowing wrong-path execution
      • Decoupled fetch (1 additional pipeline stage)
      • Branch predictor sizes of approx. 45KB
      • Decode & rename width limited to 8 instructions
      • Fetch width 8/16 inst.
      • Fetch buffer 32 inst.

  15. Workloads
      • SPECint2000
      • Code layout optimized
          • Spike [Cohn97] + profile data using train input
      • Most representative 300M instruction trace
          • Using ref input
      • Workloads including 2, 4, 6, and 8 threads
      • Classified according to thread characteristics:
          • ILP: only ILP benchmarks
          • MEM: memory-bounded benchmarks
          • MIX: mix of ILP and MEM benchmarks

  16. Talk Outline
      • Motivation
      • Fetch Architectures for SMT
      • High-Performance Fetch Engines
      • Simulation Setup
      • Results (only for 2 & 4 threads; see paper for the rest)
          • ILP workloads
          • MEM & MIX workloads
      • Summary & Conclusions

  17. ILP Workloads - Fetch Throughput
      [Chart: Fetch Throughput]
      • With a given fetch bandwidth, fetching from two threads always benefits fetch performance
      • Critical point is 1.16
          • Stream predictor → better fetch performance than 2.8
          • Gshare+BTB / gskew+FTB → worse fetch performance than 2.8

  18. ILP Workloads – 1.X (1.8) vs 2.X (2.8)
      [Chart: Commit Throughput]
      • ILP benchmarks have few memory problems and high parallelism
          • Fetch unit is the real limiting factor
          • The higher the fetch throughput, the higher the IPC

  19. ILP Workloads
      • So…2.X is better than 1.X in ILP workloads…
      • But what about 1.2X instead of 2.X?
          • That is, 1.16 instead of 2.8
          • Maintains single-thread fetch
          • Cache lines and buses are already 16 instructions wide
          • We have to modify the HW to select 16 instead of 8 instructions

  20. ILP Workloads – 2.X (2.8) vs 1.2X (1.16)
      [Chart: Commit Throughput; annotation: "Similar/Better performance than 2.16!"]
      • With 1.16, stream predictor increases throughput (9% avg)
          • Streams are long enough for a 16-wide fetch
      • Fetching a single block per cycle is not enough
          • Gshare+BTB → 10% slowdown
          • Gskew+FTB → 4% slowdown

  21. MEM & MIX Workloads - Fetch Throughput
      [Chart: Fetch Throughput]
      • Same trend compared to ILP fetch throughput
          • For a given fetch BW, fetching from two threads is better
          • Stream > gskew + FTB > gshare + BTB

  22. MEM & MIX Workloads – 1.X (1.8) vs 2.X (2.8)
      [Chart: Commit Throughput]
      • With memory-bounded benchmarks…overall performance actually decreases!!
          • Memory-bounded threads monopolize resources for many cycles
      • Previously identified → new fetch policies
          • Flush [Tullsen01] or stall [Luo01, El-Moursy03] problematic threads

  23. MEM & MIX workloads
      • Fetching from only one thread means fetching only from the first, highest-priority thread
          • Allows the highest-priority thread to proceed with more resources
          • Prevents low-quality (lower-priority) threads from monopolizing more and more resources on cache misses
              • Registers, IQ slots, etc.
      • Only the highest-priority thread is fetched
          • When the cache miss is resolved, instructions from the second thread will be consumed
          • ICOUNT will give it more priority after the cache miss is resolved
      • A powerful fetch unit can be harmful if not well used
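The priority mechanism referred to here can be sketched as follows, assuming the usual description of ICOUNT [Tullsen96]: threads with fewer instructions in the pre-issue stages get higher fetch priority. The function `icount_order` and the occupancy numbers are hypothetical.

```python
def icount_order(in_flight):
    """Order thread ids by ICOUNT priority (fewest pre-issue instructions first).

    in_flight: dict mapping thread id -> number of its instructions in the
    decode, rename, and issue-queue stages (hypothetical counts).
    Ties are broken by thread id to keep the order deterministic.
    """
    return sorted(in_flight, key=lambda t: (in_flight[t], t))


# Thread 1 is stalled on a cache miss, so its instructions pile up in the queues.
occupancy = {0: 12, 1: 30, 2: 5}
order = icount_order(occupancy)

fetch_1x = order[:1]  # 1.X: only the highest-priority thread is fetched
fetch_2x = order[:2]  # 2.X: the second-priority thread is also fetched
print(fetch_1x, fetch_2x)
```

Once the miss is resolved and the stalled thread drains its queued instructions, its count drops and it moves back up the order, which matches the behavior described on the slide.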

  24. MEM & MIX workloads – 1.X (1.8) vs 1.2X (1.16)
      [Chart: Commit Throughput]
      • Even 2.16 has worse commit performance than 1.8
          • More interference introduced by low-quality threads
      • Overall, 1.16 is the best combination
          • Low complexity → fetching from one thread
          • High performance → wide fetch

  25. Talk Outline
      • Motivation
      • Fetch Architectures for SMT
      • High-Performance Fetch Engines
      • Simulation Setup
      • Results
      • Summary & Conclusions

  26. Summary
      • The fetch unit is the most significant obstacle to obtaining high SMT performance
      • However, researchers usually don't care about SMT fetch performance
          • They care about how to combine threads to maintain the available fetch throughput
          • A simple gshare/hybrid + BTB is commonly used
      • Everybody assumes that 2.8 (2.X) is the correct answer
      • Fetching from many threads can be counterproductive
          • Sharing implies competing
          • Low-quality threads monopolize more and more resources

  27. Conclusions
      • 1.16 (1.2X) is the best fetch option
          • Using a high-width fetch architecture
          • It's not the prediction accuracy, it's the fetch width
      • Beneficial for both ILP and MEM workloads
          • 1.X is bad for ILP
          • 2.X is bad for MEM
      • Fetches only from the most promising thread (according to fetch policy), and as much as possible
      • Offers the best performance/complexity tradeoff
      • Fetching from a single thread may require revisiting current SMT fetch policies

  28. Thanks. Questions & Answers

  29. Backup Slides

  30. SMT Workloads

  31. Simulation Setup
