A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

Ayose Falcón, Alex Ramirez, Mateo Valero
HPCA-10, February 18, 2004

Simultaneous Multithreading
• SMT [Tullsen95] / Multistreaming [Yamamoto95]
  • Instructions from different threads coexist in each processor stage
  • Resources are shared among different threads
• But…
  • Sharing implies competition
    • In caches, queues, FUs, …
  • Fetch policy decides!
A Low-Complexity, High-Performance Fetch Unit for SMT Processors

Motivation
• SMT performance is limited by fetch performance
  • A superscalar fetch is not enough to feed an aggressive SMT core
  • SMT fetch is a bottleneck [Tullsen96] [Burns99]
• Straightforward solution: fetch from several threads each cycle
  a) Multiple fetch units (1 per thread) → EXPENSIVE!
  b) Shared fetch + fetch policy [Tullsen96]
    • Multiple PCs
    • Multiple branch predictions per cycle
    • Multiple I-cache accesses per cycle
• Does the performance of this fetch organization compensate for its complexity?

Talk Outline
• Motivation
• Fetch Architectures for SMT
• High-Performance Fetch Engines
• Simulation Setup
• Results
• Summary & Conclusions

Fetching from a Single Thread (1.X)
• Fine-grained, non-simultaneous sharing
• Simple → similar to a superscalar fetch unit
  • No additional HW needed
• A fetch policy is needed
  • Decides fetch priority among threads
  • Several proposals in the literature
[Figure: 1.X fetch unit with instruction cache, branch predictor, and SHIFT&MASK logic]

Fetching from a Single Thread (1.X)
• But… a single thread is not enough to fill the fetch BW
  • Gshare / hybrid branch predictor + BTB limits fetch width to one basic block per cycle (6-8 instructions)
• Fetch BW is heavily underused
  • Avg 40% wasted with 1.8
  • Avg 60% wasted with 1.16
• The fetch BW is fully used in only
  • 31% of fetch cycles with 1.8
  • 6% of fetch cycles with 1.16
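The two metrics on this slide (wasted fetch bandwidth and cycles that fill the whole fetch width) can be illustrated with a minimal sketch. It assumes a simplified model in which the predictor supplies one fetch block per cycle and anything beyond the fetch width is cut off. The function name and the block lengths are hypothetical and are not data from the talk.

```python
def fetch_bandwidth_stats(block_lengths, fetch_width):
    """Summarise fetch-slot usage when one fetch block is delivered per cycle.

    block_lengths: hypothetical list, one entry per fetch cycle, giving the
        number of instructions the predictor could supply in that cycle.
    fetch_width: maximum instructions fetched per cycle (8 for 1.8, 16 for 1.16).
    Returns (fraction of fetch slots wasted, fraction of cycles that fill the width).
    """
    if not block_lengths:
        raise ValueError("need at least one fetch cycle")
    # A block longer than the fetch width is truncated to the width.
    delivered = [min(n, fetch_width) for n in block_lengths]
    total_slots = fetch_width * len(delivered)
    wasted_fraction = 1 - sum(delivered) / total_slots
    full_cycle_fraction = sum(n == fetch_width for n in delivered) / len(delivered)
    return wasted_fraction, full_cycle_fraction


# Made-up block lengths, for illustration only.
blocks = [8, 4, 6, 12, 2, 8]
print(fetch_bandwidth_stats(blocks, 8))
print(fetch_bandwidth_stats(blocks, 16))
```

With the same blocks, widening the fetch from 8 to 16 raises the wasted fraction, which is the effect the slide reports for short basic blocks.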

Fetching from Multiple Threads (2.X)
• Increases fetch throughput
  • More threads → more possibilities to fill the fetch BW
• More fetch BW use than 1.X
• The fetch BW is fully used in
  • 54% of cycles with 2.8
  • 16% of cycles with 2.16
[Figure labels: 33%, 28%]

Fetching from Multiple Threads (2.X)
• But… what is the additional HW cost of a 2.X fetch?
  • Multibanked + multiported instruction cache
  • Branch predictor: 2 predictions per cycle + 2 ports
  • Replication of SHIFT & MASK logic
  • New HW to realign and merge cache lines (MERGE)
[Figure: 2.X fetch unit with a two-bank instruction cache, a two-bank branch predictor, two SHIFT&MASK units, and MERGE logic]
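The merge step can be sketched as follows, assuming the common 2.X behaviour in which the higher-priority thread takes as many fetch slots as it can and the second thread fills whatever is left. The function and the instruction labels are hypothetical; the sketch models only the selection, not the realignment hardware.

```python
def merge_fetch(first, second, fetch_width):
    """Fill the fetch width from the first thread, then top up from the second.

    first, second: instructions available this cycle from the highest- and
        second-highest-priority threads (hypothetical labels).
    fetch_width: total instructions that can be fetched per cycle.
    """
    taken_first = first[:fetch_width]
    remaining = fetch_width - len(taken_first)
    # When the first thread fills the width, remaining is 0 and nothing is added.
    taken_second = second[:remaining]
    return taken_first + taken_second


thread_a = ["A0", "A1", "A2"]                 # short fetch block
thread_b = [f"B{i}" for i in range(8)]
print(merge_fetch(thread_a, thread_b, 8))
```

Here the second thread contributes five instructions that a 1.8 fetch would have left as empty slots.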

Our Goal
• Can we take the best of both worlds?
  • Low complexity of a 1.X fetch architecture +
  • High performance of a 2.X fetch architecture
• That is… can a single thread provide sufficient instructions to fill the available fetch bandwidth?

High Performance Fetch Engines (I)
• We look for high performance
  • Gshare / hybrid branch predictor + BTB → low performance
    • Limits fetch BW to one basic block per cycle (6-8 instructions)
• We look for low complexity
  • Trace cache, Branch Target Address Cache, Collapsing Buffer, etc. → high complexity
    • Fetch multiple basic blocks per cycle (12-16 instructions)

High Performance Fetch Engines (II)
• Our alternatives
  • Gskew [Michaud97] + FTB [Reinman99]
    • FTB fetch blocks are larger than basic blocks
    • 5% speedup over gshare+BTB in superscalars
  • Stream Predictor [Ramirez02]
    • Streams are larger than FTB fetch blocks
    • 11% speedup over gskew+FTB in superscalars
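Why larger fetch units help can be seen by cutting the same dynamic instruction sequence in two ways. The sketch below assumes the usual definitions, which the slide itself does not spell out: a basic block ends at every branch, while a stream runs from the target of one taken branch to the next taken branch. The trace encoding and the helper names are hypothetical.

```python
def segment_lengths(trace, ends_segment):
    """Split a dynamic trace into segments and return their lengths.

    trace: list of (is_branch, taken) tuples, one per executed instruction.
    ends_segment: predicate that is True for the last instruction of a segment.
    """
    lengths, current = [], 0
    for instr in trace:
        current += 1
        if ends_segment(instr):
            lengths.append(current)
            current = 0
    if current:                      # trailing instructions with no terminator
        lengths.append(current)
    return lengths


def ends_basic_block(instr):
    is_branch, _taken = instr
    return is_branch                 # any branch closes a basic block


def ends_stream(instr):
    is_branch, taken = instr
    return is_branch and taken       # only a taken branch closes a stream


N, NT, T = (False, False), (True, False), (True, True)
trace = [N, N, NT, N, N, N, T, N, NT, N, T]   # made-up trace
print(segment_lengths(trace, ends_basic_block))
print(segment_lengths(trace, ends_stream))
```

Not-taken branches do not break a stream, so a stream can span several basic blocks and come closer to filling a 16-wide fetch.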

Simulation Setup
• Modified version of SMTSIM [Tullsen96]
  • Trace-driven, allowing wrong-path execution
• Decoupled fetch (1 additional pipeline stage)
• Branch predictor sizes of approx. 45KB
• Decode & rename width limited to 8 instructions
• Fetch width 8/16 inst.
• Fetch buffer 32 inst.

Workloads
• SPECint2000
• Code layout optimized
  • Spike [Cohn97] + profile data using train input
• Most representative 300M instruction trace
  • Using ref input
• Workloads including 2, 4, 6, and 8 threads
• Classified according to thread characteristics:
  • ILP: only ILP benchmarks
  • MEM: memory-bounded benchmarks
  • MIX: mix of ILP and MEM benchmarks

Talk Outline
• Motivation
• Fetch Architectures for SMT
• High-Performance Fetch Engines
• Simulation Setup
• Results: only for 2 & 4 threads (see paper for the rest)
  • ILP workloads
  • MEM & MIX workloads
• Summary & Conclusions

ILP Workloads - Fetch Throughput
[Figure: fetch throughput]
• With a given fetch bandwidth, fetching from two threads always benefits fetch performance
• Critical point is 1.16
  • Stream predictor → better fetch performance than 2.8
  • Gshare+BTB / gskew+FTB → worse fetch performance than 2.8

ILP Workloads – 1.X (1.8) vs 2.X (2.8)
[Figure: commit throughput]
• ILP benchmarks have few memory problems and high parallelism
  • Fetch unit is the real limiting factor
  • The higher the fetch throughput, the higher the IPC

ILP Workloads
• So… 2.X is better than 1.X in ILP workloads…
• But what about 1.2X instead of 2.X?
  • That is, 1.16 instead of 2.8
• Maintain single-thread fetch
• Cache lines and buses are already 16 instructions wide
• We have to modify the HW to select 16 instead of 8 instructions
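The selection step that has to grow from 8 to 16 instructions can be sketched as a shift followed by a mask. This is one reading of the SHIFT&MASK logic named in the figures, not a description of the actual circuit: the shift discards instructions before the fetch address, and the mask keeps only as many as the predicted fetch block and the fetch width allow. All names are hypothetical.

```python
def shift_and_mask(cache_line, start, length, fetch_width):
    """Select the instructions to fetch from one cache line.

    cache_line: instructions in the line (16 in the configuration on the slide).
    start: offset of the fetch address within the line.
    length: predicted fetch-block length in instructions.
    fetch_width: maximum instructions selected per cycle (8 or 16).
    """
    shifted = cache_line[start:]                      # SHIFT: align to the fetch address
    valid = min(length, fetch_width, len(shifted))    # slots that stay enabled
    return shifted[:valid]                            # MASK: drop everything else


line = [f"I{i}" for i in range(16)]
print(shift_and_mask(line, start=3, length=10, fetch_width=8))
print(shift_and_mask(line, start=3, length=10, fetch_width=16))
```

With a 10-instruction fetch block, the 8-wide selector has to cut it short while the 16-wide selector delivers it in one cycle, using the same cache line.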

ILP Workloads – 2.X (2.8) vs 1.2X (1.16)
[Figure: commit throughput; callout: similar/better performance than 2.16!]
• With 1.16, stream predictor increases throughput (9% avg)
  • Streams are long enough for a 16-wide fetch
• Fetching a single block per cycle is not enough
  • Gshare+BTB → 10% slowdown
  • Gskew+FTB → 4% slowdown

MEM & MIX Workloads - Fetch Throughput
[Figure: fetch throughput]
• Same trend compared to ILP fetch throughput
  • For a given fetch BW, fetching from two threads is better
  • Stream > gskew + FTB > gshare + BTB

MEM & MIX Workloads – 1.X (1.8) vs 2.X (2.8)
[Figure: commit throughput]
• With memory-bounded benchmarks… overall performance actually decreases!!
  • Memory-bounded threads monopolize resources for many cycles
• Previously identified → new fetch policies
  • Flush [Tullsen01] or stall [Luo01, El-Moursy03] problematic threads

MEM & MIX workloads
• Fetching from only one thread means fetching only from the first, highest-priority thread
  • Allows the highest-priority thread to proceed with more resources
  • Prevents low-quality (lower-priority) threads from monopolizing more and more resources on cache misses
    • Registers, IQ slots, etc.
• Only the highest-priority thread is fetched
  • When the cache miss is resolved, instructions from the second thread will be consumed
  • ICOUNT will give it more priority after the cache miss resolution
• A powerful fetch unit can be harmful if not well used
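The priority mechanism can be sketched with the ICOUNT heuristic, which favours the threads with the fewest instructions waiting in the pre-issue stages of the pipeline. The code below is a minimal model of that ordering with hypothetical names and made-up counts. Ties are broken by thread id here, which is an arbitrary choice made for the example.

```python
def icount_order(pre_issue_counts):
    """Order thread ids by priority: fewest pre-issue instructions first.

    pre_issue_counts: dict mapping thread id -> number of that thread's
        instructions currently in the pre-issue stages (hypothetical values).
    """
    return sorted(pre_issue_counts, key=lambda tid: (pre_issue_counts[tid], tid))


def select_fetch_threads(pre_issue_counts, threads_per_cycle):
    """Pick the threads to fetch from this cycle (1 for 1.X, 2 for 2.X)."""
    return icount_order(pre_issue_counts)[:threads_per_cycle]


# Threads 1 and 3 are stalled on cache misses, so their instructions pile up.
counts = {0: 12, 1: 30, 2: 5, 3: 30}
print(select_fetch_threads(counts, 1))   # 1.X: only the top-priority thread
print(select_fetch_threads(counts, 2))   # 2.X: the two top-priority threads
```

Once a miss resolves and the stalled thread drains its queued instructions, its count falls and it moves up the ordering, which matches the behaviour described on the slide.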

MEM & MIX workloads – 1.X (1.8) vs 1.2X (1.16)
[Figure: commit throughput]
• Even 2.16 has worse commit performance than 1.8
  • More interference introduced by low-quality threads
• Overall, 1.16 is the best combination
  • Low complexity → fetching from one thread
  • High performance → wide fetch

Summary
• Fetch unit is the most significant obstacle to obtaining high SMT performance
• However, researchers usually don’t care about SMT fetch performance
  • They care about how to combine threads to maintain available fetch throughput
  • A simple gshare/hybrid + BTB is commonly used
  • Everybody assumes that 2.8 (2.X) is the correct answer
• Fetching from many threads can be counterproductive
  • Sharing implies competing
  • Low-quality threads monopolize more and more resources

Conclusions
• 1.16 (1.2X) is the best fetch option
  • Using a high-width fetch architecture
    • It’s not the prediction accuracy, it’s the fetch width
• Beneficial for both ILP and MEM workloads
  • 1.X is bad for ILP
  • 2.X is bad for MEM
• Fetches only from the most promising thread (according to fetch policy), and as much as possible
  • Offers the best performance/complexity tradeoff
• Fetching from a single thread may require revisiting current SMT fetch policies

Thanks
Questions & Answers

Backup Slides

SMT Workloads

Simulation Setup
