290 likes | 304 Views
Learn how Simultaneous Multithreading (SMT) maximizes on-chip parallelism, reduces waste, and improves throughput in processor designs. Discover strategies to minimize idle cycles and optimize performance.
E N D
CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995
Processor Under-Utilization • Wide gap between average processor utilization • and peak processor utilization • Caused by dependences, long latency instrs, • branch mispredicts • Results in many idle cycles for many structures
Superscalar Utilization Thread-1 Time V waste H waste • Suffers from horizontal waste • (can’t find enough work in a cycle) • and vertical waste (because of • dependences, there is nothing to • do for many cycles) • Utilization=19% • vertical:horizontal waste = 61:39 Resources (e.g. FUs)
Chip Multiprocessors Thread-1 Thread-2 Time V waste H waste • Single-thread performance goes • down • Horizontal waste reduces Resources (e.g. FUs)
Fine-Grain Multithreading Thread-1 Thread-2 Time V waste H waste • Low-cost context-switch at a fine • grain • Reduces vertical waste Resources (e.g. FUs)
Simultaneous Multithreading Thread-1 Thread-2 Time V waste H waste • Reduces vertical and horizontal • waste Resources (e.g. FUs)
Pipeline Structure Private/ Shared Front-end I-Cache Bpred Front End Front End Front End Front End Private Front-end Rename ROB Execution Engine Regs IQ Shared Exec Engine DCache FUs What about RAS, LSQ?
Chip Multi-Processor Private Front-end I-Cache Bpred Front End Front End Front End Front End Private Front-end Rename ROB Exec Engine Exec Engine Exec Engine Exec Engine Regs IQ Private Exec Engine DCache FUs
Clustered SMT Front End Front End Front End Front End Clusters
Evaluated Models • Fine-Grained Multithreading • Unrestricted SMT • Restricted SMT • X-issue: A thread can only issue up to X instrs in a cycle • Limited connection: each thread is tied to a fixed FU
Results • SMT nearly eliminates horizontal waste • In spite of priorities, single-thread performance degrades (cache contention) • Not much difference between private and shared caches – however, with • few threads, the private caches go under-utilized
Comparison of Models • Bullet
CS 7810 Lecture 16 Exploiting Choice: Instruction Fetch and Issue on an Implementable SMT Processor D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm Proceedings of ISCA-23 June 1996
New Bottlenecks • Instruction fetch has a strong influence on total • throughput • if the execution engine is executing at top speed, it is often hungry for new instrs • some threads are more likely to have ready instrs than others – selection becomes important
SMT Processor Multiple RAS Multiple PCs More registers Multiple Renames and ROBs
SMT Overheads • Large register file – need at least 256 physical • registers to support eight threads • increases cycle time/pipeline depth • increases mispredict penalty • increases bypass complexity • increases register lifetime • Results in 2% performance loss
Base Design • Front-end is fine-grain multithreaded, rest is SMT • Bottlenecks: • Low fetch rate (4.2 instrs/cycle) • IQ is often full, but only half the issue bandwidth is being used
Fetch Efficiency • Base case uses RoundRobin.1.8 • RR.2.4: fetches four instrs each from two threads • requires a banked organization • requires additional multiplexing logic • Increases the chances of finding eight instrs without • a taken branch • Yields instrs in spite of an I-cache miss • RR.2.8: extends RR.2.4 by reading out larger line
Fetch Effectiveness • Are we picking the best instructions? • IQ-clog: instrs that sit in the issue queue for ages; • does it make sense to fetch their dependents? • Wrong-path instructions waste issue slots • Ideally, we want useful instructions that have short • issue queue lifetimes
Fetch Effectiveness • Useful instructions: throttle fetch if branch mpred • probability is high confidence, num-branches • (BRCOUNT), in-flight window size • Short lifetimes: throttle fetch if you encounter a • cache miss (MISSCOUNT), give priority to threads • that have young instrs (IQPOSN)
ICOUNT • ICOUNT: priority is based on number of unissued • instrs everyone gets a share of the issueq • Long-latency instructions will not dominate the IQ • Threads that have high issue rate will also have • high fetch rate • In-flight windows are short and wrong-path instrs • are minimized • Increased fairness more ready instrs per cycle
Results Thruput has gone from 2.2 (single-thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)
Reducing IQ-clog • IQBUF: a buffer before the issue queue • ITAG: pre-examine the tags to detect I-cache • misses and not waste fetch bandwidth • OPT_last and SPEC_last: lower issue priority • for speculative instrs • These techniques entail overheads and result in • minor improvements
Bottleneck Analysis • The following are not bottlenecks: issue bandwidth, • issue queue size, memory thruput • Doubling fetch bandwidth improves thruput by • 8% -- there is still room for improvement • SMT is more tolerant of branch mpreds: perfect • prediction improves 1-thread by 25% and 8-thread • by 9% -- no speculation has a similar effect • Register file can be a huge bottleneck
Power and Energy • Energy is heavily influenced by “work done” and • by execution time compared to a single-thread • machine, SMT does not reduce “work done”, but • reduces execution time reduced energy • Same work, less time higher power!
Title • Bullet