CS 7960-4 Lecture 18
Simultaneous Multithreading: Maximizing On-Chip Parallelism
D.M. Tullsen, S.J. Eggers, H.M. Levy, Proceedings of ISCA-22, June 1995
Processor Under-Utilization
• Wide gap between average processor utilization and peak processor utilization
• Caused by dependences, long-latency instrs, branch mispredicts
• Results in many idle cycles for many structures
Superscalar Utilization
[Figure: issue slots used by Thread-1; axes: Time vs. Resources (e.g. FUs); legend: V waste, H waste]
• Suffers from horizontal waste (can't find enough work in a cycle) and vertical waste (because of dependences, there is nothing to do for many cycles)
• Utilization = 19%
• vertical:horizontal waste = 61:39
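The two kinds of waste above can be counted from a per-cycle issue trace. The sketch below is only an illustration: the function name `waste_breakdown` and the trace values are made up, and it assumes a cycle with zero issued instructions counts entirely as vertical waste while a partly filled cycle contributes its empty slots to horizontal waste.

```python
def waste_breakdown(issued_per_cycle, width):
    """Split issue slots into used, vertical-waste, and horizontal-waste fractions."""
    total_slots = width * len(issued_per_cycle)
    used = sum(issued_per_cycle)
    # A cycle that issues nothing wastes every slot: vertical waste.
    vertical = sum(width for n in issued_per_cycle if n == 0)
    # A cycle that issues something but not `width` wastes the rest: horizontal waste.
    horizontal = sum(width - n for n in issued_per_cycle if n > 0)
    return {
        "utilization": used / total_slots,
        "vertical": vertical / total_slots,
        "horizontal": horizontal / total_slots,
    }

# Hypothetical trace: instructions issued in each of 8 cycles on a 4-wide machine.
trace = [2, 0, 0, 1, 4, 0, 3, 0]
result = waste_breakdown(trace, width=4)
print(result)
```

The three fractions always sum to 1, so the 61:39 ratio on the slide is the split of the non-utilized slots between the last two entries.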
Chip Multiprocessors
[Figure: issue slots split between Thread-1 and Thread-2; axes: Time vs. Resources (e.g. FUs); legend: V waste, H waste]
• Single-thread performance goes down
• Horizontal waste reduces
Fine-Grain Multithreading
[Figure: issue slots alternating between Thread-1 and Thread-2; axes: Time vs. Resources (e.g. FUs); legend: V waste, H waste]
• Low-cost context-switch at a fine grain
• Reduces vertical waste
Simultaneous Multithreading
[Figure: issue slots shared by Thread-1 and Thread-2 within a cycle; axes: Time vs. Resources (e.g. FUs); legend: V waste, H waste]
• Reduces vertical and horizontal waste
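A toy model can make the difference between the four organizations concrete. This is a sketch under strong assumptions, not the paper's simulator: `ready[t][c]` is a made-up count of instructions thread `t` could issue in cycle `c`, unissued work is not carried over to later cycles, the CMP splits the issue width evenly across cores, and fine-grain multithreading idealistically picks the thread with the most ready work each cycle.

```python
def utilization(issued, width):
    """Fraction of issue slots filled over the whole trace."""
    return sum(issued) / (width * len(issued))

def superscalar(ready, width):
    # Only thread 0 runs on the wide core.
    return [min(width, n) for n in ready[0]]

def chip_multiprocessor(ready, width):
    # Each thread gets its own narrower core (width split evenly: an assumption).
    core_width = width // len(ready)
    cycles = len(ready[0])
    return [sum(min(core_width, t[c]) for t in ready) for c in range(cycles)]

def fine_grain(ready, width):
    # One thread owns the whole core in a given cycle.
    cycles = len(ready[0])
    return [min(width, max(t[c] for t in ready)) for c in range(cycles)]

def smt(ready, width):
    # All threads may fill slots in the same cycle.
    cycles = len(ready[0])
    return [min(width, sum(t[c] for t in ready)) for c in range(cycles)]

WIDTH = 4
ready = [[2, 0, 0, 1],   # hypothetical thread 0
         [1, 3, 0, 2]]   # hypothetical thread 1

for model in (superscalar, chip_multiprocessor, fine_grain, smt):
    print(model.__name__, utilization(model(ready, WIDTH), WIDTH))
```

In this trace fine-grain multithreading fills the cycle where thread 0 has nothing to do (vertical waste), while SMT additionally tops up partly filled cycles (horizontal waste).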
Pipeline Structure
[Figure: four front ends feeding one shared execution engine]
• Private/shared front-end: I-Cache, Bpred
• Private front-end: Rename, ROB
• Shared exec engine: Regs, IQ, DCache, FUs
Chip Multi-Processor
[Figure: four front ends, each feeding its own execution engine]
• Private front-end: I-Cache, Bpred
• Private front-end: Rename, ROB
• Private exec engine: Regs, IQ, DCache, FUs
SMT vs. CMP
• In CMP, each thread has 50 regs, in-flight window of 18, 10 issue queue entries, 8KB data cache
• In SMT, two threads can have 60 regs, in-flight window of 28, 15 issue queue entries, 12KB cache, while the other two threads have 40 regs, window of 8, 5 issue queue entries, 4KB cache
• CMP is easier to design, has higher clock speed (?)
• SMT has higher utilization and higher single-thread IPC
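The numbers in this comparison describe the same total hardware budget divided two ways. The check below assumes four threads (matching the four front ends in the pipeline figures) and reads the second set of SMT numbers as the allocation for the remaining two threads; the dictionary names are illustrative only.

```python
# Per-thread resources in the CMP: a fixed, equal partition.
cmp_per_thread = {"regs": 50, "window": 18, "iq": 10, "dcache_kb": 8}

# One possible dynamic split in the SMT: two threads get more, two get less.
smt_large = {"regs": 60, "window": 28, "iq": 15, "dcache_kb": 12}
smt_small = {"regs": 40, "window": 8, "iq": 5, "dcache_kb": 4}

THREADS = 4  # assumption: four hardware threads

cmp_total = {k: THREADS * v for k, v in cmp_per_thread.items()}
smt_total = {k: 2 * smt_large[k] + 2 * smt_small[k] for k in smt_large}

print(cmp_total)
print(smt_total)
```

Both totals come out the same, so the point of the slide is that SMT can shift a fixed pool toward the threads that can use it, whereas CMP cannot.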
Impact on Clock Speed
• To support four threads, you need four times as many registers, and potentially larger queues and caches; in other words, the execution engine is much larger
• The execution engine can be clustered
• SMT and CMP start looking very similar
• SMT allows a thread to span multiple clusters
• SMT improves performance by investing in interconnects between processing units
Clustered SMT
[Figure: four front ends connected to clustered exec engines]
Evaluated Models
• Fine-Grained Multithreading
• Unrestricted SMT
• Restricted SMT
  • X-issue: a thread can only issue up to X instrs in a cycle
  • Limited connection: each thread is tied to a fixed FU
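The X-issue restriction can be sketched as a per-thread cap applied while filling one cycle's issue slots. This is an illustrative model only: the function `issue_cycle`, the fixed priority order (thread 0 first), and the ready counts are assumptions, and the limited-connection model is not represented here.

```python
def issue_cycle(ready, width, per_thread_cap=None):
    """Return how many instructions each thread issues in one cycle.

    ready: instructions each thread could issue, in priority order (assumed).
    per_thread_cap: X for the X-issue model, or None for unrestricted SMT.
    """
    issued = []
    free = width
    for n in ready:
        limit = n if per_thread_cap is None else min(n, per_thread_cap)
        take = min(limit, free)  # cannot exceed the slots still free this cycle
        issued.append(take)
        free -= take
    return issued

ready = [3, 2, 2]  # hypothetical ready counts for three threads
print(issue_cycle(ready, width=4))                    # unrestricted SMT
print(issue_cycle(ready, width=4, per_thread_cap=2))  # dual issue
print(issue_cycle(ready, width=4, per_thread_cap=1))  # single issue
```

With a tighter cap the highest-priority thread gives up slots, so lower-priority threads get to issue but each individual thread is slowed.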
Results
• SMT nearly eliminates horizontal waste
• In spite of priorities, single-thread performance degrades (cache contention)
• Not much difference between private and shared caches; however, with few threads, the private caches go under-utilized
Comparison of Models
[Slide content not recovered]
CMP vs. SMT
[Slide content not recovered]
Next Class’ Paper
• “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor”, D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.M. Stamm, Proceedings of ISCA-23, May 1996