Maximizing On-Chip Parallelism Through Simultaneous Multithreading

CS 7960-4 Lecture 18
Simultaneous Multithreading: Maximizing On-Chip Parallelism
D.M. Tullsen, S.J. Eggers, H.M. Levy
Proceedings of ISCA-22, June 1995

Processor Under-Utilization
• Wide gap between average processor utilization and peak processor utilization
• Caused by dependences, long-latency instructions, and branch mispredicts
• Results in many idle cycles for many structures

Superscalar Utilization
[Figure: issue slots used by Thread-1; axes: time, resources (e.g. FUs); vertical (V) and horizontal (H) waste marked]
• Suffers from horizontal waste (can’t find enough work in a cycle) and vertical waste (because of dependences, there is nothing to do for many cycles)
• Utilization = 19%
• Vertical:horizontal waste = 61:39
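The two kinds of waste can be counted directly from a per-cycle issue trace. The sketch below is only an illustration of the definitions on this slide: the function name, the trace, and the 4-wide issue width are all made up for the example and are not the paper's data.

```python
def waste_breakdown(issued_per_cycle, issue_width):
    """Split unused issue slots into vertical and horizontal waste."""
    total_slots = len(issued_per_cycle) * issue_width
    used = sum(issued_per_cycle)
    # A cycle that issues nothing wastes the whole row of slots (vertical waste).
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    # A cycle that issues something, but less than the full width,
    # wastes the remaining slots in that row (horizontal waste).
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    return {
        "utilization": used / total_slots,
        "vertical": vertical,
        "horizontal": horizontal,
    }


# Hypothetical 8-cycle trace on a 4-wide machine (not measured data).
trace = [2, 0, 0, 1, 4, 0, 3, 0]
result = waste_breakdown(trace, issue_width=4)
print(result)
```

In this toy trace, 10 of 32 slots are used, 16 are lost to empty cycles, and 6 are lost to partially filled cycles.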

Chip Multiprocessors
[Figure: issue slots used by Thread-1 and Thread-2; axes: time, resources (e.g. FUs); vertical (V) and horizontal (H) waste marked]
• Single-thread performance goes down
• Horizontal waste is reduced

Fine-Grain Multithreading
[Figure: issue slots used by Thread-1 and Thread-2; axes: time, resources (e.g. FUs); vertical (V) and horizontal (H) waste marked]
• Low-cost context switch at a fine grain
• Reduces vertical waste

Simultaneous Multithreading
[Figure: issue slots used by Thread-1 and Thread-2; axes: time, resources (e.g. FUs); vertical (V) and horizontal (H) waste marked]
• Reduces vertical and horizontal waste
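A toy model makes the difference between the three issue policies concrete. It assumes each thread has a fixed number of ready instructions per cycle and ignores carry-over of unissued instructions, so it is a sketch of the idea rather than a pipeline simulator; the function names and the `ready` table are hypothetical.

```python
def superscalar_issue(ready, width):
    """Only thread 0 runs; unused slots in a cycle stay empty."""
    return [min(per_thread[0], width) for per_thread in ready]


def fine_grain_issue(ready, width):
    """One thread owns all slots each cycle (round-robin, skipping stalled threads)."""
    issued = []
    num_threads = len(ready[0])
    for cycle, per_thread in enumerate(ready):
        count = 0
        for offset in range(num_threads):
            t = (cycle + offset) % num_threads
            if per_thread[t] > 0:
                count = min(per_thread[t], width)
                break
        issued.append(count)
    return issued


def smt_issue(ready, width):
    """Slots in the same cycle may be filled from any mix of threads."""
    return [min(sum(per_thread), width) for per_thread in ready]


# Hypothetical ready-instruction counts: one (thread 0, thread 1) pair per cycle.
ready = [(2, 1), (0, 3), (1, 2), (2, 2)]
width = 4
print(superscalar_issue(ready, width))  # thread 0 alone
print(fine_grain_issue(ready, width))   # fills empty cycles, rows stay partly empty
print(smt_issue(ready, width))          # fills empty cycles and partly empty rows
```

Fine-grain multithreading removes the empty cycle in this example but still leaves slots unused within each cycle; the SMT policy also fills those.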

Pipeline Structure
[Figure: four front ends; I-cache and branch predictor (private or shared front-end); rename and ROB (private front-end); one shared execution engine containing the registers, issue queue, D-cache, and FUs]

Chip Multi-Processor
[Figure: four front ends, each with a private I-cache, branch predictor, rename, and ROB; four execution engines, each with private registers, issue queue, D-cache, and FUs]

SMT Vs. CMP
• In CMP, each thread has 50 regs, an in-flight window of 18, 10 issue queue entries, and an 8KB data cache
• In SMT, two threads can have 60 regs, an in-flight window of 28, 15 issue queue entries, and a 12KB cache, while the other two threads have 40 regs, a window of 8, 5 issue queue entries, and a 4KB cache
• CMP is easier to design, has higher clock speed (?)
• SMT has higher utilization and higher single-thread IPC
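The numbers on this slide describe the same total hardware split two ways. The check below assumes four threads in total (two favored, two not, matching the four front ends in the pipeline figures); the dictionary keys are names chosen for this sketch.

```python
# Per-thread resources taken from the slide; four threads is an assumption.
cmp_per_thread = {"regs": 50, "window": 18, "iq": 10, "dcache_kb": 8}
smt_favored = {"regs": 60, "window": 28, "iq": 15, "dcache_kb": 12}
smt_other = {"regs": 40, "window": 8, "iq": 5, "dcache_kb": 4}


def total(shares):
    """Sum each resource over a list of per-thread allocations."""
    return {key: sum(share[key] for share in shares) for key in shares[0]}


cmp_total = total([cmp_per_thread] * 4)
smt_total = total([smt_favored] * 2 + [smt_other] * 2)

print(cmp_total)
print(smt_total)
```

Both totals come to 200 registers, 72 window entries, 40 issue queue entries, and 32KB of data cache: SMT redistributes a fixed budget unevenly across threads, while CMP fixes an equal share per thread.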

Impact on Clock Speed
• To support four threads, you need four times as many registers, and potentially larger queues and caches; in other words, the execution engine is much larger
• The execution engine can be clustered
• SMT and CMP start looking very similar
• SMT allows a thread to span multiple clusters
• SMT improves performance by investing in interconnects between processing units

Clustered SMT
[Figure: four front ends feeding clustered exec engines]

Evaluated Models
• Fine-Grained Multithreading
• Unrestricted SMT
• Restricted SMT
  • X-issue: a thread can only issue up to X instrs in a cycle
  • Limited connection: each thread is tied to a fixed FU
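The X-issue restriction can be sketched as a per-thread cap applied while filling the issue slots of one cycle. The function below is a hypothetical illustration, not the paper's simulator: it assumes threads are served in list order as a simple priority scheme, and it does not model the limited-connection variant.

```python
def restricted_smt_issue(ready_per_thread, width, per_thread_cap):
    """Fill one cycle's issue slots, letting each thread issue at most per_thread_cap."""
    issued = []
    slots_left = width
    # Assumption: earlier threads in the list have higher priority.
    for ready in ready_per_thread:
        n = min(ready, per_thread_cap, slots_left)
        issued.append(n)
        slots_left -= n
    return issued


# Hypothetical ready counts for three threads in one cycle.
ready = [5, 2, 1]
print(restricted_smt_issue(ready, width=8, per_thread_cap=2))  # dual-issue per thread
print(restricted_smt_issue(ready, width=8, per_thread_cap=8))  # cap == width: unrestricted
```

Setting the cap equal to the issue width recovers the unrestricted model, where a single thread may take every slot it can use.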

Results
• SMT nearly eliminates horizontal waste
• In spite of priorities, single-thread performance degrades (cache contention)
• Not much difference between private and shared caches; however, with few threads, the private caches go under-utilized

Comparison of Models

CMP vs. SMT

Next Class’s Paper
• “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor”, D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.M. Stamm, Proceedings of ISCA-23, May 1996
