Maximizing On-Chip Parallelism Through Simultaneous Multithreading

Understand how simultaneous multithreading maximizes on-chip parallelism and reduces wasted cycles. Explore its impact on processor utilization and performance compared with other threading models and chip multiprocessors.

Presentation Transcript


  1. CS 7960-4 Lecture 18
     Simultaneous Multithreading: Maximizing On-Chip Parallelism
     D.M. Tullsen, S.J. Eggers, H.M. Levy
     Proceedings of ISCA-22, June 1995

  2. Processor Under-Utilization
     • Wide gap between average processor utilization and peak processor utilization
     • Caused by dependences, long-latency instrs, and branch mispredicts
     • Results in many idle cycles for many structures

  3. Superscalar Utilization
     [Figure: issue slots used by Thread-1; axes are time and resources (e.g. FUs); V waste and H waste marked]
     • Suffers from horizontal waste (can’t find enough work in a cycle) and vertical waste (because of dependences, there is nothing to do for many cycles)
     • Utilization = 19%
     • Vertical:horizontal waste = 61:39
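The two kinds of waste on this slide can be computed from a per-cycle issue trace. The sketch below is only an illustration: the function name `waste_breakdown`, the 4-wide issue width, and the trace values are assumptions made for the example, not data from the paper. A cycle that issues nothing counts entirely as vertical waste; a cycle that issues something but not a full width contributes its empty slots to horizontal waste.

```python
def waste_breakdown(issued_per_cycle, issue_width):
    """Split issue slots into used, vertically wasted, and horizontally wasted.

    issued_per_cycle: number of instructions issued in each cycle (hypothetical trace).
    issue_width: maximum instructions the machine can issue per cycle.
    """
    total_slots = len(issued_per_cycle) * issue_width
    used = sum(issued_per_cycle)
    # Vertical waste: every slot of a cycle in which nothing issued.
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    # Horizontal waste: the unused slots of a cycle in which something issued.
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    return {
        "utilization": used / total_slots,
        "vertical_waste": vertical / total_slots,
        "horizontal_waste": horizontal / total_slots,
    }


# Made-up trace for a 4-wide machine over 8 cycles.
trace = [2, 0, 0, 1, 4, 0, 3, 0]
stats = waste_breakdown(trace, issue_width=4)
print(stats)
```

The three fractions always sum to 1, so a figure such as the 19% utilization above implies that the remaining 81% of slots are split between the two waste categories.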

  4. Chip Multiprocessors
     [Figure: issue slots used by Thread-1 and Thread-2; axes are time and resources (e.g. FUs); V waste and H waste marked]
     • Single-thread performance goes down
     • Horizontal waste reduces

  5. Fine-Grain Multithreading
     [Figure: issue slots used by Thread-1 and Thread-2; axes are time and resources (e.g. FUs); V waste and H waste marked]
     • Low-cost context-switch at a fine grain
     • Reduces vertical waste

  6. Simultaneous Multithreading
     [Figure: issue slots used by Thread-1 and Thread-2; axes are time and resources (e.g. FUs); V waste and H waste marked]
     • Reduces vertical and horizontal waste
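The difference between fine-grain multithreading and SMT can be seen in a toy issue model. This is a rough sketch under loud assumptions: each thread has a fixed, made-up number of ready instructions per cycle (nothing carries over between cycles), the machine is 4-wide, and the function names are hypothetical. It is not the simulator used in the paper.

```python
def issue_single_thread(ready, width):
    """Superscalar baseline: only thread 0 may issue."""
    return [min(n, width) for n in ready[0]]


def issue_fine_grain(ready, width):
    """One thread per cycle: take the first non-stalled thread in round-robin order."""
    n_threads, n_cycles = len(ready), len(ready[0])
    issued = []
    for c in range(n_cycles):
        count = 0
        for k in range(n_threads):
            t = (c + k) % n_threads
            if ready[t][c] > 0:
                count = min(ready[t][c], width)
                break
        issued.append(count)
    return issued


def issue_smt(ready, width):
    """All threads share the cycle: fill slots thread by thread until the width is used."""
    issued = []
    for c in range(len(ready[0])):
        free = width
        for thread in ready:
            free -= min(thread[c], free)
        issued.append(width - free)
    return issued


# ready[t][c] = instructions thread t could issue in cycle c (made-up values).
ready = [[2, 0, 0, 1],
         [1, 3, 0, 2]]
for name, fn in [("superscalar", issue_single_thread),
                 ("fine-grain", issue_fine_grain),
                 ("SMT", issue_smt)]:
    print(name, fn(ready, 4))
```

In this toy trace, fine-grain multithreading fills the cycle where thread 0 is stalled (less vertical waste), while SMT also packs both threads into the same cycle (less horizontal waste). The cycle in which no thread has work stays empty in every model.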

  7. Pipeline Structure
     [Diagram: four front ends feeding one execution engine]
     • Private/shared front-end: I-Cache, Bpred
     • Private front-end: Rename, ROB
     • Shared exec engine: Regs, IQ, DCache, FUs

  8. Chip Multi-Processor
     [Diagram: four front ends, each feeding its own exec engine]
     • Private front-end: I-Cache, Bpred
     • Private front-end: Rename, ROB
     • Private exec engine: Regs, IQ, DCache, FUs

  9. SMT vs. CMP
     • In CMP, each thread has 50 regs, an in-flight window of 18, 10 issue queue entries, and an 8KB data cache
     • In SMT, two threads can have 60 regs, an in-flight window of 28, 15 issue queue entries, and 12KB of cache, while the other two threads have 40 regs, a window of 8, 5 issue queue entries, and 4KB of cache
     • CMP is easier to design, has higher clock speed (?)
     • SMT has higher utilization and higher single-thread IPC
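The numbers on this slide describe the same total hardware budget divided in two ways. The check below assumes four threads (as in the four-front-end diagrams) and reads the SMT bullet as two "large" shares plus two "small" shares; the dictionary keys are names chosen for this sketch.

```python
THREADS = 4  # assumption: four hardware threads, matching the four front ends

# Per-thread resources in the CMP, taken from the slide.
cmp_share = {"regs": 50, "window": 18, "iq_entries": 10, "dcache_kb": 8}

# Uneven SMT split from the slide: two threads get the large share, two the small one.
smt_large = {"regs": 60, "window": 28, "iq_entries": 15, "dcache_kb": 12}
smt_small = {"regs": 40, "window": 8, "iq_entries": 5, "dcache_kb": 4}

cmp_total = {k: THREADS * v for k, v in cmp_share.items()}
smt_total = {k: 2 * smt_large[k] + 2 * smt_small[k] for k in smt_large}

print("CMP total:", cmp_total)
print("SMT total:", smt_total)
```

Both totals come out the same, which is the point of the comparison: the CMP fixes each thread at one quarter of the budget, whereas SMT lets a thread that can use more resources take them from one that cannot.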

  10. Impact on Clock Speed
     • To support four threads, you need four times as many registers, and potentially larger queues and caches; in other words, the execution engine is much larger
     • The execution engine can be clustered
     • SMT and CMP start looking very similar
     • SMT allows a thread to span multiple clusters
     • SMT improves performance by investing in interconnects between processing units

  11. Clustered SMT
     [Diagram: four front ends connected to multiple exec engines]

  12. Evaluated Models
     • Fine-Grained Multithreading
     • Unrestricted SMT
     • Restricted SMT
        • X-issue: a thread can only issue up to X instrs in a cycle
        • Limited connection: each thread is tied to a fixed FU
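The X-issue restriction can be sketched as a per-thread cap on top of a shared issue width. As before, this is a toy model with assumptions stated up front: ready-instruction counts are made up, threads are served in a fixed priority order, and `issue_x_restricted` is a hypothetical name. Setting the cap equal to the issue width gives the unrestricted case.

```python
def issue_x_restricted(ready, width, per_thread_cap):
    """Issue from all threads each cycle, but no thread may take more than per_thread_cap slots."""
    issued = []
    for c in range(len(ready[0])):
        free = width
        for thread in ready:  # fixed priority order (an assumption of this sketch)
            take = min(thread[c], per_thread_cap, free)
            free -= take
        issued.append(width - free)
    return issued


# ready[t][c] = instructions thread t could issue in cycle c (made-up values).
ready = [[3, 0],
         [2, 4]]
print("unrestricted:", issue_x_restricted(ready, width=4, per_thread_cap=4))
print("dual issue:  ", issue_x_restricted(ready, width=4, per_thread_cap=2))
print("single issue:", issue_x_restricted(ready, width=4, per_thread_cap=1))
```

With a tight cap, a cycle in which only one thread has work can no longer be filled, so the restricted models depend on having enough threads to cover the issue width.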

  13. Results
     • SMT nearly eliminates horizontal waste
     • In spite of priorities, single-thread performance degrades (cache contention)
     • Not much difference between private and shared caches; however, with few threads, the private caches go under-utilized

  14. Comparison of Models

  15. CMP vs. SMT

  16. Next Class’ Paper
     • “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor”, D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.M. Stamm, Proceedings of ISCA-23, May 1996

