
Maximizing On-Chip Parallelism with Simultaneous Multithreading (SMT)

This lecture examines how poorly the functional units of wide-issue processors are utilized and how Simultaneous Multithreading (SMT) recovers that lost on-chip parallelism by issuing instructions from several threads in the same cycle. It covers the hardware mechanisms that make SMT cheap to add to a dynamically scheduled core, the main design challenges, the changes IBM made in the Power 5 to support two threads, and measured SMT performance on SPEC workloads, closing with a head-to-head comparison of contemporary processors.

Presentation Transcript


1. HY425 – Computer Architecture, Lecture 10. Dimitris Nikolopoulos, Associate Professor, Department of Computer Science, University of Crete. http://www.csd.uoc.gr/~hy425

2. For most applications, most execution units lie idle. Data are for an 8-way superscalar. From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.

3. Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a program
• Could a processor oriented toward ILP also exploit TLP?
• Functional units are often idle in a datapath designed for ILP because of stalls or dependences in the code
• Could TLP be used as a source of independent instructions that keep the processor busy during stalls?
• Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists? (A software-level sketch follows below.)
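To make the "TLP fills idle units" idea concrete at the software level, here is a minimal C++ sketch (my own illustration, not from the lecture): two threads with complementary instruction mixes, one integer-heavy and one floating-point-heavy. On an SMT core the hardware can interleave their independent instructions into functional units the other thread leaves idle; the loop bodies and iteration counts are arbitrary illustrative choices.

```cpp
// Sketch: two independent threads with complementary instruction mixes.
// On an SMT core, the integer-heavy and FP-heavy streams can share a
// cycle's issue slots, using units the other thread leaves idle.
// Build with: g++ -O2 -pthread smt_mix.cpp
#include <cstdint>
#include <cstdio>
#include <thread>

// Integer-heavy loop: a chain of fixed-point multiply/add operations.
std::uint64_t int_work(std::uint64_t n) {
    std::uint64_t acc = 1;
    for (std::uint64_t i = 1; i <= n; ++i)
        acc = acc * 2862933555777941757ULL + i;
    return acc;
}

// FP-heavy loop: a chain of floating-point multiply/add operations.
double fp_work(std::uint64_t n) {
    double acc = 1.0;
    for (std::uint64_t i = 1; i <= n; ++i)
        acc = acc * 1.0000001 + 0.5;
    return acc;
}

int main() {
    std::uint64_t i_result = 0;
    double f_result = 0.0;

    // Two OS threads; if both land on the same SMT core, their instructions
    // compete only weakly because they mostly use different functional units.
    std::thread t1([&] { i_result = int_work(100000000ULL); });
    std::thread t2([&] { f_result = fp_work(100000000ULL); });
    t1.join();
    t2.join();

    std::printf("int result: %llu, fp result: %f\n",
                (unsigned long long)i_result, f_result);
    return 0;
}
```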

4. Simultaneous Multi-threading
[Figure: issue-slot diagrams, cycle by cycle across 8 units, for “One thread, 8 units” vs. “Two threads, 8 units”; with two threads far fewer slots are left empty.]
Legend: M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
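The slot diagram above can be approximated with a crude probabilistic toy model (my own illustration; the 40% per-slot readiness is an assumed parameter, not measured data). It estimates how many of the 8 issue slots get used per cycle with one thread versus two SMT threads sharing them.

```cpp
// Toy model of the issue-slot diagram: per cycle, each thread can fill only
// some of the 8 slots (limited ILP); with SMT a second thread fills some of
// the slots the first one leaves empty.
#include <cstdio>
#include <random>

int main() {
    constexpr int kSlots = 8, kCycles = 100000;
    std::mt19937 rng(42);
    std::bernoulli_distribution ready(0.4);   // assumed: 40% chance a thread
                                              // has an instruction for a slot
    long one_thread = 0, two_threads = 0;
    for (int c = 0; c < kCycles; ++c) {
        for (int s = 0; s < kSlots; ++s) {
            bool t0 = ready(rng), t1 = ready(rng);
            one_thread  += t0;             // superscalar: only thread 0 issues
            two_threads += (t0 || t1);     // SMT: slot used if either thread can
        }
    }
    std::printf("1 thread : %.1f%% of issue slots used\n",
                100.0 * one_thread / (kSlots * (double)kCycles));
    std::printf("2 threads: %.1f%% of issue slots used\n",
                100.0 * two_threads / (kSlots * (double)kCycles));
    return 0;
}
```

Under these assumptions utilization rises from about 40% to about 64%; the real gain depends on how correlated the two threads' stalls and instruction mixes are.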

5. Simultaneous Multithreading (SMT)
• Key insight: a dynamically scheduled processor already has many of the hardware mechanisms needed to support multithreading
• A large set of virtual registers can be used to hold the register sets of independent threads
• Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
• Out-of-order completion allows the threads to execute out of order and achieve better utilization of the hardware
• All that is needed is a per-thread renaming table and separate PCs (a minimal renaming sketch follows below)
• Independent commit can be supported by logically keeping a separate reorder buffer for each thread
Source: Microprocessor Report, December 6, 1999, “Compaq Chooses SMT for Alpha”
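A minimal sketch of the per-thread renaming idea, assuming 32 architected registers per thread and a 128-entry shared physical register file (both sizes are illustrative, not the Alpha or Power design): each thread keeps its own architectural-to-physical map, while all threads allocate from one shared free list, so same-numbered registers from different threads never collide.

```cpp
// Sketch of per-thread register renaming over a shared physical register file.
// Each thread has its own architectural->physical map, so instructions from
// different threads can be mixed in the datapath without confusing operands.
#include <cstdio>
#include <optional>
#include <vector>

constexpr int kArchRegs = 32;    // architected registers per thread (assumed)
constexpr int kPhysRegs = 128;   // shared physical registers (assumed)
constexpr int kThreads  = 2;

struct Renamer {
    std::vector<std::vector<int>> map;   // [thread][arch reg] -> phys reg
    std::vector<int> free_list;          // shared pool of physical registers

    Renamer() : map(kThreads, std::vector<int>(kArchRegs, -1)) {
        for (int p = kPhysRegs - 1; p >= 0; --p) free_list.push_back(p);
    }

    // Rename the destination of one instruction from thread `tid`:
    // allocate a fresh physical register and update only that thread's map.
    std::optional<int> rename_dest(int tid, int arch_reg) {
        if (free_list.empty()) return std::nullopt;   // structural stall
        int phys = free_list.back();
        free_list.pop_back();
        map[tid][arch_reg] = phys;
        return phys;
    }

    // Source operands are looked up in the issuing thread's own map.
    int lookup_src(int tid, int arch_reg) const { return map[tid][arch_reg]; }
};

int main() {
    Renamer r;
    int p0 = *r.rename_dest(/*tid=*/0, /*arch_reg=*/5);  // thread 0 writes r5
    int p1 = *r.rename_dest(/*tid=*/1, /*arch_reg=*/5);  // thread 1 writes r5
    std::printf("thread 0 r5 -> p%d, thread 1 r5 -> p%d (distinct)\n", p0, p1);
    return 0;
}
```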

6. Multithreaded Categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, simultaneous multithreading, and multiprocessing; shading distinguishes Threads 1–5 and idle slots.]
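The categories in the figure can be summarized with a purely schematic sketch (hand-picked slot patterns, not real traces): each row is a cycle, each character is one of 4 issue slots, '1'/'2' mark which thread fills the slot, and '.' marks an idle slot.

```cpp
// Schematic of issue-slot occupancy for four of the figure's categories.
// Superscalar: one thread, horizontal and vertical waste.
// Fine-grained: a different thread each cycle, but only one thread per cycle.
// Coarse-grained: switch threads only on long stalls.
// SMT: instructions from both threads can share the same cycle.
#include <cstdio>

int main() {
    const int kCycles = 6;
    const char* superscalar[kCycles] = {"11..", "1...", "....", "11..", "1...", "...."};
    const char* fine[kCycles]        = {"11..", "22..", "1...", "2...", "11..", "22.."};
    const char* coarse[kCycles]      = {"11..", "1...", "11..", "22..", "2...", "22.."};
    const char* smt[kCycles]         = {"1122", "112.", "1222", "1122", "12..", "1122"};

    const char*  names[4] = {"Superscalar", "Fine-grained", "Coarse-grained", "SMT"};
    const char** grids[4] = {superscalar, fine, coarse, smt};
    for (int g = 0; g < 4; ++g) {
        std::printf("%s:\n", names[g]);
        for (int c = 0; c < kCycles; ++c)
            std::printf("  cycle %d: %s\n", c, grids[g][c]);
    }
    return 0;
}
```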

7. Design Challenges in SMT
• Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
• Would a preferred-thread approach sacrifice neither throughput nor single-thread performance?
• Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
• A larger register file is needed to hold multiple contexts
• The clock cycle time must not be affected, especially in:
  • Instruction issue: more candidate instructions need to be considered
  • Instruction completion: choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance

8. Power 4: the single-threaded predecessor to the Power 5, with 8 execution units in the out-of-order engine, each of which may issue an instruction every cycle.

9. Power 4 vs. Power 5 pipelines: the Power 5 adds 2 fetch stages (2 PCs) and 2 initial decodes, and supports 2 commits (2 architected register sets), one per thread.

10. Power 5 data flow. Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would likely become a bottleneck (a rough register-budget calculation follows below).
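A rough back-of-the-envelope on the register budget makes the two-thread choice concrete. The figures below are assumptions for illustration (roughly $P = 120$ physical integer registers, $A = 32$ architected GPRs per thread, $T$ threads), not exact Power 5 numbers:

```latex
\text{rename headroom} = P - T \cdot A,
\qquad
T = 2:\; 120 - 2 \cdot 32 = 56 \ \text{registers left for renaming},
\qquad
T = 4:\; 120 - 4 \cdot 32 = -8 \ \text{(the architected state alone no longer fits)}.
```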

11. Power 5 thread performance. The relative priority of each thread is controllable in hardware. For balanced operation, both threads run slower than if each “owned” the machine.

12. Changes in Power 5 to support SMT
• Increased associativity of the L1 instruction cache and the instruction address translation buffers
• Added per-thread load and store queues
• Increased the size of the L2 (1.92 vs. 1.44 MB) and L3 caches
• Added separate instruction prefetch and buffering per thread
• Increased the number of virtual registers from 152 to 240
• Increased the size of several issue queues
• The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

13. Initial Performance of SMT
• The Pentium 4 Extreme is a dual-threaded SMT; SMT yields a 1.01 speedup on SPECint_rate and 1.07 on SPECfp_rate
• SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running each of the 26 SPEC benchmarks on the Pentium 4 paired with every other benchmark (26² runs) gives speedups from 0.90 to 1.58; the average is 1.20
• An 8-processor Power 5 server is 1.23 times faster on SPECint_rate with SMT and 1.16 times faster on SPECfp_rate
• The Power 5 running 2 copies of each application shows speedups between 0.89 and 1.41
• Most applications gained some speedup
• Floating-point applications had the most cache conflicts and the least gains
(The speedup metric behind these figures is spelled out below.)
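For the figures on this slide, the reported SMT speedup is simply the throughput ratio on the same hardware with SMT enabled versus disabled; for example, for the Power 5 SPECint_rate number:

```latex
\text{SMT speedup} \;=\; \frac{\text{throughput}_{\text{SMT on}}}{\text{throughput}_{\text{SMT off}}},
\qquad
\frac{\text{SPECint\_rate}_{\text{SMT on}}}{\text{SPECint\_rate}_{\text{SMT off}}} \;=\; 1.23 .
```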

14. Head-to-Head ILP Competition

15. Performance on SPECint2000

16. Performance on SPECfp2000

17. Normalized Performance: Efficiency

18. No Silver Bullet for ILP
• No obvious overall leader in performance
• The AMD Athlon leads on SPECint performance, followed by the Pentium 4, Itanium 2, and Power5
• The Itanium 2 and Power5, which perform similarly on SPECFP, clearly dominate the Athlon and Pentium 4 on SPECFP
• The Itanium 2 is the most inefficient processor on both floating-point and integer code for all but one efficiency measure (SPECFP/Watt)
• The Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency
• The IBM Power5 is the most effective user of energy on SPECFP and essentially tied on SPECINT
