Maximizing On-Chip Parallelism with Simultaneous Multithreading (SMT)

HY425 – Computer Architecture
Lecture 10
Dimitris Nikolopoulos, Associate Professor
Department of Computer Science, University of Crete
http://www.csd.uoc.gr/~hy425

For most apps, most execution units lie idle
[Figure: issue-slot utilization for an 8-way superscalar.]
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.

Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a program
• Could a processor oriented toward ILP also exploit TLP?
  • Functional units are often idle in a datapath designed for ILP, because of either stalls or dependences in the code
• Could TLP be used as a source of independent instructions that keep the processor busy during stalls?
• Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

Simultaneous Multi-threading
[Figure: issue slots per cycle across 8 execution units (M M FX FX FP FP BR CC). Left panel: one thread, 8 units. Right panel: two threads, 8 units.]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Simultaneous Multithreading (SMT)
• Simultaneous multithreading (SMT) builds on the insight that a dynamically scheduled processor already has many of the HW mechanisms needed to support multithreading
  • A large set of virtual registers that can hold the register sets of independent threads
  • Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  • Out-of-order completion allows the threads to execute out of order and get better utilization of the HW
• SMT requires just adding a per-thread renaming table and keeping separate PCs
  • Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
Source: Microprocessor Report, December 6, 1999, “Compaq Chooses SMT for Alpha”
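The per-thread renaming table described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration (the class name `SharedRenamer`, its parameters, the in-order free list); it is not the Alpha or Power5 design, and it omits freeing registers at commit.

```python
class SharedRenamer:
    """Toy renamer: one shared pool of physical registers, one map table per thread."""

    def __init__(self, num_threads, num_arch_regs, num_phys_regs):
        if num_phys_regs < num_threads * num_arch_regs:
            raise ValueError("need one physical register per architected register per thread")
        self.free_list = list(range(num_phys_regs))
        # Each thread gets its own map: architected register -> physical register.
        self.map_table = [
            {r: self.free_list.pop(0) for r in range(num_arch_regs)}
            for _ in range(num_threads)
        ]

    def rename(self, thread, dest, srcs):
        """Rename one instruction; returns (phys_dest, phys_srcs)."""
        table = self.map_table[thread]
        phys_srcs = [table[s] for s in srcs]  # read sources before remapping dest
        if not self.free_list:
            raise RuntimeError("out of physical registers: rename would stall")
        phys_dest = self.free_list.pop(0)
        table[dest] = phys_dest
        return phys_dest, phys_srcs


renamer = SharedRenamer(num_threads=2, num_arch_regs=4, num_phys_regs=12)
# Both threads write architected register 1 and read registers 2 and 3,
# yet they end up with distinct physical destinations and sources.
print(renamer.rename(thread=0, dest=1, srcs=[2, 3]))
print(renamer.rename(thread=1, dest=1, srcs=[2, 3]))
```

Because the map tables are separate while the physical pool is shared, instructions from both threads can sit in the same issue queue without any thread tag on their register operands.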

Multithreaded Categories
[Figure: issue slots over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]
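The difference between these categories can be sketched by counting filled issue slots under each policy. This is a toy model under stated assumptions: `ready[t][c]` is a made-up number of instructions thread `t` could issue in cycle `c`, unissued instructions do not carry over, and a coarse-grained switch costs exactly the stalled cycle. The function name `issued_slots` and the sample data are hypothetical.

```python
def issued_slots(ready, width, policy):
    """Count issue slots filled over all cycles under a given threading policy."""
    num_threads = len(ready)
    num_cycles = len(ready[0])
    filled = 0
    current = 0  # thread that owns the pipeline (coarse-grained only)
    for c in range(num_cycles):
        if policy == "superscalar":
            # Single thread: only thread 0 ever issues.
            filled += min(width, ready[0][c])
        elif policy == "coarse":
            if ready[current][c] == 0:
                # Stall: switch threads and lose this cycle.
                current = (current + 1) % num_threads
            else:
                filled += min(width, ready[current][c])
        elif policy == "fine":
            # One thread per cycle, round-robin, skipping stalled threads.
            for k in range(num_threads):
                t = (c + k) % num_threads
                if ready[t][c] > 0:
                    filled += min(width, ready[t][c])
                    break
        elif policy == "smt":
            # Any thread may fill any slot in the same cycle.
            filled += min(width, sum(thread[c] for thread in ready))
        else:
            raise ValueError(f"unknown policy: {policy}")
    return filled


ready = [[2, 0, 0, 0, 2, 1],
         [1, 2, 1, 2, 0, 0]]
for policy in ("superscalar", "coarse", "fine", "smt"):
    print(policy, issued_slots(ready, width=4, policy=policy))
```

With this sample data the filled-slot count rises from superscalar to coarse-grained to fine-grained to SMT, which mirrors the qualitative picture in the figure; the absolute numbers carry no meaning beyond the toy input.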

Design Challenges in SMT
• Since SMT makes sense only with a fine-grain implementation, what is the impact of fine-grain scheduling on single-thread performance?
  • Could a preferred-thread approach sacrifice neither throughput nor single-thread performance?
  • Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
• A larger register file is needed to hold multiple contexts
• The clock cycle time must not be affected, especially in:
  • Instruction issue: more candidate instructions need to be considered
  • Instruction completion: choosing which instructions to commit may be challenging
• Cache and TLB conflicts generated by SMT must not degrade performance
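A preferred-thread issue policy can be sketched as follows. This is an illustrative assumption, not a description of any real processor: the hypothetical function `select_issue` gives the preferred thread first pick of the issue slots and lets the remaining threads fill whatever is left.

```python
def select_issue(ready, width, preferred):
    """Return how many instructions each thread issues this cycle.

    ready[t] is the number of ready instructions in thread t (made-up input).
    """
    issued = [0] * len(ready)
    # The preferred thread takes as many slots as it can use.
    issued[preferred] = min(width, ready[preferred])
    left = width - issued[preferred]
    # Other threads share the leftover slots in index order.
    for t, n in enumerate(ready):
        if t == preferred or left == 0:
            continue
        issued[t] = min(left, n)
        left -= issued[t]
    return issued


print(select_issue([3, 4], width=4, preferred=0))  # preferred thread has work
print(select_issue([0, 4], width=4, preferred=0))  # preferred thread is stalled
```

This sketch only covers issue selection. The throughput loss mentioned above arises earlier in the pipeline, since favoring one thread leaves fewer instructions from the other threads ready to fill slots when the preferred thread stalls; that effect is not modeled here.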

Power 4
Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each may issue an instruction each cycle.

Power 4 vs. Power 5
[Figure: Power 4 and Power 5 pipelines. Annotations: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets).]

Power 5 data flow
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Power 5 thread performance
Relative priority of each thread is controllable in hardware. For balanced operation, both threads run slower than if they “owned” the machine.

Changes in Power 5 to support SMT
• Increased associativity of the L1 instruction cache and the instruction address translation buffers
• Added per-thread load and store queues
• Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
• Added separate instruction prefetch and buffering per thread
• Increased the number of virtual registers from 152 to 240
• Increased the size of several issue queues
• The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support
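For scale, the relative growth of the two resources quoted above can be worked out directly. The helper `growth` is a hypothetical name introduced only for this arithmetic; the input figures are the ones on the slide.

```python
def growth(old, new):
    """Relative increase from old to new, as a fraction (0.25 means +25%)."""
    return (new - old) / old


registers = growth(152, 240)   # virtual registers, Power4 -> Power5
l2_cache = growth(1.44, 1.92)  # L2 capacity in MB
print(f"registers: +{registers:.0%}, L2: +{l2_cache:.0%}")
# prints "registers: +58%, L2: +33%"
```

Both increases are larger in relative terms than the roughly 24% growth in core area quoted on the slide.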

Initial Performance of SMT
• Pentium 4 Extreme SMT yields a 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
  • Pentium 4 is dual-threaded SMT
  • SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running on Pentium 4 with each of 26 SPEC benchmarks paired with every other (26² = 676 runs), speedups ranged from 0.90 to 1.58; the average was 1.20
• Power 5, 8-processor server: 1.23 times faster for SPECint_rate with SMT, 1.16 times faster for SPECfp_rate
• Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  • Most gained some
  • Fl. Pt. apps had the most cache conflicts and the least gains

Head to Head ILP competition

Performance on SPECint2000

Performance on SPECfp2000

Normalized Performance: Efficiency

No Silver Bullet for ILP
• No obvious overall leader in performance
• The AMD Athlon leads on SPECInt performance, followed by the Pentium 4, Itanium 2, and Power5
• Itanium 2 and Power5, which perform similarly on SPECFP, clearly dominate the Athlon and Pentium 4 on SPECFP
• Itanium 2 is the most inefficient processor for both Fl. Pt. and integer code on all but one efficiency measure (SPECFP/Watt)
• Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency
• IBM Power5 is the most effective user of energy on SPECFP and essentially tied on SPECINT
