CS136, Advanced Architecture
Limits to ILP / Simultaneous Multithreading
Outline
• Limits to ILP (another perspective)
• Thread-Level Parallelism
• Multithreading
• Simultaneous Multithreading
• Power 4 vs. Power 5
• Head to Head: VLIW vs. Superscalar vs. SMT
• Commentary
• Conclusion
Limits to ILP
• Conflicting studies of amount
– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to stay on the processor-performance curve?
– Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
– Intel SSE2: 128-bit, including 2 × 64-bit FP per clock (see the SIMD sketch below)
– Motorola AltiVec: 128-bit ints and FPs
– SuperSPARC multimedia ops, etc.
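To make the multimedia-extension point concrete, here is a minimal SSE2 sketch in C (intrinsic names are Intel's, from emmintrin.h; compile for x86 with SSE2 enabled, e.g. gcc -msse2). A single addpd instruction performs two 64-bit floating-point adds, which is exactly the data-level parallelism these ISA extensions expose:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    /* Two 128-bit vectors, each holding two 64-bit doubles */
    __m128d a = _mm_set_pd(2.0, 1.0);   /* lanes: {1.0, 2.0} */
    __m128d b = _mm_set_pd(4.0, 3.0);   /* lanes: {3.0, 4.0} */

    /* One addpd instruction performs both 64-bit FP adds */
    __m128d c = _mm_add_pd(a, b);

    double out[2];
    _mm_storeu_pd(out, c);
    printf("%.1f %.1f\n", out[0], out[1]);   /* prints: 4.0 6.0 */
    return 0;
}
```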
Overcoming Limits
• Advances in compiler technology plus significantly new and different hardware techniques may be able to overcome the limitations assumed in these studies
• However, such advances, when coupled with realistic hardware, are unlikely to overcome these limits in the near future
Limits to ILP
Initial HW model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming – infinite virtual registers ⇒ all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted (returns, case statements)
   2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses known & a load can be moved before a store provided the addresses are not equal
   1 & 4 ⇒ all hazards eliminated except RAW
Also: perfect caches; 1-cycle latency for all instructions (even FP multiply and divide); unlimited instructions issued per clock cycle
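As a concrete illustration of assumption 1, the toy C sketch below (hypothetical instruction encoding and register counts, not from the lecture) renames every destination to a fresh physical register. WAW and WAR hazards disappear because no physical register is ever written twice, while true RAW dependences are preserved through the mapping:

```c
#include <stdio.h>

#define NUM_ARCH 8

/* A toy three-address instruction: dst = src1 op src2 */
typedef struct { int dst, src1, src2; } Insn;

int main(void) {
    /* r1 = r2+r3; r4 = r1+r5; r1 = r6+r7  (WAW and WAR on r1) */
    Insn prog[] = { {1,2,3}, {4,1,5}, {1,6,7} };
    int n = sizeof prog / sizeof prog[0];

    int map[NUM_ARCH];          /* arch reg -> current physical reg */
    int next_phys = NUM_ARCH;   /* p0..p7 hold the initial values */
    for (int r = 0; r < NUM_ARCH; r++) map[r] = r;

    for (int i = 0; i < n; i++) {
        /* Sources read the current mapping (preserves RAW) */
        int s1 = map[prog[i].src1], s2 = map[prog[i].src2];
        /* Each write gets a fresh physical reg: WAW/WAR vanish */
        map[prog[i].dst] = next_phys++;
        printf("p%d = p%d op p%d\n", map[prog[i].dst], s1, s2);
    }
    return 0;
}
```

Running it prints p8 = p2 op p3; p9 = p8 op p5; p10 = p6 op p7: the second write to r1 lands in p10 and no longer conflicts with the first two instructions.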
Upper Limit to ILP: Ideal Machine (Figure 3.1)
[Chart: instructions per clock on the ideal machine. FP programs: 75–150 IPC; integer programs: 18–60 IPC]
More Realistic HW: Window Impact (Figure 3.2)
[Chart: IPC as the window shrinks from infinite to 2048, 512, 128, and 32 entries. FP: 9–150; integer: 8–63]
More Realistic HW: Branch Impact (Figure 3.3)
[Chart: IPC with a 2048-entry window and maximum issue of 64 instructions per clock, for predictors ranging over perfect, tournament, BHT (512), profile-based, and no prediction. FP: 15–45; integer: 6–12]
Misprediction Rates
[Chart: misprediction rates for the branch predictors above]
More Realistic HW: Renaming Register Impact (N int + N fp) (Figure 3.5)
[Chart: IPC with a 2048-instruction window, 64-instruction issue, and 8K two-level prediction, as rename registers vary over infinite, 256, 128, 64, 32, and none. FP: 11–45; integer: 5–15]
More Realistic HW: Memory-Address Alias Impact (Figure 3.6)
[Chart: IPC with a 2048-instruction window, 64-instruction issue, 8K two-level prediction, and 256 renaming registers, for alias analysis ranging over perfect, global/stack perfect (heap conflicts), compiler inspection, and none. FP: 4–45 (Fortran, no heap); integer: 4–9]
Realistic HW: Window Impact (Figure 3.7)
[Chart: IPC with perfect HW disambiguation, 1K selective prediction, 16-entry return stack, 64 renaming registers, and issue of as many instructions as the window allows, for windows of infinite, 256, 128, 64, 32, 16, 8, and 4 entries. FP: 8–45; integer: 6–12]
Outline
• Limits to ILP (another perspective)
• Thread-Level Parallelism
• Multithreading
• Simultaneous Multithreading
• Power 4 vs. Power 5
• Head to Head: VLIW vs. Superscalar vs. SMT
• Commentary
• Conclusion
How to Exceed ILP Limits of This Study?
• These are not laws of physics
– Just practical limits for today
– Could be overcome via research
• Compiler and ISA advances could change results
• WAR and WAW hazards through memory: renaming eliminated WAW and WAR hazards on registers, but not on memory locations
– Can get conflicts via allocation of stack frames
– Because a called procedure reuses the memory addresses of previous stack frames (see the sketch below)
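A minimal C sketch of the stack-frame point: two successive calls normally land their locals at the same stack address, so their stores form a WAW hazard through memory even though no register names conflict (the exact addresses printed are platform-dependent):

```c
#include <stdio.h>

static void *where;   /* records the address of the callee's local */

/* Each call writes its local; successive calls normally reuse the
 * same stack slot, creating a WAW dependence through memory. */
static void callee(int value) {
    volatile int local = value;   /* volatile: keep the store */
    where = (void *)&local;
}

int main(void) {
    callee(1);
    void *a = where;
    callee(2);                    /* typically the same address */
    void *b = where;
    printf("call 1 local at %p\ncall 2 local at %p\n", a, b);
    return 0;
}
```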
HW vs. SW to Increase ILP
• Memory disambiguation: HW best
• Speculation:
– HW best when dynamic branch prediction is better than compile-time prediction
– Exceptions easier for HW
– HW doesn't need bookkeeping or compensation code
– Very complicated to get right in SW
• Scheduling: SW can look ahead to schedule better
• Compiler independence: HW does not require a new compiler to run well
Performance Beyond Single-Thread ILP
• Much higher natural parallelism in some applications
– Database or scientific codes
– Explicit thread-level or data-level parallelism
• Thread: has own instructions and data
– May be part of a parallel program or an independent program
– Each thread has all the state (instructions, data, PC, register state, and so on) needed to execute
• Data-level parallelism: perform identical operations on lots of data
Thread-Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP is explicitly represented by multiple threads of execution that are inherently parallel
• Goal: use multiple instruction streams to improve
– Throughput of computers that run many programs
– Execution time of multithreaded programs
• TLP could be more cost-effective to exploit than ILP
Do Both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a program
• Could a processor oriented to ILP still exploit TLP?
– Functional units are often idle in a data path designed for ILP, because of either stalls or dependencies in the code
• Could TLP be used as a source of independent instructions that might keep the processor busy during stalls?
• Could TLP be used to employ functional units that would otherwise lie idle when insufficient ILP exists?
New Approach: Multithreaded Execution
• Multithreading: multiple threads share the functional units of one processor via overlapping
• Processor must duplicate the independent state of each thread
– Separate copy of register file, PC
– Separate page table if different process
• Memory sharing via virtual-memory mechanisms
– Already support multiple processes
• HW for fast thread switch
– Must be much faster than a full process switch (which is 100s to 1000s of clocks)
• When to switch?
– Alternate instruction per thread (fine grain): round robin
– When a thread is stalled (coarse grain), e.g., on a cache miss
Fine-Grained Multithreading
• Switches between threads on each instruction, interleaving execution of multiple threads
– Usually done round-robin, skipping stalled threads
– CPU must be able to switch threads every clock
• Advantage: can hide both short and long stalls
– Instructions from other threads are always available to execute
– Easy to insert on short stalls
• Disadvantage: slows individual threads
– A thread ready to execute without stalls will be delayed by instructions from other threads
• Used on Sun's Niagara (will see later)
Coarse-Grained Multithreading
• Switches threads only on costly stalls
– E.g., L2 cache misses
• Advantages
– Relieves need for very fast thread switching
– Doesn't slow the main thread
– Other threads only issue instructions when the main one would stall (for a long time) anyway
• Disadvantage: pipeline startup costs make it hard to hide throughput losses from shorter stalls
– Pipeline must be emptied or frozen on a stall, since the CPU issues instructions from only one thread
– New thread must fill the pipe before its instructions can complete
– Thus, better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
• Used in IBM AS/400
• Both switch policies are contrasted in the sketch below
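To contrast the two policies just described, here is a toy scheduler in C (hypothetical structures, and the coarse-grained pipeline-refill cost is omitted): the fine-grained loop rotates threads every cycle and skips stalled ones, while the coarse-grained loop stays with one thread until it hits a modeled L2 miss:

```c
#include <stdio.h>
#include <stdbool.h>

#define NTHREADS 3

typedef struct {
    const char *name;
    int stall_until;   /* cycle at which this thread becomes ready */
} Thread;

static bool ready(const Thread *t, int cycle) {
    return cycle >= t->stall_until;
}

int main(void) {
    Thread th[NTHREADS] = { {"A", 0}, {"B", 0}, {"C", 0} };

    /* Fine-grained: rotate every cycle, round-robin, skipping
     * stalled threads -- hides stalls but slows each thread. */
    th[1].stall_until = 3;                /* B stalls for cycles 0-2 */
    int cur = 0;
    for (int cycle = 0; cycle < 6; cycle++) {
        for (int tries = 0; tries < NTHREADS; tries++) {
            int cand = (cur + tries) % NTHREADS;
            if (ready(&th[cand], cycle)) {
                printf("fine   cycle %d: issue from %s\n",
                       cycle, th[cand].name);
                break;
            }
        }
        cur = (cur + 1) % NTHREADS;       /* rotate regardless */
    }

    /* Coarse-grained: stay with one thread until a costly stall
     * (a modeled L2 miss), then switch. */
    th[1].stall_until = 0;
    cur = 0;
    for (int cycle = 0; cycle < 6; cycle++) {
        if (cycle == 2)                   /* model an L2 miss in A */
            th[cur].stall_until = cycle + 10;
        if (!ready(&th[cur], cycle))
            cur = (cur + 1) % NTHREADS;   /* switch only on the stall */
        printf("coarse cycle %d: issue from %s\n", cycle, th[cur].name);
    }
    return 0;
}
```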
Simultaneous Multithreading (SMT)
• Simultaneous multithreading (SMT): insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading
– Large set of virtual registers that can be used to hold register sets for independent threads
– Register renaming provides unique register identifiers
– Instructions from multiple threads can be mixed in the data path
– Without confusing sources and destinations across threads!
– Out-of-order completion allows the threads to execute out of order and get better utilization of the HW
• Just add a per-thread renaming table and keep separate PCs (sketched below)
– Independent commitment can be supported via a separate reorder buffer for each thread
Source: Microprocessor Report, December 6, 1999, “Compaq Chooses SMT for Alpha”
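A minimal sketch of that "just add per-thread state" idea (toy types and a toy allocator, not any real microarchitecture): each hardware thread carries its own PC and rename table, while physical registers come from one shared pool, so renaming automatically keeps operands from different threads apart:

```c
#include <stdio.h>

#define NUM_ARCH  8
#define NUM_PHYS 32
#define NTHREADS  2

/* Per-thread front-end state: a PC and a rename table.
 * The physical register file itself is shared by all threads. */
typedef struct {
    int pc;
    int map[NUM_ARCH];   /* arch reg -> physical reg */
} HwThread;

static int next_phys;    /* shared free pointer (toy allocator) */

static int rename_dst(HwThread *t, int arch) {
    t->map[arch] = next_phys++ % NUM_PHYS;  /* fresh shared phys reg */
    return t->map[arch];
}

int main(void) {
    HwThread th[NTHREADS] = { {.pc = 0x100}, {.pc = 0x200} };

    /* Both threads write architectural r1; renaming maps them to
     * different physical registers, so they never collide. */
    for (int i = 0; i < NTHREADS; i++) {
        int p = rename_dst(&th[i], 1);
        printf("thread %d @pc=0x%x: arch r1 -> phys p%d\n",
               i, th[i].pc, p);
        th[i].pc += 4;
    }
    return 0;
}
```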
Simultaneous Multithreading
[Diagram: per-cycle issue slots for one thread on 8 units vs. two threads on 8 units. M = load/store, FX = fixed point, FP = floating point, BR = branch, CC = condition codes]
Multithreaded Categories
[Diagram: issue slots over time (processor cycles) for superscalar, coarse-grained, fine-grained, SMT, and multiprocessing designs, with threads 1–5 and idle slots distinguished by color]
Design Challenges in SMT
• What is the impact on single-thread performance?
– Preferred-thread approach
– Sacrifices neither throughput nor single-thread performance?
– Nope: processor will sacrifice some throughput when the preferred thread stalls
• Larger register file needed to hold multiple contexts
• Must not affect clock cycle, especially in:
– Instruction issue: more candidate instructions to consider
– Instruction completion: hard to choose which to commit
• Must ensure that cache and TLB conflicts caused by SMT don't degrade performance
Digression: Covert Channels
• Imagine you're a spy with an account on Knuth
– You want to communicate a secret to Geoff
– The secret is reasonably small
– The FBI is watching your account and your e-mail
• Solution: process spawning (sketched below)
– Once a second, Geoff spawns a process
– It records its own PID, waits 10 ms, forks & records the child PID
– Once a second, you send one bit of information
– If the bit is zero, you do nothing
– If the bit is one, you spawn processes as fast as possible
– If Geoff sees a big PID gap, he records “1”, else “0”
• Many variations on this basic idea
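A hedged sketch of the sender's side of this channel in C (POSIX fork/waitpid; the receiver's probe is omitted, and the payload and burst size are illustrative): a "1" bit is sent by burning through PIDs with short-lived children, a "0" by staying quiet:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

/* Send one covert bit per second: a burst of forks makes the
 * system-wide PID counter jump, which the receiver can detect by
 * forking its own probe children and watching for a PID gap. */
static void send_bit(int bit) {
    if (bit) {
        for (int i = 0; i < 200; i++) {    /* burn PIDs */
            pid_t pid = fork();
            if (pid == 0) _exit(0);        /* child exits at once */
            if (pid > 0) waitpid(pid, NULL, 0);
        }
    }
    sleep(1);                              /* one bit per second */
}

int main(void) {
    const char *secret = "1011";           /* illustrative payload */
    for (const char *p = secret; *p; p++) {
        send_bit(*p == '1');
        printf("sent %c\n", *p);
    }
    return 0;
}
```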
Covert-Channel Attacks on Crypto
• Most (not all) crypto code behaves differently on a “1” bit in the key vs. a “0” bit
– Runs longer or shorter
– Uses more or less power
– Accesses different memory
– Etc.
• Usually called “information leakage”
• Has been successfully used in the lab to crack strong crypto
– Even recovering some bits makes a brute-force attack on the remainder practical
• Some modern implementations fight back by doing wasted work on the shorter path of an “if”, etc. (see the square-and-multiply sketch below)
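The classic source of this key-dependent behavior is square-and-multiply modular exponentiation; the textbook version below (not any particular library's code) does an extra multiply exactly on the "1" bits of the exponent, which is the difference in time, power, and memory access that these attacks measure:

```c
#include <stdio.h>
#include <stdint.h>

/* Textbook left-to-right square-and-multiply.  The branch on each
 * key bit does extra work for a '1', leaking the key through
 * timing, power, and memory-access patterns. */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t mod) {
    uint64_t result = 1;
    for (int i = 63; i >= 0; i--) {
        result = (result * result) % mod;     /* always: square */
        if ((exp >> i) & 1)
            result = (result * base) % mod;   /* only on '1' bits */
    }
    return result;
}

int main(void) {
    /* 5^117 mod 1009 -- small illustrative numbers */
    printf("%llu\n", (unsigned long long)modexp(5, 117, 1009));
    return 0;
}
```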
SMT Attack on SSH
• On an SMT machine, the lower-priority thread's execution rate depends on the higher-priority one's instructions
– More stalls in the top thread mean more speed in the bottom one
– Stalls vary depending on what the crypto code is doing
• Operates at a very low level
– Thus much harder to defend against
• Successful attack on ssh keys has been demonstrated in the lab
• Best known defense: don't do SMT
– Careful coding of the crypto could probably also work
• Note that this also applies to things like the cache and TLB
– Lots of ways to leak information unintentionally!
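A rough sketch of the probe side of such an attack (heavily simplified; a real attack calibrates windows and filters noise much more carefully): the low-priority thread counts tight-loop iterations per fixed wall-clock window, and spikes in the count reveal cycles ceded to it while the victim thread stalls:

```c
#include <stdio.h>
#include <time.h>

/* Probe for an SMT timing channel (simplified sketch): count
 * tight-loop iterations per fixed window.  When the co-resident
 * victim thread stalls (e.g., on cache misses), this thread gets
 * more issue slots and the count rises. */
static long probe_window_ns(long window_ns) {
    struct timespec start, now;
    long iters = 0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        iters++;
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000000L +
             (now.tv_nsec - start.tv_nsec) < window_ns);
    return iters;
}

int main(void) {
    for (int i = 0; i < 10; i++)
        printf("window %d: %ld iterations\n",
               i, probe_window_ns(1000000L));   /* 1 ms windows */
    return 0;
}
```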
Power 4
[Pipeline diagram] Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each can issue an instruction every cycle.
Power 4 vs. Power 5
[Pipeline diagrams: Power 4 has 2 commits (architected register sets); Power 5 adds 2 fetch (PC) and 2 initial decodes]
Power 5 Data Flow
[Diagram] Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
Power 5 Thread Performance
[Chart] Relative priority of each thread is controllable in hardware. For balanced operation, both threads run slower than if they “owned” the machine.
Changes in Power 5 to Support SMT
• Increased associativity of L1 instruction cache and instruction-address translation buffers
• Added per-thread load and store queues
• Increased size of L2 (1.92 vs. 1.44 MB) and L3 caches
• Added separate instruction prefetch and buffering per thread
• Increased virtual registers from 152 to 240
• Increased size of several issue queues
• Power 5 core is about 24% larger than the Power 4 core because of SMT support
Initial Performance of SMT
• Pentium 4 Extreme SMT yields 1.01 speedup for the SPECint_rate benchmark; 1.07 for SPECfp_rate
– Pentium 4 is dual-threaded SMT
– SPECRate requires each benchmark to be run against a vendor-selected number of copies of the same benchmark
• Pairing each of the 26 SPEC benchmarks with every other (26² runs) on Pentium 4 gives speedups from 0.90 to 1.58; the average was 1.20
• 8-processor Power 5 server is 1.23× faster for SPECint_rate with SMT, 1.16× faster for SPECfp_rate
• Power 5 running 2 copies of each app had speedups between 0.89 and 1.41
– Most apps gained some
– Floating-point apps had the most cache conflicts and the least gains
No Silver Bullet for ILP
• No obvious overall leader in performance
– AMD Athlon leads on SPECint performance, followed by the Pentium 4, Itanium 2, and Power 5
– Itanium 2 and Power 5 clearly dominate the Athlon and Pentium 4 on SPECfp
• Itanium 2 is the least efficient processor on both floating-point and integer code by all but one efficiency measure (SPECfp/Watt)
• Athlon and Pentium 4 both use transistors and area efficiently
• IBM Power 5 is the most effective user of energy on SPECfp, and essentially tied on SPECint
Limits to ILP
• Doubling issue rates above today's 3-6 instructions per clock probably requires a processor to:
– Issue 3-4 data-memory accesses per cycle,
– Resolve 2-3 branches per cycle,
– Rename and access more than 20 registers per cycle, and
– Fetch 12-24 instructions per cycle
• The complexity of implementing these capabilities is likely to mean sacrifices in maximum clock rate
– E.g., the widest-issue processor is the Itanium 2
– It also has the slowest clock rate
– Despite consuming the most power!
Limits to ILP (cont’d)
• Most ways to increase performance also boost power consumption
• Key question is energy efficiency: does a method increase power consumption faster than it boosts performance?
• Multiple-issue techniques are energy-inefficient:
– They incur logic overhead that grows faster than the issue rate
– There is a growing gap between peak issue rates and sustained performance
• Number of transistors switching = f(peak issue rate); performance = f(sustained rate); a growing gap between peak and sustained performance ⇒ increasing energy per unit of performance
– E.g., if doubling peak issue width roughly doubles the switching transistors but raises sustained IPC by only 30%, energy per instruction rises by roughly 2/1.3 ≈ 1.5×
Commentary
• Itanium is not a significant breakthrough in scaling ILP or in avoiding the problems of complexity and power consumption
• Instead of pursuing more ILP, architects are turning to TLP using single-chip multiprocessors
– In 2000, IBM announced the Power4, the 1st commercial single-chip, general-purpose multiprocessor: it has two Power3 processors and an integrated L2 cache
– Sun Microsystems, AMD, and Intel have also switched focus from aggressive uniprocessors to single-chip multiprocessors
• The right balance of ILP and TLP is unclear today
– Maybe desktops (mostly single-threaded?) need a different design than servers (which can do lots of TLP)
And in conclusion …
• Limits to ILP (power efficiency, compilers, dependencies …) seem to restrict practical designs to 3- to 6-wide issue
• Explicit parallelism (data-level or thread-level parallelism) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on big stalls vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an out-of-order superscalar microarchitecture
– Instead of replicating registers, reuse the rename registers
• Itanium/EPIC/VLIW is not a breakthrough in ILP
• The right balance of ILP and TLP will be decided in the marketplace