Chip Multi Processor

Chip Multi Processor • VLSI Architectures Seminar - 048879 • Evgeny Bolotin • EE Department, Technion, Israel • November 2004

Outline • Intro • A Single Chip Multiprocessor (IEEE Computer 97) • IBM Power5 (IEEE Micro 2004) • Energy Efficiency (CMP vs. SMT), ICS 2004 • Intel Network Processor IXP2800 • Niagara (Sun, October 2004) • Summary

When are two heads better than one?

CMP Motivation • How to utilize available silicon? • Speculation (aggressive superscalar) • Simultaneous Multithreading (SMT, Hyper-Threading) • Several processors on a single chip • What is a CMP (Chip MultiProcessor)? • Several processors (several masters) • Both shared and distributed memory architectures • Both homogeneous and heterogeneous processor types • Why? • Wire delays • Diminishing returns from uniprocessors • Very long design and verification times for modern processors

A Reminder: SMT (Simultaneous Multi Threading)
SMT: • Pool of execution units (wide machine) • Several logical processors • Copy of state for each • Multiple threads run concurrently • Better utilization and latency tolerance
CMP: • Simple cores • Moderate amount of parallelism • Threads run concurrently on different cores

A Reminder: SMT (Simultaneous Multi Threading) SMT vs. CMP

A Single Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer 97 • Compares Superscalar (SS), SMT and CMP • For the same area (a billion-transistor DRAM area) • Superscalar and SMT: very complex • Wide • Advanced branch prediction • Register renaming • OOO instruction issue • Non-blocking data caches

SS and SMT vs. CMP • CPU Cores: Three main hardware design problems (of SS and SMT): • Area increases quadratically with core complexity • Number of registers: O(instruction window size) • Register ports: O(issue width) • CMP solves this problem (area is roughly linear in issue width) • Longer cycle times • Long wires, many MUXes and crossbars • Large buffers, queues and register files • Clustering (decreases ILP) or deep pipelining (branch misprediction penalties) • CMP allows a short cycle time (with little effort) • Small and fast • Relies on software to schedule • Poor ILP • Complex design and verification
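The area argument on this slide can be made concrete with a toy model. This is a minimal sketch, assuming core area grows exactly with the square of issue width (the slide only says "quadratically") and ignoring caches and interconnect. The function names are illustrative, and the 1 x 12-issue and 8 x 2-issue configurations are chosen to match the 12 and 16 issue slots mentioned in the performance comparison.

```python
def core_area(issue_width, unit_area=1.0):
    """Relative area of one core, ASSUMING area ~ issue_width ** 2."""
    return unit_area * issue_width ** 2


def chip_area(num_cores, issue_width):
    """Total core area of a chip built from identical cores (caches ignored)."""
    return num_cores * core_area(issue_width)


wide = chip_area(num_cores=1, issue_width=12)      # one wide SS/SMT core
cmp_chip = chip_area(num_cores=8, issue_width=2)   # eight simple cores

print(f"1 x 12-issue core: {wide:.0f} area units for 12 issue slots")
print(f"8 x 2-issue CMP:   {cmp_chip:.0f} area units for 16 issue slots")
```

Under this assumption the CMP offers more issue slots in less core area. The numbers are only directional, since real area also depends on caches, wiring and the interconnect between cores.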

SS and SMT vs. CMP • Memory: • 12-issue SS or SMT requires a multiported data cache (4-6 ports) • 2 x 128 Kbyte (2-cycle latency) • CMP: 16 x 16 Kbyte (single-cycle latency), but the secondary cache is slower (multiported) • Shared memory: write-through caches

Performance comparison • Compress (integer apps): low ILP and no TLP • Mpeg-2 (multimedia apps): high ILP and TLP and moderate memory requirements (parallelized by hand) • + SMT utilizes core resources better • + But CMP has 16 issue slots instead of 12 • Tomcatv (FP applications): large loop-level parallelism and large memory bandwidth (TLP by compiler) • + CMP has large memory bandwidth on primary cache • - SMT fundamental problem: unified and slow cache • Multiprogram: integer multiprogramming workload, all computation-intensive (low ILP, high PLP)

A Single Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer 97 • TLP and PLP will become widespread in future applications • Various multimedia applications • Compilers and OS • Favours CMP • CMP: • Better performance with simple hardware • Higher clock rates, better memory bandwidth • Shorter pipelines • SMT has better utilization, but CMP has more resources (no wide-issue logic) • Although CMP is bad when there is neither TLP nor ILP (compress), SMT and SS are not much better

IBM Power5 chip: a dual-core multithreaded processor, Ron Kalla et al., IEEE Micro, Volume 24, Issue 2, Mar-Apr 2004
Power4 system (2001): • 174 million transistors, 2 cores • L1: 64 KB I$, 128 KB D$, 128 B lines • Shared L2 (1.5 MB) and L3 (32 MB) • 5-way each (8 issue / 5 retire) • 100-instruction window • 15-stage pipe
Power5 system (2004): • 130 nm • Enhancements: • Two-way SMT (more threads: additional complexity is unjustified, diminishing returns + cache thrashing) • L3 closer to L2 (less traffic, system scales better, to 64 from 32) • Shared L2 (1.875 MB) and L3 (36 MB) • Memory controller on chip • Fewer system chips, reduced memory latency

IBM Power5 chip • 8 metal layers • 389 mm^2 • L3 directory • On-chip memory controller (MC)

IBM Power5 chip • Processor Core: • SMT and Single Thread (ST) operation modes • 8-way fetch and translation (shared resource) • Branch prediction (shared) • 3 BHTs (shared) • Return stack (separate) • 120 GPRs and 120 FPRs (dynamically shared in SMT, all used in ST) • Shared issue queues and execution units (XUs)

IBM Power5 chip • Enhanced SMT: • Dynamic resource balancing (e.g. monitors L2 cache misses and throttles the thread) • Adjustable thread priority: 8 levels, affects decode cycles (set by software: idle loop, real-time apps, etc.)
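To illustrate how a priority difference between two threads could translate into decode cycles, here is a minimal sketch. The rule used, where the lower-priority thread gets one of every 2^(|X-Y|+1) decode cycles, is an assumption made for illustration and is not stated on the slide. The function name is hypothetical, a larger number is assumed to mean higher priority, and special cases such as a switched-off thread are not modelled.

```python
def decode_share(prio_a, prio_b):
    """Return (share_a, share_b) of decode cycles for two threads.

    ASSUMED rule: the lower-priority thread gets 1 of every
    R = 2 ** (abs(prio_a - prio_b) + 1) decode cycles and the
    higher-priority thread gets the rest. Equal priorities give R = 2,
    i.e. an even split.
    """
    r = 2 ** (abs(prio_a - prio_b) + 1)
    low, high = 1 / r, 1 - 1 / r
    return (high, low) if prio_a >= prio_b else (low, high)


for pa, pb in [(4, 4), (6, 4), (2, 5)]:
    print(pa, pb, decode_share(pa, pb))
```

With this rule a difference of two levels already leaves the lower-priority thread one decode cycle in eight, which shows why software only needs a few levels to express "idle loop" versus "real-time" behaviour.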

IBM Power5 chip • Dynamic Power Management: • Extensive dynamic clock gating (on a per-cycle basis) • Dual-Vt for reduced leakage • Low-power mode (x32 slower instruction dispatch, see figure) • ST operation: • All physical resources go to the active thread

Idea: More Flexible CMP (?) • CMP has poor flexibility • Bad for legacy code and low TLP

More Flexible CMP (?) • Programmable SMTs

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads, Ruchira Sasanka et al. (University of Illinois and Intel Arch. Research Lab), ICS’04, France • Goal: for the same performance, compare the energy efficiency of CMP and SMT for multimedia (MM) apps • MM applications are becoming important and have high TLP • What about energy? • SMT might appear more energy efficient, since it utilizes the hardware resources better: WRONG! • Compare at many performance points: different core architectures and frequency/voltage settings

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads • Design Space: • 2-thread and 4-thread systems checked • Based on an out-of-order superscalar (MIPS R10000) • Fetch/decode width varies from 2 to 8 • Other parameters, e.g. window size and number of execution units, change accordingly • Frequency varies from 600 MHz to 1.6 GHz (voltage is scaled accordingly)
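Why scaling voltage together with frequency matters for energy can be sketched with the usual dynamic power relation P ~ C * V^2 * f. This is a minimal sketch, assuming voltage scales linearly with frequency (the slide only says voltage is scaled "accordingly") and taking 1.6 GHz as the reference point. The function names are illustrative.

```python
def relative_dynamic_power(freq_ghz, base_freq_ghz=1.6):
    """Dynamic power relative to the base point, from P ~ C * V**2 * f,
    ASSUMING V scales linearly with f (so P ~ f**3)."""
    scale = freq_ghz / base_freq_ghz
    return scale ** 3


def relative_energy_per_cycle(freq_ghz, base_freq_ghz=1.6):
    """Energy per cycle is power / frequency, so it scales as V**2 ~ f**2."""
    scale = freq_ghz / base_freq_ghz
    return scale ** 2


for f in (0.6, 0.8, 1.6):
    print(f"{f:.1f} GHz: power x{relative_dynamic_power(f):.3f}, "
          f"energy/cycle x{relative_energy_per_cycle(f):.3f}")
```

Under these assumptions, halving the frequency cuts dynamic power to one eighth and energy per cycle to one quarter, which is why a design that reaches the same performance at a lower frequency can come out ahead on energy.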

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads • Core and Memory parameters: • In SMT: • Threads share most of the resources • Separate BPT, return address stack, and 32 integer and 32 FP registers per thread • ICOUNT policy to prioritize instruction fetch (prefers the “unstuck” thread) • Caches are modeled, and sizes are chosen to achieve a 99% hit rate for I$ and 98% for D$ • CMP: 16K for I$ and 8K for D$ • SMT was given the same amount of cache per thread
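The ICOUNT fetch policy mentioned on this slide can be sketched in a few lines. This is a minimal sketch, assuming the per-thread count of instructions held in the front end and issue queues is already available; the function name and the tie-breaking rule are illustrative choices, not details from the paper.

```python
def icount_pick(in_flight):
    """Pick the thread to fetch from next.

    in_flight maps thread id -> number of instructions that thread
    currently holds in the front end and issue queues. ICOUNT favours
    the thread with the fewest, i.e. the one that is least "stuck".
    Ties go to the lowest thread id (an arbitrary choice made here).
    """
    return min(sorted(in_flight), key=lambda tid: in_flight[tid])


# Thread 0 is backed up behind a long-latency miss, so thread 1 fetches.
print(icount_pick({0: 12, 1: 3}))
```

A thread that is stalled accumulates instructions in the queues and is automatically deprioritized, so the fetch bandwidth flows to threads that are making progress.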

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads • Workload: • 8 representative single-threaded multimedia apps (video and speech codecs) • A given benchmark can be parallelized: each thread processes a different frame • Consider two- and four-threaded benchmarks • Run the same number of frames on both systems • Compare using Energy per Instruction (EPI) and Time per Instruction (TPI) • Simulation Environment: • Modified RSIM simulator: a cycle-level simulator that models branch and address speculation and contention • Use enhanced Wattch for dynamic power modeling • Model bus power • Static power is modeled by the HotLeakage model (2% of dynamic at 0.13 µm) • Assume 90% clock gating
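The two comparison metrics, EPI and TPI, are simple ratios. This is a minimal sketch, assuming static energy can be approximated as a fixed fraction of dynamic energy (the 2% figure from the slide; a real leakage model is more detailed). The function names and the sample numbers are illustrative and are not results from the paper.

```python
def total_energy(dynamic_energy_j, static_fraction=0.02):
    """Dynamic plus static energy, with static ASSUMED to be a fixed
    fraction of dynamic (0.02 follows the 2% figure on the slide)."""
    return dynamic_energy_j * (1.0 + static_fraction)


def epi_tpi(total_energy_j, total_time_s, instructions):
    """Return (energy per instruction in J, time per instruction in s)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return total_energy_j / instructions, total_time_s / instructions


# Made-up run: 2 J of dynamic energy, 1 s, one billion instructions.
epi, tpi = epi_tpi(total_energy(2.0), 1.0, 1_000_000_000)
print(f"EPI = {epi * 1e9:.2f} nJ, TPI = {tpi * 1e9:.2f} ns")
```

Because both systems run the same number of frames, dividing by the instruction count puts designs with different widths and frequencies on a common scale.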

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads • Metrics (figure; frequency is varied)

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads Results: two-thread workloads

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads Results: four-thread workloads

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads • Summary of simulation results: • CMP always gives the lowest EPI (the authors assumed enough threads are available) • For four-thread workloads CMP is significantly better • The difference increases with increased performance • For a fixed system: the overall best architecture can be picked (by minimum deviation from the “best” curve) • Interesting graph (figure)
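Picking one fixed architecture by minimum deviation from the “best” curve can be sketched as follows. This is a minimal sketch: the best curve is taken as the lowest EPI at each performance point, and mean relative deviation is an assumed distance measure (the slide does not define one). The architecture names and EPI values are made up for illustration.

```python
def best_curve(epi_by_arch):
    """Lowest EPI at each performance point across all architectures."""
    return [min(column) for column in zip(*epi_by_arch.values())]


def pick_overall_best(epi_by_arch):
    """Architecture with the smallest mean relative deviation from the
    best curve. The deviation measure is an ASSUMPTION of this sketch."""
    best = best_curve(epi_by_arch)

    def deviation(epis):
        return sum((e - b) / b for e, b in zip(epis, best)) / len(best)

    return min(epi_by_arch, key=lambda arch: deviation(epi_by_arch[arch]))


# Made-up EPI values (nJ) at three performance points, low to high.
example = {
    "narrow": [1.0, 1.6, 3.0],
    "medium": [1.2, 1.5, 2.2],
    "wide":   [1.6, 1.8, 2.0],
}
print(best_curve(example))
print(pick_overall_best(example))
```

In this made-up data no single design is best everywhere, but the medium design stays closest to the envelope across the range, which is the kind of trade-off the slide describes for a fixed system.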

The Energy Efficiency of CMP vs. SMT for Multimedia Workloads • IDEAS: ??? • Hybrid is much better than SMT and comes very close to CMP • Best of all worlds? • Maybe adaptive architecture and DVS? • Adaptive fetch width? • DVS much easier for CMP? • High and medium performance regions? • Heterogeneous (ask Zvika)

Intel Network Processor (IXP 2800) • XScale processor (700 MHz) • 16 Microengines (1.4 GHz) • Power dissipation ~30 W • Package area: 38 x 38 mm^2 • Vdd = 1.3 V

Intel Network Processor (IXP 2800) • Microengine: • 6-stage pipeline • Huge register set • Multiple threads (8) • Memories • Hardware accelerators

Niagara: • 8 UltraSparc II cores x 4 threads each • They hope for x15 performance • Sun roadmap: blades (figure)

27 October 2004 (EETimes): Sun Microsystems has manufactured first silicon of its next-generation Niagara processor, which isn't due to ship until 2006. The advanced chip contains eight 64-bit UltraSparc cores and will power new systems, which Sun plans to position as "throughput computing" engines capable of handling network-intensive tasks. "We are now with a working chip," David Yen, Sun's executive vice president for scalable systems, tells VARBusiness. "The 1.0 design is running in the lab. It's running the Solaris 9 operating system on top of it, with 32 application threads on top of Solaris." The eight-core Niagara will dissipate only 60 W of power, according to Yen. That's a fraction of the 100 W or so consumed by today's dual-core UltraSparc IV and is also likely beneath the power figure expected from dual-core processors due out of Intel and AMD in 2005. Niagara will be fabricated in an advanced 90-nm process. It also boasts a host of on-chip features, which make its design highly integrated. The initial version will include an on-board Ethernet controller and a built-in memory controller. Subsequent versions, according to Yen, "will have 10-Gigabit Ethernet and even cryptologic [capability] built on the chip." Sun says the year-long interval until the chip comes to market will be used to bring Sun's partners up to speed.

Conclusions • CMP reduces hardware/power overhead • SMT can yield better single-thread performance (at high cost) • CMP can improve application performance if the compiler can extract thread-level parallelism • What is the most effective use of on-chip real estate? • Depends on the workload • Depends on compiler technology • Hybrid/Reconfigurable? • Heterogeneous?

References: • VLSI Architectures Lecture (Uri Weiser) • Hyper-Threading Technology Architecture and Microarchitecture, Intel • A Single Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer 97 • IBM Power5 chip: a dual-core multithreaded processor, R. Kalla, B. Sinharoy, J.M. Tendler, IEEE Micro, Volume 24, Issue 2, Mar-Apr 2004, pp. 40-47 • The Energy Efficiency of CMP vs. SMT for Multimedia Workloads, Ruchira Sasanka et al. (University of Illinois and Intel Arch. Research Lab), ICS’04 • Network processors, Intel Technology Journal • “Sun weaves multi-media future”, K. Krewell, Microprocessor Report, April 2003

Energy: CMP vs SMT
