Advanced Computer Architecture CSE 8383 April 17, 2008 Session 11
Contents • Multi-Core: Why now? A Paradigm Shift; Multi-Core Architecture • Case Studies: IBM Cell, Intel Core 2 Duo, AMD
Background • Wafer: Thin slice of semiconducting material, such as a silicon crystal, upon which microcircuits are constructed • Die Size: The die size of a processor refers to its physical surface area on the wafer, typically measured in square millimeters (mm²). In essence, a "die" is really a chip: the smaller the chip, the more of them can be made from a single wafer. • Circuit Size: The level of miniaturization of the processor. To pack more transistors into the same space, they must continually be made smaller and smaller. Measured in microns (µm) or nanometers (nm)
Examples • 386 • Die Size: 42 mm² • 1.0 µm technology • 275,000 transistors • 486 • Die Size: 90 mm² • 0.7 µm technology • 1.2 million transistors • Pentium • Die Size: 148 mm² • 0.5 µm technology • 3.2 million transistors • Pentium III • Die Size: 106 mm² • 0.18 µm technology • 28 million transistors
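To make the wafer/die relationship concrete, here is a rough dies-per-wafer sketch in Python (not from the slides); the 200 mm wafer diameter and the edge-loss correction are assumptions based on a common textbook approximation.

import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    # Usable wafer area divided by die area, minus a correction
    # for partial dies lost around the wafer edge.
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

# Die areas from the examples above, on an assumed 200 mm wafer
for name, area in [("386", 42), ("486", 90), ("Pentium", 148), ("Pentium III", 106)]:
    print(name, dies_per_wafer(200, area), "dies per wafer")

The smaller the die, the more usable chips each wafer yields, which is why circuit-size reduction matters economically as well as electrically.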
Pentium III (0.18 µm process technology) Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
Increasing Die Size • Using the same process technology, increasing the die size 2-3X yields only a 1.5-1.7X gain in performance • Power is proportional to die area × frequency • We cannot keep producing microprocessors with ever-increasing die size: the constraint is POWER
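A one-line check of that trade-off, as a Python sketch; the 2.5X die growth and 1.6X speedup are simply mid-range values taken from the slide's 2-3X and 1.5-1.7X figures, with frequency assumed constant.

# Power ~ die area x frequency, so at constant frequency power tracks die area.
die_scale = 2.5       # 2-3X larger die (mid-range assumption)
perf_scale = 1.6      # 1.5-1.7X performance (mid-range assumption)
power_scale = die_scale

print("performance per watt changes by", round(perf_scale / power_scale, 2), "x")  # ~0.64x: efficiency drops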
Reducing Circuit Size Reducing circuit size in particular is key to reducing the size of the chip. • The first-generation Pentium used a 0.8 micron circuit size and required 296 square millimeters per chip. • The second-generation chip had the circuit size reduced to 0.6 microns, and the die size dropped by a full 50% to 148 square millimeters.
Shrinking transistors by 30% every generation doubles transistor density; oxide thickness shrinks, frequency increases, and threshold voltage decreases. But gate oxide thickness cannot keep shrinking, which slows the frequency increase and limits further threshold-voltage reduction.
Processor Evolution: from generation i (0.5 µm, for example) to generation i+1 (0.35 µm, for example) • Gate delay reduces by 1/√2 (frequency up by √2) • Number of transistors in a constant area goes up by 2× (deeper pipelines, more ILP, more caches) • Additional transistors enable an additional increase in performance • Result: 2x performance at roughly equal cost
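As an illustration of this scaling recipe, the following Python loop compounds the per-generation factors; the 1/√2 delay and 2x transistor numbers are the idealized assumptions from the slide, not measured data.

freq = transistors = perf = 1.0
for gen in range(1, 4):
    freq *= 2 ** 0.5      # gate delay down by 1/sqrt(2) -> frequency up by sqrt(2)
    transistors *= 2.0    # 2x transistors in the same die area
    perf *= 2.0           # ~2x performance per generation (slide's rule of thumb)
    print(f"after {gen} generation(s): freq x{freq:.2f}, transistors x{transistors:.0f}, perf x{perf:.0f}")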
What happens to power if we hold die size constant at each generation? Holding die size constant allows ~100% growth in transistors each generation. Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
What happens to die size if we hold power constant at each generation? Die size has to shrink ~25% in area each generation, allowing only ~50% growth in transistors, which limits PERFORMANCE. Power density is still a problem. Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
Power Density continues to soar. Source: Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W)
Business as Usual Won't Work: Power is a Major Barrier • As processors continue to improve in performance and speed, power consumption and heat dissipation have become major challenges • Higher costs: • Thermal packaging • Fans • Electricity • Air conditioning
A New Paradigm Shift • Old paradigm: performance == improved frequency, unconstrained power, voltage scaling • New paradigm: performance == improved IPC, multi-core, power-efficient microarchitecture advancement
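A minimal first-order model of the two paradigms, as a Python sketch; the IPC, frequency, and core-count numbers are purely illustrative, and perfect scaling across cores is assumed.

def perf(ipc, freq_ghz, cores=1):
    # Rough throughput: instructions retired per nanosecond across all cores.
    return ipc * freq_ghz * cores

old_paradigm = perf(ipc=1.0, freq_ghz=3.8)             # chase frequency on one core
new_paradigm = perf(ipc=1.5, freq_ghz=2.4, cores=2)    # better IPC, lower clock, two cores
print(old_paradigm, new_paradigm)                      # 3.8 vs 7.2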
Multiple CPUs on a Single Chip An attractive option for chip designers because of the availability of cores from earlier processor generations, which, when shrunk down to present-day process technology, are small enough for aggregation into a single die
Multi-core: in technology generation i+1, several generation-i cores fit on a single die • Gate delay does not reduce much • The frequency and performance of each core is the same as, or a little less than, the previous generation
From HT to Many-Core (chart: hardware threads per chip, 2003-2013) • Hyper-Threading (HT) • Multi-core era: scalar and parallel applications, ~10 HW threads • Many-core era: massively parallel applications, ~100 HW threads • Intel predicts 100's of cores on a chip in 2015
Multi-cores are Reality (chart: number of cores per chip over time) Source: Saman Amarasinghe, MIT (6.189 2007, lecture 1)
Multi-core Architecture • Multiple cores are being integrated on a single chip and made available for general-purpose computing • Higher levels of integration: multiple processing cores, caches, memory controllers, some I/O processing • Network on Chip (NoC)
Interconnection Networks (diagram: processors P connected to memories M through an interconnection network) • Shared memory: one copy of data shared among multiple cores; synchronization via locking (e.g., Intel) • Distributed memory: cores access local data and exchange data explicitly
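A small Python sketch (not from the slides) contrasting the two models: shared-memory threads synchronize on one copy of the data with a lock, while "distributed" workers keep local data and exchange results as messages through a queue.

import threading, queue

counter = 0                      # one shared copy of the data
lock = threading.Lock()

def shared_memory_worker(n):
    global counter
    for _ in range(n):
        with lock:               # synchronization via locking
            counter += 1

results = queue.Queue()          # message channel between workers

def message_passing_worker(n):
    local = n                    # work only on local data
    results.put(local)           # exchange data explicitly

threads = [threading.Thread(target=shared_memory_worker, args=(1000,)) for _ in range(4)]
threads += [threading.Thread(target=message_passing_worker, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter, sum(results.get() for _ in range(4)))   # 4000 and 4000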
Memory Access Alternatives • Symmetric Multiprocessors (SMP) • Message Passing (MP) • Distributed Shared Memory (DSM)
Network on Chip (NoC) (diagram: control, data, and I/O traffic carried over a switch network rather than a traditional bus)
Shared Memory organizations (diagram) • Shared primary cache: cores share the primary cache, secondary cache, and global memory • Shared secondary cache: each core has a private primary cache (PC); the secondary cache and global memory are shared • Shared global memory: each core has private primary (PC) and secondary (SC) caches; only global memory is shared
General Architecture (diagram) • Conventional microprocessor: a single CPU core (registers, L1 I$, L1 D$) with an L2 cache, connected to main memory and I/O • Multiple cores: CPU cores 1..N, each with its own registers, L1 I$, L1 D$, and L2 cache, sharing main memory and I/O
General Architecture (cont.) (diagram) • Shared cache: CPU cores 1..N each have private registers and L1 I$/D$ but share a single L2 cache, main memory, and I/O • Multithreaded shared cache: each core additionally holds multiple register sets (hardware thread contexts) on top of the shared L2 cache
Cell Highlights • Supercomputer on a chip • Multi-core microprocessor (9 cores) • >4 GHz clock frequency • 10X performance for many applications
Key Attributes • Cell is multi-core: contains a 64-bit Power architecture core and 8 synergistic processor elements • Cell is a broadband architecture: each SPE is a RISC architecture with SIMD organization and a local store; 128+ concurrent transactions to memory per processor • Cell is a real-time architecture: resource allocation (for bandwidth measurement) and locking caching (via a replacement management table) • Cell is a security-enabled architecture: SPEs can be isolated for flexible security programming
POWER Processing Element (PPE) • POWER Processing Unit (PPU) connected to a 512KB L2 cache • Responsible for running the OS and coordinating the SPEs • Key design goals: maximize the performance/power ratio as well as the performance/area ratio • Dual-issue, in-order processor with dual-thread support • Utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions
Synergistic Processing Elements (SPE) • Dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations • Modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC) • Compute engine with SIMD support and 256KB of dedicated local storage • The MFC contains a DMA controller with an associated MMU and an Atomic Unit to handle synchronization operations with other SPUs and the PPU
SPE (cont.) • Each SPE operates directly on instructions and data from its dedicated local store and relies on a channel interface to access main memory and other local stores • The channel interface, which resides in the MFC, runs independently of the SPU and can translate addresses and perform DMA transfers while the SPU continues program execution • SIMD support can perform operations on 16 8-bit, 8 16-bit, or 4 32-bit integers, or 4 single-precision floating-point numbers per cycle • At 3.2 GHz, each SPU is capable of up to 51.2 billion 8-bit integer operations or 25.6 GFLOPS in single precision
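The quoted throughput figures follow directly from the SIMD widths above; here is a quick arithmetic check in Python, assuming the single-precision number counts a fused multiply-add as two FLOPs (a common convention, not stated on the slide).

clock_hz = 3.2e9
ops_per_cycle_8bit = 16        # 16 x 8-bit integer lanes per cycle
flops_per_cycle_sp = 4 * 2     # 4 single-precision lanes, FMA counted as 2 FLOPs (assumption)

print(clock_hz * ops_per_cycle_8bit / 1e9, "billion 8-bit integer ops/s")   # 51.2
print(clock_hz * flops_per_cycle_sp / 1e9, "GFLOPS single precision")       # 25.6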
Four Levels of Parallelism • Blade level: 2 Cell processors per blade • Chip level: 9 cores • Instruction level: dual-issue pipelines on each SPE • Register level: native SIMD on SPE and PPE (VMX)
Element Interconnect Bus (EIB) • Implemented as a ring • Interconnects 12 elements: 1 PPE with 51.2GB/s aggregate bandwidth; 8 SPEs, each with 51.2GB/s aggregate bandwidth; the MIC with 25.6GB/s of memory bandwidth; 2 IOIFs with 35GB/s (out) and 25GB/s (in) of I/O bandwidth • Supports two transfer modes: DMA between SPEs, and MMIO/DMA between the PPE and system memory Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007
Element Interconnect Bus (EIB) The EIB consists of the following: • Four 16-byte-wide rings (two in each direction): each ring can handle up to 3 concurrent non-overlapping transfers, supporting up to 12 data transfers at a time • A shared command bus: distributes commands, sets up end-to-end transactions, and handles coherency • A central data arbiter connecting the 12 Cell elements: implemented in a star-like structure, it controls access to the EIB data rings on a per-transaction basis Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007
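Back-of-the-envelope bandwidth numbers implied by the bullets above, as a Python sketch; the assumption that the EIB runs at half the 3.2 GHz core clock is not stated on these slides.

bus_hz = 3.2e9 / 2             # assumption: EIB clocked at half the core frequency
ring_width_bytes = 16          # each ring is 16 bytes wide (from the slide)

per_direction = ring_width_bytes * bus_hz / 1e9       # 25.6 GB/s per element per direction
aggregate     = 2 * per_direction                     # 51.2 GB/s, matching the PPE/SPE figure
ceiling       = 12 * ring_width_bytes * bus_hz / 1e9  # 307.2 GB/s if all 12 transfers ran at once

print(per_direction, aggregate, ceiling)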
Cell Manufacturing Parameters • About 234 million transistors (compared with 125 million for the Pentium 4), running at more than 4.0 GHz • Compared to conventional processors, Cell is fairly large, with a die size of 221 square millimeters • The introductory design is fabricated in a 90 nm silicon-on-insulator (SOI) process • In March 2007 IBM announced that the 65 nm version of the Cell BE (Broadband Engine) is in production
Cell Power Consumption • Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz • Including the eight SPEs, the PPE, and other logic, the Cell processor dissipates close to 15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W at 4 GHz
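A quick sanity check of those totals in Python, using only the per-SPE and whole-chip numbers quoted above; the PPE-plus-other-logic share is simply computed as the remainder.

quoted_total = {2: 15, 3: 30, 4: 60}   # W for the whole chip at 2/3/4 GHz (from the slide)
per_spe      = {2: 1,  3: 2,  4: 4}    # W per SPE at the same clocks (from the slide)
for ghz in (2, 3, 4):
    spes = 8 * per_spe[ghz]
    print(f"{ghz} GHz: 8 SPEs = {spes} W, PPE + other logic ~ {quoted_total[ghz] - spes} W")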
Cell Power Management • Dynamic Power Management (DPM) • Five power management states • One linear sensor • Ten digital thermal sensors
Intel Core 2 Duo Highlights • Multi-core microprocessor (2 cores) • Clock frequencies range from 1.5 to 3 GHz • 2X performance for many applications • Dedicated level 1 caches and a shared level 2 cache • The shared L2 cache comes in two flavors, 2MB and 4MB, depending on the model • Supports 64-bit architecture
Intel Core 2 Duo Block Diagram (dedicated L1 caches, shared L2 cache) The two cores exchange data implicitly through the shared level 2 cache
Intel Core 2 Duo Architecture • Reduced front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to system memory, so only one copy of the data needs to be retrieved • In earlier dual-core designs with private caches, Core 1 had to retrieve data from Core 2 by going all the way through the FSB and main memory
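To see why resolving requests at the shared L2 matters, here is a toy latency comparison in Python; the cycle counts are hypothetical placeholders, not Intel figures.

SHARED_L2_HIT   = 14    # hypothetical: data found in the shared L2
FSB_PLUS_MEMORY = 250   # hypothetical: snoop across the FSB and fetch from DRAM

requests = 1_000_000
print("cycles via shared L2: ", requests * SHARED_L2_HIT)
print("cycles via FSB/memory:", requests * FSB_PLUS_MEMORY)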