
Advanced Computer Architecture CSE 8383

Advanced Computer Architecture CSE 8383. April 17, 2008 Session 11.


Presentation Transcript


  1. Advanced Computer Architecture CSE 8383 • April 17, 2008 • Session 11

  2. Contents • Multi-Core: Why now?; A Paradigm Shift; Multi-Core Architecture • Case Studies: IBM Cell; Intel Core 2 Duo; AMD

  3. The Path to Multi-Core

  4. Background • Wafer: a thin slice of semiconducting material, such as a silicon crystal, upon which microcircuits are constructed • Die size: the physical surface area of the processor on the wafer, typically measured in square millimeters (mm^2). In essence a "die" is a chip; the smaller the chip, the more of them can be made from a single wafer • Circuit size: the level of miniaturization of the processor. To pack more transistors into the same space, they must be made continually smaller. Measured in microns (µm) or nanometers (nm)

  5. Examples • 386C: die size 42 mm^2, 1.0 µm technology, 275,000 transistors • 486C: die size 90 mm^2, 0.7 µm technology, 1.2 million transistors • Pentium: die size 148 mm^2, 0.5 µm technology, 3.2 million transistors • Pentium III: die size 106 mm^2, 0.18 µm technology, 28 million transistors

  6. Pentium III (0.18 µm process technology) Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32

  7. nm Process Technology

  8. Increasing Die Size • Using the same technology, increasing the die size 2-3X yields only a 1.5-1.7X gain in performance • Power is proportional to die area × frequency • We cannot produce microprocessors with ever-increasing die size; the constraint is POWER
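
The trade-off on this slide can be put into rough numbers. The sketch below uses the slide's own relation (power proportional to die area times frequency) and, as an assumption not stated on the slide, a square-root relation between die area and performance (often called Pollack's rule). The function names are illustrative only.

```python
import math


def relative_performance(area_ratio: float) -> float:
    """ASSUMPTION: performance grows with the square root of die area."""
    return math.sqrt(area_ratio)


def relative_power(area_ratio: float, freq_ratio: float = 1.0) -> float:
    """Slide's relation: power is proportional to die area * frequency."""
    return area_ratio * freq_ratio


for area in (2.0, 3.0):
    perf = relative_performance(area)
    power = relative_power(area)
    print(f"{area:.0f}x area: ~{perf:.2f}x performance, "
          f"{power:.0f}x power, {perf / power:.2f}x performance per watt")
```

Under this assumption a 2-3X larger die lands close to the 1.5-1.7X range quoted on the slide, while power grows by the full 2-3X, so performance per watt falls.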

  9. Reducing Circuit Size • Reducing circuit size is key to reducing the size of the chip • The first-generation Pentium used a 0.8 micron circuit size and required 296 square millimeters per chip • The second-generation chip reduced the circuit size to 0.6 microns, and the die size dropped by a full 50% to 148 square millimeters
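
As a sanity check on the Pentium figures, the sketch below computes what a purely geometric shrink would give, assuming die area scales with the square of the feature size. That scaling rule is an idealization, and the function name is illustrative.

```python
def ideal_shrunk_area(area_mm2: float, old_feature_um: float,
                      new_feature_um: float) -> float:
    """Idealized shrink: area scales with the square of the feature size."""
    return area_mm2 * (new_feature_um / old_feature_um) ** 2


ideal = ideal_shrunk_area(296.0, 0.8, 0.6)
reported = 148.0  # second-generation die size quoted on the slide

print(f"ideal geometric shrink: {ideal:.1f} mm^2")
print(f"reported die size:      {reported:.1f} mm^2 "
      f"({reported / 296.0:.0%} of the original)")
```

The geometric estimate is about 166.5 mm^2, so the reported 148 mm^2 is somewhat better than feature-size scaling alone would predict.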

  10. Shrinking transistors by 30% every generation → transistor density doubles, oxide thickness shrinks, frequency increases, and threshold voltage decreases. Gate thickness cannot keep on shrinking → slower frequency increase, less threshold voltage reduction.
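
The link between a 30% shrink and doubled density is simple arithmetic, sketched below. The frequency figure assumes ideal scaling in which gate delay shrinks in proportion to the linear dimension; that is an idealization, and the function names are illustrative.

```python
def density_gain(linear_shrink: float) -> float:
    """Transistors per unit area after shrinking each dimension by linear_shrink."""
    scale = 1.0 - linear_shrink      # 0.7 for a 30% shrink
    return 1.0 / (scale * scale)     # area per transistor scales with scale^2


def ideal_frequency_gain(linear_shrink: float) -> float:
    """ASSUMPTION: gate delay scales with the linear dimension (ideal scaling)."""
    return 1.0 / (1.0 - linear_shrink)


print(f"density gain:   {density_gain(0.30):.2f}x")
print(f"frequency gain: {ideal_frequency_gain(0.30):.2f}x (ideal)")
```

A 0.7 linear scale gives 0.49 of the area per transistor, hence roughly twice the density.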

  11. Processor Evolution: Generation i (0.5 µm, for example) → Generation i+1 (0.35 µm, for example) • Gate delay reduces by 1/√2 (frequency up by √2) • Number of transistors in a constant area goes up by 2 (deeper pipelines, ILP, more caches) • Additional transistors enable an additional increase in performance • Result: 2x performance at roughly equal cost

  12. What happens to power if we hold die size constant at each generation? Allows ~ 100% growth in transistors each generation Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32

  13. What happens to die size if we hold power constant at each generation? Die size has to shrink ~25% in area each generation → 50% growth in transistors, which limits PERFORMANCE. Power density is still a problem. Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
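
The two growth figures (about 100% with constant die size, about 50% with constant power) follow from multiplying the area change by the per-generation density gain. A minimal sketch, assuming density doubles each generation; the function name is illustrative.

```python
def transistor_growth(area_ratio: float, density_gain: float = 2.0) -> float:
    """Fractional growth in transistor count from one generation to the next."""
    return area_ratio * density_gain - 1.0


# Constant die size: area ratio 1.0
print(f"constant die size: {transistor_growth(1.0):.0%} more transistors")
# Constant power: die area shrinks by about 25%
print(f"constant power:    {transistor_growth(0.75):.0%} more transistors")
```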

  14. Power Density continues to soar Source: Intel Developer Forum, Spring 2004 Pat Gelsinger (Pentium at 90 W)

  15. Business as Usual Won’t Work: Power is a Major Barrier • As processors continue to improve in performance and speed, power consumption and heat dissipation have become major challenges • Higher costs: thermal packaging, fans, electricity, air conditioning

  16. A New Paradigm Shift • Old paradigm: performance == improved frequency, unconstrained power, voltage scaling • New paradigm: performance == improved IPC, multi-core, power-efficient micro-architecture advancement

  17. Multiple CPUs on a Single Chip An attractive option for chip designers because of the availability of cores from earlier processor generations, which, when shrunk down to present-day process technology, are small enough for aggregation into a single die

  18. Multi-core [figure: a generation-i core in technology generation i becomes several generation-i cores on one die in technology generation i+1] • Gate delay does not reduce much • The frequency and performance of each core are the same as, or a little less than, the previous generation

  19. From HT to Many-Core [figure: increasing HW threads (1, 10, 100) versus year (2003 to 2013): HT, then the multi-core era (scalar and parallel applications), then the many-core era (massively parallel applications)] Intel predicts 100’s of cores on a chip in 2015

  20. Multi-cores are Reality [figure: # of cores] Source: Saman Amarasinghe, MIT (6.189 2007, lecture-1)

  21. Multi-Core Architecture

  22. Multi-core Architecture • Multiple cores are being integrated on a single chip and made available for general-purpose computing • Higher levels of integration: multiple processing cores, caches, memory controllers, some I/O processing • Network on Chip (NoC)

  23. [figure: processors (P) and memories (M) connected by interconnection networks] • Shared memory: one copy of data shared among multiple cores; synchronization via locking; Intel • Distributed memory: cores access local data; cores exchange data
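
The two styles on this slide can be contrasted in a few lines of code. The sketch below is only an analogy: Python threads stand in for cores, a lock-protected variable stands in for shared memory, and a queue stands in for explicit data exchange. All names are illustrative and nothing here models real hardware.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Shared-memory style: one shared total, updated under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:                  # synchronization via locking
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """Distributed-memory style: workers use local data and send results."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))     # explicit data exchange

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


data = [[1, 2, 3], [4, 5], [6]]
print(shared_memory_sum(data), message_passing_sum(data))
```

Both functions compute the same result; they differ in whether workers coordinate through a shared location or through messages.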

  24. Memory Access Alternatives • Symmetric Multiprocessors (SMP) • Message Passing (MP) • Distributed Shared Memory (DSM)

  25. Network on Chip (NoC) [figure: a traditional bus (control, data, I/O) compared with a switch network]

  26. Shared Memory [figure: three organizations of processors (P), primary caches (PC), secondary caches (SC), and global memory] • Shared primary cache • Shared secondary cache • Shared global memory

  27. General Architecture [figure: a conventional microprocessor (one CPU core with registers, L1 I$, L1 D$, L2 cache, main memory, I/O) compared with multiple cores (CPU core 1 to CPU core N, each with registers, L1 I$, L1 D$, and L2 cache, with main memory and I/O)]

  28. General Architecture (cont) [figure: shared cache (CPU core 1 to CPU core N, each with registers, L1 I$, and L1 D$, sharing one L2 cache, main memory, and I/O) and multithreaded shared cache (the same organization with several register sets per core)]

  29. “Case Studies”

  30. Case Study 1: “IBM’s Cell Processor”

  31. Cell Highlights • Supercomputer on a chip • Multi-core microprocessor (9 cores) • >4 GHz clock frequency • 10X performance for many applications

  32. Key Attributes • Cell is multi-core: contains a 64-bit Power architecture core; contains 8 synergistic processor elements • Cell is a broadband architecture: SPE is a RISC architecture with SIMD organization and local store; 128+ concurrent transactions to memory per processor • Cell is a real-time architecture: resource allocation (for bandwidth measurement); locking caching (via replacement management table) • Cell is a security-enabled architecture: isolate SPE for flexible security programming

  33. Cell Processor Components

  34. Cell BE Processor Block Diagram

  35. POWER Processing Element (PPE) • POWER Processing Unit (PPU) connected to a 512 KB L2 cache • Responsible for running the OS and coordinating the SPEs • Key design goals: maximize the performance/power ratio as well as the performance/area ratio • Dual-issue, in-order processor with dual-thread support • Utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions

  36. Synergistic Processing Elements (SPE) • Dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations • Modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC) • Compute engine with SIMD support and 256 KB of dedicated local storage • The MFC contains a DMA controller with an associated MMU and an Atomic Unit to handle synchronization operations with other SPUs and the PPU

  37. SPE (cont.) • SPEs operate directly on instructions and data from their dedicated local stores • They rely on a channel interface to access main memory and other local stores • The channel interface, which is in the MFC, runs independently of the SPU and can translate addresses and perform DMA transfers while the SPU continues program execution • SIMD support can perform operations on 16 8-bit, 8 16-bit, or 4 32-bit integers, or 4 single-precision floating-point numbers per cycle • At 3.2 GHz, each SPU is capable of up to 51.2 billion 8-bit integer operations per second or 25.6 GFLOPS in single precision
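
The peak figures quoted for the SPU can be reproduced from the register width and clock rate. The sketch below assumes that the single-precision figure counts a fused multiply-add as two floating-point operations per lane; that counting convention is an assumption made here to match the slide's number, and the names are illustrative.

```python
CLOCK_GHZ = 3.2        # SPU clock quoted on the slide
REGISTER_BITS = 128    # SPU register width quoted in the deck


def simd_lanes(element_bits: int) -> int:
    """Number of elements that fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


def peak_giga_ops(element_bits: int, ops_per_lane: int = 1) -> float:
    """Peak operations per second (in billions) for one SPU."""
    return simd_lanes(element_bits) * ops_per_lane * CLOCK_GHZ


# 16 lanes of 8-bit integers, one operation per lane per cycle
print(f"8-bit integer ops: {peak_giga_ops(8):.1f} billion/s")
# 4 single-precision lanes; ASSUMPTION: multiply-add counted as 2 flops
print(f"single precision:  {peak_giga_ops(32, ops_per_lane=2):.1f} GFLOPS")
```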

  38. Four Levels of Parallelism • Blade level → 2 Cell processors per blade • Chip level → 9 cores • Instruction level → dual-issue pipelines on each SPE • Register level → native SIMD on SPE and PPE VMX

  39. Cell Chip Floor plan

  40. Element Interconnect Bus (EIB) • Implemented as a ring • Interconnects 12 elements: 1 PPE with 51.2 GB/s aggregate bandwidth; 8 SPEs, each with 51.2 GB/s aggregate bandwidth; MIC with 25.6 GB/s of memory bandwidth; 2 IOIF with 35 GB/s (out) and 25 GB/s (in) of I/O bandwidth • Supports two transfer modes: DMA between SPEs; MMIO/DMA between PPE and system memory Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007

  41. Element Interconnect Bus (EIB) An EIB consists of the following: • Four 16-byte-wide rings (two in each direction): each ring can handle up to 3 concurrent non-overlapping transfers; supports up to 12 data transfers at a time • A shared command bus: distributes commands; sets up end-to-end transactions; handles coherency • A central data arbiter to connect the 12 Cell elements: implemented in a star-like structure; controls access to the EIB data rings on a per-transaction basis Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007
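
The EIB numbers fit together arithmetically. The sketch below derives the 12 concurrent transfers from the ring count, and reproduces the 51.2 GB/s per-element figure under two assumptions that are not stated on the slides: the bus runs at half of a 3.2 GHz core clock, and each element can send and receive 16 bytes per bus cycle. Names are illustrative.

```python
RINGS = 4                  # four data rings (slide)
TRANSFERS_PER_RING = 3     # concurrent non-overlapping transfers (slide)
RING_WIDTH_BYTES = 16      # 16-byte-wide rings (slide)
BUS_CLOCK_GHZ = 3.2 / 2    # ASSUMPTION: bus at half of a 3.2 GHz core clock


def max_concurrent_transfers() -> int:
    """Upper bound on data transfers in flight across all rings."""
    return RINGS * TRANSFERS_PER_RING


def element_bandwidth_gbs(directions: int = 2) -> float:
    """ASSUMPTION: 16 bytes per bus cycle in each direction per element."""
    return RING_WIDTH_BYTES * BUS_CLOCK_GHZ * directions


print(f"concurrent transfers:     {max_concurrent_transfers()}")
print(f"one direction:            {element_bandwidth_gbs(1):.1f} GB/s")
print(f"aggregate (in plus out):  {element_bandwidth_gbs(2):.1f} GB/s")
```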

  42. Element Interconnect Bus (EIB)

  43. Cell Manufacturing Parameters • About 234 million transistors (compared with 125 million for Pentium 4), running at more than 4.0 GHz • Compared to conventional processors, Cell is fairly large, with a die size of 221 square millimeters • The introductory design is fabricated using a 90 nm silicon-on-insulator (SOI) process • In March 2007 IBM announced that the 65 nm version of Cell BE (Broadband Engine) is in production

  44. Cell Power Consumption • Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz • Including the eight SPEs, the PPE, and other logic, the Cell processor will dissipate close to 15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W at 4 GHz
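
A short calculation shows how much of the chip's power budget goes to the SPEs at each clock rate. The sketch uses only the approximate figures quoted on this slide; the variable and function names are illustrative.

```python
SPE_COUNT = 8
spe_watts = {2: 1.0, 3: 2.0, 4: 4.0}      # GHz -> watts per SPE (slide)
chip_watts = {2: 15.0, 3: 30.0, 4: 60.0}  # GHz -> watts for whole chip (slide)


def spe_share(ghz: int) -> float:
    """Fraction of total chip power consumed by the eight SPEs."""
    return SPE_COUNT * spe_watts[ghz] / chip_watts[ghz]


for ghz in sorted(spe_watts):
    print(f"{ghz} GHz: SPEs {SPE_COUNT * spe_watts[ghz]:.0f} W "
          f"of {chip_watts[ghz]:.0f} W ({spe_share(ghz):.0%})")
```

With these figures the SPEs account for a little over half of the total at every clock rate, and doubling the clock from 2 GHz to 4 GHz quadruples the power.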

  45. Cell Power Management • Dynamic Power Management (DPM) • Five power management states • One linear sensor • Ten digital thermal sensors

  46. Case Study 2: “Intel’s Core 2 Duo”

  47. Intel Core 2 Duo Highlights • Multi-core microprocessor (2 cores) • Clock frequency ranges from 1.5 to 3 GHz • 2X performance for many applications • Dedicated level 1 cache and shared level 2 cache • The shared L2 cache comes in two flavors, 2 MB and 4 MB, depending on the model • Supports 64-bit architecture

  48. Intel Core 2 Duo Block Diagram [figure: dedicated L1$ per core, shared L2$] The two cores exchange data implicitly through the shared level 2 cache

  49. Intel Core 2 Duo Architecture • Reduced front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to system memory • With the shared cache: only one copy needs to be retrieved • Without it: Core 1 had to retrieve the data from Core 2 by going all the way through the FSB and main memory
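
The effect of a shared L2 on front-side bus traffic can be illustrated with a toy model. The sketch below counts only read misses that must leave the chip, and assumes unlimited cache capacity and no coherence traffic. It is a hypothetical illustration, not a model of the actual Core 2 Duo cache, and the function name is invented here.

```python
def fsb_transactions(accesses, shared_l2: bool) -> int:
    """Count reads that miss in L2 and therefore cross the front-side bus.

    accesses: iterable of (core_id, address) read requests.
    shared_l2: True for one L2 shared by all cores, False for one L2 per core.
    """
    caches = {}   # cache id -> set of addresses already held
    fsb = 0
    for core, addr in accesses:
        cache_id = "shared" if shared_l2 else core
        lines = caches.setdefault(cache_id, set())
        if addr not in lines:
            fsb += 1          # miss: fetch over the FSB
            lines.add(addr)
    return fsb


# Both cores read the same address, one after the other.
trace = [(0, 0x100), (1, 0x100)]
print("shared L2: ", fsb_transactions(trace, shared_l2=True))
print("private L2:", fsb_transactions(trace, shared_l2=False))
```

In this model the shared cache fetches the line once and serves the second core on chip, while private caches each fetch their own copy.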
