Sima Dezső

Multicore/manycore processors-2. Sima Dezső. October 2014. Version 1.4. Overview. 1. The necessity of the emergence of multicore processors. 2. Homogeneous multicore processors. 2.1 Conventional multicore processors. 2.2 Manycore processors. 3. Heterogeneous multicore processors.


Presentation Transcript


  1. Multicore/manycore processors-2 Sima Dezső October 2014 Version 1.4

  2. Overview 1. The necessity of the emergence of multicore processors 2. Homogeneous multicore processors 2.1 Conventional multicore processors 2.2 Manycore processors 3. Heterogeneous multicore processors 3.1 Master/slave-type multicore processors 3.2 Add-on multicore processors 4. Outlook

  3. 3. Heterogeneous multicore processors

  4. 3. Heterogeneous multicore processors (1) Multicore processors: homogeneous multicores (conventional MC processors: 2 ≤ n ≤ 8 cores, desktops and servers, general purpose computing; manycore processors: > 8 cores, experimental systems, prototypes/in production) and heterogeneous multicores (master/slave architectures: MM/3D/HPC, production stage; add-on architectures: HPC, mobiles, production stage). Diagram labels: GPU, CPU, MPC. Figure 3.1: Main classes of multicore processors

  5. 3.1 Heterogeneous master/slave-type multicore processors (MCPs): The Cell processor

  6. 3.1 Heterogeneous master/slave-type MCPs - The Cell (1) Cell BE • Joint product of Sony, IBM and Toshiba • Target: games/multimedia and HPC applications: Playstation 3 (PS3), QS2x Blade server family (2 Cell BE/blade) • History: Summer 2000: basics of the architecture defined; 02/2006: Cell Blade QS20; 08/2007: Cell Blade QS21; 05/2008: Cell Blade QS22

  7. 3.1 Heterogeneous master/slave-type MCPs - The Cell (2) SPE: Synergistic Processing Element; SPU: Synergistic Processor Unit; SXU: Synergistic Execution Unit; LS: Local Store of 256 KB; SMF: Synergistic Mem. Flow Unit; EIB: Element Interface Bus; PPE: Power Processing Element; PPU: Power Processing Unit; PXU: POWER Execution Unit; MIC: Memory Interface Contr.; BIC: Bus Interface Contr.; XDR: Rambus DRAM. Figure 3.1.1: Block diagram of the Cell BE [1]

  8. 3.1 Heterogeneous master/slave-type MCPs - The Cell (3) Figure 3.1.2: The Cell BE die (221 mm2, 234 mtrs) [1]

  9. 3.1 Heterogeneous master/slave-type MCPs - The Cell (4) Figure 3.1.3: The Cell BE die: EIB [1]

  10. 3.1 Heterogeneous master/slave-type MCPs - The Cell (5) Figure 3.1.4: Operating principle of the EIB [1]

  11. 3.1 Heterogeneous master/slave-type MCPs - The Cell (6) Figure 3.1.5: Concurrent transfers on the EIB [1]

  12. 3.1 Heterogeneous master/slave-type MCPs - The Cell (7) Example: running a complex application (digital TV decoding) on the Cell processor [2]

  13. 3.1 Heterogeneous master/slave-type MCPs - The Cell (8) Performance of the Cell and NIK's participation in the Cell performance studies • Performance @ 3.2 GHz: QS21 peak SP FP: 409.6 GFlops (3.2 GHz x 2x8 SPE x 2x4 SP FP/cycle) • Cell BE - NIK: 2007: Faculty Award (Cell 3D app./Teaching); 2008: IBM-NIK Research Cooperation Agreement: performance studies • IBM Böblingen Lab • IBM Austin Lab
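The peak figure on this slide is a plain product of the factors it lists. The sketch below only redoes that arithmetic. The function and constant names are hypothetical, and reading the "2x4" factor as eight single-precision operations per SPE per cycle is an assumption rather than something the slide spells out.

```python
# Recomputes the QS21 peak single-precision figure from the factors on the slide.
# All names are illustrative; only the numbers come from the slide.

CLOCK_GHZ = 3.2               # clock frequency in GHz
CELLS_PER_BLADE = 2           # QS21: 2 Cell BE chips per blade
SPES_PER_CELL = 8             # SPEs per Cell BE chip
SP_FLOPS_PER_CYCLE = 2 * 4    # the slide's "2x4 SP FP/cycle" factor (assumed per SPE)


def peak_gflops(clock_ghz, chips, spes_per_chip, flops_per_cycle):
    """Peak throughput in GFlops as a plain product of the given factors."""
    return clock_ghz * chips * spes_per_chip * flops_per_cycle


qs21_peak = peak_gflops(CLOCK_GHZ, CELLS_PER_BLADE, SPES_PER_CELL, SP_FLOPS_PER_CYCLE)
print(f"QS21 peak SP FP: {qs21_peak:.1f} GFlops")
```

This is a theoretical peak: it assumes every SPE issues its maximum number of floating-point operations in every cycle.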

  14. 3.1 Heterogeneous master/slave-type MCPs - The Cell (9) The Roadrunner. 6/2008: International Supercomputing Conference, Dresden. On the list of the world's 500 fastest computers (Top500): 1. Roadrunner, 1 Petaflops (10^15) sustained performance (Linpack)

  15. 3.1 Heterogeneous master/slave-type MCPs - The Cell (10) Figure 3.1.6: The world's fastest computer: IBM Roadrunner (Los Alamos 2008) [3]

  16. 3.1 Heterogeneous master/slave-type MCPs - The Cell (11) Figure 3.1.7: Main features of the Roadrunner [1]

  17. 3.2 Heterogeneous add-on multicore processors

  18. 3.2 Heterogeneous add-on multicore processors (1) Multicore processors: homogeneous multicores (conventional MC processors: 2 ≤ n ≤ 8 cores, desktops and servers, general purpose computing; manycore processors: > 8 cores, experimental systems, prototypes/in production) and heterogeneous multicores (master/slave architectures: MM/3D/HPC, production stage; add-on architectures: HPC, mobiles, production stage). Diagram labels: GPU, CPU, MPC. Figure 3.2.1: Main features of multicore processors

  19. 3.2 Heterogeneous add-on multicore processors (2) Principle of add-on execution in the case of GPGPUs (assuming the simplest (batched) organization) [4] Host / Device: kernel0<<<>>>() (data-parallel program), kernel1<<<>>>()
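The batched host/device flow on this slide can be modeled in a few lines. The sketch below is plain Python, not GPU code: the `Device` class merely stands in for an add-on accelerator with its own memory, and all names (`Device`, `launch`, `kernel0`, `kernel1`, `run_on_host`) are hypothetical. It assumes the host launches the kernels one after another and that each kernel processes every data element independently.

```python
# Plain-Python model of batched host/device execution (NOT real GPU code).
# Every name here is hypothetical and chosen only for illustration.


class Device:
    """Stands in for an add-on accelerator with its own memory."""

    def __init__(self):
        self.memory = {}

    def copy_in(self, name, data):
        # Host -> device transfer of the input data.
        self.memory[name] = list(data)

    def launch(self, kernel, name):
        # The kernel is applied to every element independently (data parallel).
        # In this model the "threads" simply run one after another.
        self.memory[name] = [kernel(x) for x in self.memory[name]]

    def copy_out(self, name):
        # Device -> host transfer of the results.
        return list(self.memory[name])


def kernel0(x):
    return x * 2


def kernel1(x):
    return x + 1


def run_on_host(data):
    device = Device()
    device.copy_in("buf", data)      # host prepares the input
    device.launch(kernel0, "buf")    # plays the role of kernel0<<<>>>()
    # serial host code could run here between the two launches
    device.launch(kernel1, "buf")    # plays the role of kernel1<<<>>>()
    return device.copy_out("buf")


result = run_on_host([1, 2, 3])
print(result)
```

The point of the model is the division of labor: the host drives the control flow and the transfers, while the device only executes the data-parallel kernels it is handed.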

  20. 3.2 Heterogeneous add-on multicore processors (3) Remark on the operating principle • Heterogeneous add-on multicore processors are processing accelerators • In terms of the operating principle, their precursors are heterogeneous add-on coprocessor systems. Examples: early personal computers with floating-point coprocessors: Intel 286 + 287, 386 + 387. The Intel 486 already had its own on-chip floating-point unit (FPU) (except the SX and SL models)

  21. 3.2 Heterogeneous add-on multicore processors (4) Most important implementations of heterogeneous add-on multicore processors: smartphones/tablets; integrated graphics

  22. 3.2.1 The emergence of integrated graphics

  23. 3.2.1 The emergence of integrated graphics (1) Switching to English-language slides

  24. 3.2.1 The emergence of integrated graphics (2) Implementation of integrated graphics: (a) in the north bridge: implementations about 1999 - 2009; (b) in a multi-chip processor package on a separate die (both the CPU and the GPU are on separate dies and are mounted into a single package): Intel’s Havendale (DT) and Auburndale (M) (scheduled for 1H/2009 but cancelled), Clarkdale (DT, 1/2010) and Arrandale (M, 1/2010); (c) on the processor die: Intel’s Sandy Bridge (1/2011) and Ivy Bridge (4/2012) etc., AMD’s Swift (scheduled for 2009 but cancelled), AMD’s Bobcat-based APUs (M, 1/2011), Llano APUs (DT, 6/2011), Trinity APUs (DT, Q4/2012) etc. [Diagram labels: P, CPU, GPU, IG, NB, Mem., Periph. Contr., South Bridge]

  25. 3.2.1 The emergence of integrated graphics (3) Implementation of integrated graphics: (a) in the north bridge: implementations about 1999 - 2009; (b) in a multi-chip processor package on a separate die (both the CPU and the GPU are on separate dies and are mounted into a single package): Intel’s 2. gen. Nehalem based Havendale (DT) and Auburndale (M) (scheduled for 1H/2009 but cancelled), Westmere based Clarkdale (DT, 1/2010) and Arrandale (M, 1/2010); (c) on the processor die: Intel’s Sandy Bridge (1/2011), Ivy Bridge (4/2012) etc., AMD’s Swift (scheduled for 2009 but cancelled), AMD’s Bobcat-based APUs (M, 1/2011), Llano APUs (DT, 6/2011), Trinity APUs (DT, Q4/2012) etc. [Diagram labels: P, CPU, GPU, IG, NB, Mem., Periph. Contr., South Bridge]

  26. 3.2.1 The emergence of integrated graphics (4) Example 1: Intel’s Havendale (DT) and Auburndale (M) multi-chip CPU/GPU processor plans [5] • Revealed in 9/2007. • Scheduled for 1H/2009 but cancelled about 1/2009. • Both parts were based on the 2. gen. Nehalem (Lynnfield) architecture (45 nm). [Figure: Havendale processor (multi-chip package, MCP, with cores, IMC, DDR3, PCI-E and a GPU/graphics die) compared with the Lynnfield processor (monolithic die, no integrated graphics), both on the same LGA 1160 platform with the Ibexpeak PCH. Processor-to-PCH links: Display Link and DMI (Direct Media Interface, 4 PCIe lanes). PCH functions: display (analog: VGA; digital: SDVO, HDMI, Display Port, DVI) and I/O functions (PCIe, SATA, NVRAM, etc.).] Schedule: • 2H ’08: first samples • 1H ’09: production • TDP < 95 W

  27. 3.2.1 Az Integrált grafika megjelenése (5) Example 2: Intel’s Westmere-EP based multi-chip CPU/GPU processors (2010)-1 [6] Clarkdale (desktop) Arrandale (mobile) The CPU and the GMA chips are connected by the QPI bus.

  28. 3.2.1 Az Integrált grafika megjelenése (6) Positioning of Clarkdale (DT) and Arrandale (M) in Intel’s roadmap [7]

  29. 3.2.1 The emergence of integrated graphics (7) Single-chip “chipset”, called PCH (Platform Controller Hub), for Intel’s Westmere-EP based multi-chip CPU/GPU processors (2010) [7]

  30. 3.2.1 The emergence of integrated graphics (8a) Moving the memory controller (MC) from the north bridge into the processor (IMC) [7] (dedicated graphics via graphics card)

  31. 3.2.1 The emergence of integrated graphics (8) Moving integrated graphics (IGFX) from the north bridge into the processor, on an extra die [7] (dedicated graphics via graphics card)

  32. 3.2.1 The emergence of integrated graphics (8b) Connecting discrete graphics directly to the processor instead of the north bridge, via PCIe 2.0 [7] (dedicated graphics via graphics card)

  33. 3.2.1 The emergence of integrated graphics (9) Implementation of commercial graphics on the processor die. Implementation of integrated graphics: (a) in the north bridge: implementations around 1999 - 2009; (b) in a multi-chip processor package on a separate die (both the CPU and the GPU are on separate dies and are mounted into a single package): Intel’s Havendale (DT) and Auburndale (M) (scheduled for 1H/2009 but cancelled), Clarkdale (DT, 1/2010) and Arrandale (M, 1/2010); (c) on the processor die: Intel’s Sandy Bridge (1/2011) and Ivy Bridge (4/2012) etc., AMD’s Swift (scheduled for 2009), AMD’s Bobcat-based APUs (M, 1/2011), Llano APUs (DT, 6/2011), Trinity APUs (DT, Q4/2012) etc. [Diagram labels: P, CPU, GPU, IG, NB, Mem., Periph. Contr., South Bridge]

  34. 3.2.2 Intel’s Sandy Bridge

  35. 3.2.2 Intel’s Sandy Bridge (1) Key microarchitecture features of the Sandy Bridge vs the Nehalem [8]

  36. 3.2.2 Intel’s Sandy Bridge (2) Die plot of the 4C Sandy Bridge processor [9]. Sandy Bridge 4C: 32 nm, 995 mtrs/216 mm2, ¼ MB L2/core, 8 MB L3. [Die plot labels: 256 KB L2 (9 clk); 32K L1D (3 clk); Hyperthreading; AES Instr.; VMX Unrestrict.; 20 mm2 / core; AVX 256 bit, 4 operands; @ 1.0 1.4 GHz; (to L3 connected) (25 clk); PCIe 2.0; 256 b/cycle ring architecture; DDR3-1600]

  37. 3.2.2 Intel’s Sandy Bridge (3) Block diagram of Intel’s Sandy Bridge with 6 Series PCH [10]. Core i3-21xx, 2C, 2/2011; Core i5-23xx/24xx/25xx, 4C, 1/2011; Core i7-26xx, 4C, 1/2011; Intel 6 Series PCH (except P67, which does not provide a display controller in the PCH)

  38. 3.2.2 Intel’s Sandy Bridge (4) Graphics performance increase of subsequent Core generations [33]: Sandy Bridge, Ivy Bridge, Haswell

  39. 3.2.3 AMD’s Swift Fusion APU plan

  40. 3.2.3 AMD’s Swift Fusion APU plan (1) Preliminaries: In 10/2006 AMD acquired the graphics firm ATI, and on the same day they announced that “AMD plans to create a new class of x86 processors that integrate the central processing unit (CPU) and graphics processing unit (GPU) at the silicon level, codenamed “Fusion.”” [13] Remark: Although in the above statement AMD designated the silicon-level integration of the CPU and GPU as the Fusion initiative, in some other publications they call both the package-level and the silicon-level integration of the CPU and GPU the Fusion technology, as shown in the next figure [14]

  41. 3.2.3 AMD’s Swift Fusion APU plan (2) Extended interpretation of the term Fusion technology in some AMD publications [14]. Despite this ambiguity, AMD subsequently used the term Fusion mostly for the silicon-level integration of the CPU and the GPU.

  42. 3.2.3 AMD’s Swift Fusion APU plan (3) • In 12/2007, at their Financial Analyst Day, AMD coined a new term by designating their processors implementing the Fusion concept as APUs (Accelerated Processing Units). • At the same time AMD also announced their first APU family, called the Swift family [15].

  43. 3.2.3 AMD’s Swift Fusion APU plan (4) • In 11/2008, again at their Financial Analyst Day, AMD postponed the introduction of Fusion-based APU processors until the company transitions to the 32 nm technology [16] [17]. No Swift APU!

  44. 3.2.3 AMD’s Swift Fusion APU plan (5) Remark: This move is similar to the one Intel made with their 45 nm Havendale (DT) and Auburndale (M) in-package integrated multi-chip CPU+GPU projects. As leaked from industry sources in 1/2009, Intel cancelled their 45 nm multi-chip processor plans in favor of 32 nm multi-chip processors to be introduced in Q1/2010 [18].

  45. 3.2.4 AMD’s K12 (Llano)-based APU lines

  46. 3.2.4 AMD’s K12 (Llano)-based APU lines-1 Overview of AMD’s desktop and mobile APU lines-1 (based on [37]) [Figure labels: K10 / K10.5 / Family 11h lines; GPU-less Family 15h lines; Family 12h – 16h APU (Fusion) lines; Hound (K10.5/Stars), Fam. 12h; Family 11h; Bobcat, Fam. 14h: Brazos (Zacate), 1-2 cores, 1 MB, DX11 GPU core, DDR3; Brazos (Desna), 2 cores, 1 MB, DX11 GPU core, DDR3, tablet; Jaguar, Fam. 16h; Fam. 15h Mod. 00h-0Fh; Fam. 15h Mod. 10h-1Fh; Steamroller, Fam. 15h Mod. 30h-3Fh, 28 nm]

  47. 3.2.4 AMD’s K12 (Llano)-based APU lines-2 Overview of AMD’s notebook and tablet APU lines-2 [Figure: turbo control schemes by controlling quantity: PWc; basically Tdie c, tuned by Tdie m; basically PWc, tuned by Tdie m; Tdie c.] Rules based on PWc: PWc ≤ TDP: turbo mode; PWc > TDP: decrease fc. Rules based on Tdie c: Tdie c ≤ Tdie c max: turbo mode; Tdie c > Tdie c max: decrease fc. Tuning by Tdie m: if Tdie m < Tdie m max, increase fc additionally up to fc max, as long as Tdie m ≤ Tdie m max. Implementations shown: Intel’s Turbo Boost in Nehalem (2008); Intel’s Turbo Boost 2.0 in Sandy Bridge (2011), Ivy Bridge (2012), Haswell (2013); Westmere based Arrandale M (2010); AMD’s Hybrid Boost in K15 Piledriver based Richland (2013); AMD’s Turbo Core 2.0 in K12 Llano APU (2011), K16 Jaguar based Kabini/Temash (2013); AMD’s Turbo Core 3.0 in K15 Piledriver based Trinity (2012). Legend: PWc: calculated power consumption; PWm: measured power consumption; Tdie c: calculated die temperature; Tdie m: measured die temperature. Source: http://www.anandtech.com/show/7974/amd-beema-mullins-architecture-a10-micro-6700t-performance-preview
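The "basically PWc, tuned by Tdie m" rule on this slide can be written as a small decision function. The sketch below reads the rule as: stay in turbo mode while the calculated power does not exceed the TDP, otherwise lower the core frequency, and allow extra headroom up to fc max only while the measured die temperature is below its limit. The function name, the MHz units, the fixed step size and the `fc_base`/`fc_turbo` limits are assumptions made for illustration; this is not Intel's or AMD's actual control algorithm.

```python
# Sketch of a power-based turbo rule tuned by measured die temperature.
# All parameter names, units (MHz, W, deg C) and the fixed step are hypothetical.


def next_core_frequency(fc, pw_c, tdp, tdie_m, tdie_m_max,
                        fc_base, fc_turbo, fc_max, step=100):
    """Return the core frequency (MHz) for the next control interval."""
    if pw_c > tdp:
        # Calculated power exceeds the TDP: decrease fc, but not below base.
        return max(fc_base, fc - step)
    # PWc <= TDP: turbo mode. The measured die temperature decides
    # whether the additional headroom up to fc_max may be used.
    ceiling = fc_max if tdie_m < tdie_m_max else fc_turbo
    return min(ceiling, fc + step)


limits = dict(tdp=65, tdie_m_max=90, fc_base=2600, fc_turbo=3200, fc_max=3400)

within_budget = next_core_frequency(3000, pw_c=60, tdie_m=70, **limits)
over_budget = next_core_frequency(3000, pw_c=70, tdie_m=70, **limits)
cool_die = next_core_frequency(3200, pw_c=60, tdie_m=70, **limits)
hot_die = next_core_frequency(3200, pw_c=60, tdie_m=90, **limits)
print(within_budget, over_budget, cool_die, hot_die)
```

The Tdie c based variants follow the same shape, with the calculated die temperature and its limit taking the place of PWc and TDP in the first test.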

  48. 3.2.4 AMD’s K12 (Llano)-based APU lines (3) AMD’s Llano-based APU lines [19] • Introduced: 6/2011. • The Llano line belongs to the Fusion APU (Accelerated Processing Unit) series, as it includes, besides a number of CPU cores, a GPU to accelerate vision computing (graphics and media). • Processors of the Llano lines have up to 4 CPU cores and a GPU. Nevertheless, AMD also sells Llano-based desktop lines with disabled GPUs. These lines are branded as Athlon II X4/X2 or Sempron lines. • 32 nm technology, 228 mm2, 1450 mtrs.

  49. 3.2.4 AMD’s K12 (Llano)-based APU lines (4) Die plot of the Llano processor [20]

  50. 3.2.4 AMD’s K12 (Llano)-based APU lines (5) Example: AMD’s Llano-based A-series mobile lines [21]
