High-Performance Computing from Smart Phone, Multi-core CPU to Graphics Processing Unit
Jih-Kwon Peir, 裴季鯤, University of Florida • June 11, 2014
Outline
• Organization of a PC system
• Multi-core Processor Organization
  • Intel cores
• Processors in Smart Phones
  • iPhone, Galaxy, HTC
• Processors in Game Consoles
  • PS4, Xbox One
• Graphics Processing Unit (GPU)
  • Nvidia, AMD
06/11/2014 SWUFE
Basic PC Organization
[Figure: basic PC organization; label: Processor]
Today’s Typical PC with GPU
[Figure: typical PC; labels: CPU (host), GPU w/ local DRAM (device)]
Processor Is Everywhere
• Desktop, Laptop: Intel, AMD, others
• Graphics Processing Unit (GPU): Nvidia, AMD, etc.
• Smart Phone: iPhone, Galaxy, HTC, etc.
• Tablet: iPad, Android tablets
• Game Console: PS4, Xbox One
• Clusters, Warehouse-scale Server: Google search engine, cloud computing, data center servers
• Embedded Systems: ARM, MIPS, others
What is Computer Architecture (Organization)?
• Functional operation of the individual HW units within a computer system, and the flow of information and control among them.
[Figure: Computer Architecture (Interface Design (ISA), Hardware Organization) surrounded by Technology, Parallelism, Programming Language, OS, Applications, and Measurement & Evaluation]
Moving to Multicore (CMP)
• Old CW (conventional wisdom): uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = new Brick Wall
  • Uniprocessor performance now 2X / 5(?) yrs
• Sea change in chip design: multiple “cores” (2X processors per chip / ~2 years)
  • Simpler processors, more power efficient
  • Exploit TLP and DLP, not ILP
  • How to use it: programmer / compiler involvement
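To put the two doubling rates on a common footing, the short Python sketch below converts a doubling period into an annual growth rate. The 5-year period is the slide's own rough estimate (it carries a question mark), and the function name `annual_growth` is only illustrative.

```python
def annual_growth(doubling_years: float) -> float:
    """Annual growth factor for performance that doubles every `doubling_years`."""
    return 2.0 ** (1.0 / doubling_years)


old_cw = annual_growth(1.5)  # old conventional wisdom: 2X every 1.5 years
new_cw = annual_growth(5.0)  # new conventional wisdom: 2X every ~5 years (rough estimate)

print(f"2X / 1.5 yrs -> about {100 * (old_cw - 1):.0f}% per year")
print(f"2X / 5 yrs   -> about {100 * (new_cw - 1):.0f}% per year")

# Cumulative speedup over one decade under each assumption
decade_old = old_cw ** 10
decade_new = new_cw ** 10
print(f"Over 10 years: about {decade_old:.0f}X vs. {decade_new:.0f}X")
```

Under these assumptions a single core improves by roughly 59% per year in the old regime and roughly 15% per year in the new one, which is why the remaining gains have to come from more cores and from parallel software.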
Uniprocessor Performance
• Constrained by power, instruction-level parallelism, and memory latency
Intel Processor Architecture
• Intel processor architecture and technology map (2012)
  • Nehalem – 2008, 45nm, quad cores, 3 mem channels, early i7
  • Westmere – 2010, 32nm, process-technology upgrade of Nehalem
  • Sandy Bridge – 2011, 32nm, new microarchitecture
  • Sandy Bridge-E – launched Nov. 2011, 32nm, 6 cores, X79 platform, LGA2011 socket, 4 mem channels, 51.2 GB/sec
  • Ivy Bridge – 22nm, Mar/April 2012
  • Ivy Bridge-E – 22nm, targeted for 4Q 2012; will be compatible with today's Intel X79 platform and LGA2011 socket
• Intel has used the "tick-tock" method of processor design for several generations
  • The "tock" is a new microarchitecture
  • The "tick" is an upgraded process technology
[Figure: tick-tock roadmap; labels: Ivy Bridge-E 22nm, Ivy Bridge 22nm, TICK, TOCK, Haswell, Broadwell]
Intel Processor Architecture – Nehalem
• Dramatic architecture change: removal of the front-side bus
• Introduced in 2008 with a new processor, new CPU socket, new memory architecture, new chipset, new motherboards, and new overclocking methods
• On-die QPI links, three DDR channels, and a large (8MB) L3 cache; used in Intel Core i7, configured with 1-8 cores
• New SSE 4.2, better branch prediction, prefetch, SMT (2 threads/core)
• QPI has a bandwidth of 12.8GB/s in each direction simultaneously, for a combined bi-directional bandwidth of 25.6GB/s; it handles multiple PCI-E links through the 5520 IOH and allows flexible configurations
• Three DDR channels give higher memory bandwidth
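The QPI figures can be reproduced with a few lines of arithmetic. This is only a sketch: the 6.4 GT/s transfer rate and the 16 data bits per direction are assumptions taken from commonly published QPI specifications, not from the slide.

```python
# Assumptions (not stated on the slide): a full-width QPI link running at
# 6.4 giga-transfers per second, carrying 16 data bits (2 bytes) per
# transfer in each direction.
transfers_per_second = 6.4e9
data_bytes_per_transfer = 2

one_way_gbs = transfers_per_second * data_bytes_per_transfer / 1e9
bidirectional_gbs = 2 * one_way_gbs  # both directions are active simultaneously

print(f"per direction: {one_way_gbs:.1f} GB/s")
print(f"combined:      {bidirectional_gbs:.1f} GB/s")
```

With those assumed inputs the result matches the 12.8GB/s per direction and 25.6GB/s combined quoted for Nehalem.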
Intel Core i7 – Sandy Bridge-E
• 6 cores
• Large L3: 15MB
• 4 mem. channels: 51.2GB/sec
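The roughly 51GB/sec memory bandwidth follows from the channel count. The sketch below assumes DDR3-1600 memory (1600 mega-transfers per second) on 64-bit channels; neither value is stated on the slide, and the helper name `dram_bandwidth_gbs` is hypothetical.

```python
def dram_bandwidth_gbs(channels: int, mega_transfers_per_s: float, bus_bits: int = 64) -> float:
    """Peak DRAM bandwidth in GB/s: channels x transfers/s x bytes per transfer."""
    bytes_per_transfer = bus_bits // 8
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


# Assumption: DDR3-1600 on every channel
quad_channel = dram_bandwidth_gbs(channels=4, mega_transfers_per_s=1600)
triple_channel = dram_bandwidth_gbs(channels=3, mega_transfers_per_s=1600)

print(f"4 channels: {quad_channel:.1f} GB/s")
print(f"3 channels: {triple_channel:.1f} GB/s (same memory speed, for comparison)")
```

This is a peak figure; sustained bandwidth in real applications is lower.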
Intel Devil’s Canyon (Haswell)
• Intel's first CPU with a 4GHz base clock, shown together with the 20th-anniversary Pentium processor at Computex (6/3/14)
• 4th-generation Core processors
Intel Devil’s Canyon Family
Early iPhone
• iPhone 3G: Samsung ARM11 processor running at 412 MHz
• iPhone 3GS: Samsung ARM Cortex-A8, 600 MHz
• iPhone 4: Apple A4 (S5L8930), 750-800 MHz (ARM-based ISA)
• iPhone 4S (and iPad 2): Apple A5 (S5L8940), 1 GHz, dual-core, SoC
• iPad 3: Apple A5X (S5L8945), 1 GHz, dual-core, SoC
• iPhone 5: Apple A6 (S5L8950), 1GB RAM, SoC (with an SGX543 GPU variant); clock speed and core count not disclosed at announcement (9/12/2012)
iPhone 6
[Figure, left to right: iPhone 3G, iPhone 4, iPhone 5, iPhone 6 mockup (4.7”; a 5.5” model is expected later), Retina iPad mini]
• iPhone 6: 64-bit, 20-nanometer A8 chip made by TSMC (a departure from Samsung)
• The A8 is rumored to have both a quad-core 64-bit processor and quad-core graphics; it may come with 2GB of RAM and 16, 32, 64, or a whopping 128GB of flash storage, running iOS 8
• Series 6XT PowerVR GPUs offer a 50% benchmark performance increase over previous chips, good for gaming
• Camera: still debated, but expected to have a higher pixel count than current iPhones or iPads. Primary > 8 megapixels, secondary > 1.2 megapixels
• LOOK OUT: iPhone 7 is coming with an A9 processor!!
Galaxy S5 (May 2014) vs. iPhone 5S (Sep. 2013)
• Processor: Galaxy S5 – quad-core 2.5 GHz 32-bit Krait 400 processor, Qualcomm MSM8974AC Snapdragon 801 chipset, 2GB RAM, 16-32GB flash; iPhone 5S – dual-core 64-bit Apple A7 at 1.3 GHz, 1GB RAM, 16-64GB flash
• OS: Android 4.4.2 vs. iOS 7 (more efficient)
• Camera: Galaxy S5 – 16-megapixel ISOCELL sensor, 2MP front camera; iPhone 5S – 8-megapixel, 1.2MP front camera
• Dimensions: Galaxy S5 – 142 x 72.5 x 8.1mm, 145g; iPhone 5S – 123.8 x 58.6 x 7.6mm, 112g
• Screen: Galaxy S5 – Super AMOLED, 5.1-inch, 1080p resolution; iPhone 5S – IPS LCD, 4-inch, 1,136 x 640 resolution
Qualcomm Snapdragon 805
• Snapdragon 801 is used in the Galaxy S5, HTC One M8, and Sony Xperia Z2
Near Field Communication (NFC)
• NFC is a standard that lets smartphones and similar devices establish radio communication by touching them together or bringing them within a few inches of each other; used in Android and Windows Phone devices (not in iPhone yet).
• NFC standards cover communication protocols and data exchange formats, and are based on existing radio-frequency identification (RFID) standards. Lower power and shorter range than Bluetooth.
• Commerce: contactless payment systems, e.g. Google Wallet, similar to credit cards and other smartcards.
• The Intel Core processor has a built-in NXP PN544PC NFC RFID reader chip, which captures the credit card's ID number and transmits it, encrypted, to the merchant and to MasterCard's MasterPass e-wallet.
• Debate: 'mobile wallet' via NFC versus PayPal's 'digital wallet'. PayPal promotes a 'digital wallet' in the cloud that works not only on mobile phones but on a variety of devices: laptop, iPad, ultrabook, or Xbox.
• Communication: Android Beam (Jelly Bean) and S Beam use NFC to set up the connection, and can also connect over Bluetooth and Wi-Fi.
• Social Networking: sharing contacts, photos, videos, and files, and entering multiplayer mobile games.
Gaming Console: Xbox One vs. PS4
• CPU: both use an AMD 8-core Jaguar CPU; the Xbox One runs at 1.75GHz, the PS4 at 1.6GHz
• GPU and RAM: Xbox One – comparable to AMD Radeon HD 7000-series, 8GB DDR3 RAM and 32MB eSRAM; PS4 – comparable to Radeon HD 7000-series, 8GB GDDR5 RAM
• Memory Bandwidth: the PS4 has 176GB/second, versus 68GB/second for the Xbox One. GDDR5 is currently available only in 512MB chips, so the console needs a whopping 16 of them.
• Performance: the PS4 performs 50% better due to its GDDR5 RAM (the CPU does not matter much). Microsoft increased the GPU speed from 800MHz to 853MHz, but the effect is limited.
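The numbers on this slide can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
# Figures quoted on the slide
ps4_bandwidth_gbs = 176
xbox_bandwidth_gbs = 68
total_ram_mb = 8 * 1024      # 8GB of GDDR5
chip_capacity_mb = 512       # capacity of one GDDR5 chip
gpu_clock_old_mhz = 800
gpu_clock_new_mhz = 853

chips_needed = total_ram_mb // chip_capacity_mb
bandwidth_ratio = ps4_bandwidth_gbs / xbox_bandwidth_gbs
clock_gain_percent = 100 * (gpu_clock_new_mhz / gpu_clock_old_mhz - 1)

print(f"GDDR5 chips needed for 8GB: {chips_needed}")
print(f"PS4 / Xbox One main-memory bandwidth: {bandwidth_ratio:.2f}x")
print(f"Xbox One GPU clock increase: {clock_gain_percent:.1f}%")
```

The clock bump works out to under 7%, which is consistent with the remark that its effect is limited next to a main-memory bandwidth gap of about 2.6x.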
AMD Radeon HD 7000-series
Comparison: GPU vs Multicore CPU
• Difference in utilizing on-chip transistors:
  • CPU has significant cache space and control logic for general-purpose applications
  • GPU builds a large number of replicated cores for data-parallel, thread-parallel computations
• New APU chip combines both!
• AMD Radeon HD 7000-series
Nvidia Fermi Graphics Processors - GTX580
• 16 SMs, 32 cores / SM
• 512 cores
• 3 billion transistors
• 768 KB shared L2 cache for SMs (new)
• 6 DRAM channels
• Host interface
• GigaThread scheduler
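The core count follows directly from the SM organization, and a rough peak single-precision throughput can be estimated from it. In the sketch below the 1.544 GHz shader clock is an assumption (the GTX580 reference clock, not given on the slide), as is counting one fused multiply-add as two floating-point operations per core per cycle.

```python
sms = 16
cores_per_sm = 32
total_cores = sms * cores_per_sm

shader_clock_ghz = 1.544        # assumption: GTX580 reference shader clock
flops_per_core_per_cycle = 2    # assumption: one fused multiply-add = 2 operations

peak_gflops = total_cores * flops_per_core_per_cycle * shader_clock_ghz

print(f"total cores: {total_cores}")
print(f"estimated peak single precision: {peak_gflops:.0f} GFLOPS")
```

This is a theoretical peak under the stated assumptions; real kernels reach only a fraction of it.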
Streaming Multiprocessor in Fermi
• Two warp schedulers and dispatch units
• 32K 32-bit registers (128KB register file)
• Shared instruction cache
• 64 KB shared L1 data cache and shared (local) memory
• Two sets of ALUs, 16 cores each (2 cycle initiation latency)
• One set of LD/ST, 16 units (2 cycle latency)
• Four SFUs (8 cycle latency)
• Separate INT, FP units in each core
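One way to read the cycle counts on this slide is as the number of cycles needed to push one 32-thread warp through a group of execution units. The sketch below assumes the standard CUDA warp size of 32 (not stated on the slide); the helper `issue_cycles` is hypothetical.

```python
WARP_SIZE = 32  # assumption: threads per warp in CUDA


def issue_cycles(units: int, warp_size: int = WARP_SIZE) -> int:
    """Cycles needed to issue one warp instruction across `units` parallel lanes."""
    return -(-warp_size // units)  # ceiling division


units_per_group = {
    "ALU set (16 cores)": 16,
    "LD/ST units": 16,
    "SFUs": 4,
}

cycles = {name: issue_cycles(count) for name, count in units_per_group.items()}
for name, count in cycles.items():
    print(f"{name}: {count} cycles per warp instruction")
```

Under that assumption the 16-wide groups take 2 cycles per warp and the four SFUs take 8, matching the latencies listed above.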
CUDA Programming Model
• Host invokes kernels (grids) to execute on the GPU, then control returns to the host
• Three-level parallelism: Grid, Block, Thread
[Figure: application timeline alternating host execution and kernel launches; kernel 0 runs as a grid of Block 0, Block 1, Block 2, Block 3; kernel 1 runs as a grid of Block 0 ... Block N; each block contains threads]
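The grid / block / thread hierarchy can be illustrated without a GPU. The following is a pure-Python emulation, not CUDA code: the names `launch` and `vector_add_kernel` are hypothetical, and the loops run sequentially, whereas a real GPU runs the threads in parallel.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D kernel launch: call `kernel` once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    # Same index arithmetic as blockIdx.x * blockDim.x + threadIdx.x in CUDA
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # guard: the last block may have surplus threads
        out[i] = a[i] + b[i]


n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n

block_dim = 4                                 # threads per block
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover n elements

# The "host" launches the kernel over the grid, then continues with the result
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

Each emulated thread handles one array element, and the bounds check covers the case where the number of elements is not a multiple of the block size.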
Intel Clusters – Xeon Phi (Knights Corner) 60
Summary
• High-performance, low-power, multi-core processors are everywhere
  • Smartphone, tablet, PC, game console, graphics, communication, security, data-center cluster, supercomputer, etc.
  • Personal electronic devices, home appliances, cars and transportation vehicles, medical equipment, e-commerce, e-banking, security, cloud computing, etc.
• It is important to have a basic understanding of processors!
Thank You! Questions?