Introduction

INF5063: Programming heterogeneous multi-core processors Introduction Håkon Kvale Stensland August 22nd, 2014

INF5063

Overview • Course topic and scope • Background for the use of, and parallel processing on, heterogeneous multi-core processors • Examples of heterogeneous architectures • Vector Processing

INF5063: The Course

People • Håkon Kvale Stensland email: haakonks @ ifi • Preben Nenseth Olsen email: prebenno @ ifi • Carsten Griwodz email: griff @ ifi • Professor Pål Halvorsen email: paalh @ ifi

Time and place • Lectures: Friday 09:00 - 16:00, Storstua (Simula Research Laboratory) • Friday August 22nd • Friday September 19th • Friday October 17th • Friday November 21st • Group exercises: The time reserved on IFI’s pages will not be used!

Plan for Today (Session 1) 09:15 – 10:00: Course Introduction 10:00 – 10:15: Break 10:15 – 11:00: Introduction to SSE/AVX 11:00 – 11:15: Break 11:15 – 12:00: Introduction to Video Processing 12:00 – 13:00: Lunch (Will be provided by Simula) 13:00 – 13:45: Using SIMD for Video Processing 13:45 – 14:00: Break 14:00 – 14:45: Codec 63 (c63) Home Exam Precode 14:45 – 15:00: Home Exam 1 is presented

About INF5063: Topic & Scope • Content: The course gives … • … an overview of heterogeneous multi-core processors in general and three variants in particular and a modern general-purpose core (architectures and use) • … an introduction to working with heterogeneous multi-core processors • SSEx / AVXx for x86 • Nvidia’s family of GPUs and the CUDA programming framework • Multiple machines connected with Dolphin PCIe links • … some ideas of how to use/program heterogeneous multi-core processors

About INF5063: Topic & Scope • Tasks: The important part of the course is the lab assignments, where you program each of the three examples of heterogeneous multi-core processors • 3 graded home exams (counting 33% each): • Deliver code and make a demonstration explaining your design and code to the class • On the x86 • Video encoding – Improve the performance of video compression by using SSE instructions. • On the Nvidia graphics cards • Video encoding – Improve the performance of video compression using the GPU architecture. • On the distributed systems • Video encoding – The same as above, but exploit the parallelism on multiple GPUs and computers connected with Dolphin PCIe links. • Students will be working together in groups of two. Try to find a partner during the session today! • Competition at the end of the course! Have the fastest implementation of the code!

Background and Motivation: Moore’s Law

Motivation: Intel View • >billion transistors integrated • 2014: • 7,1 billion - nVIDIA GK110 (Kepler) • 1971: • 2,300 - Intel 4004

Motivation: Intel View • >billion transistors integrated • Clock frequency has not increased since 2006 • 2014 (Still): • 5 GHz – IBM Power6

Motivation: Intel View • >billion transistors integrated • Clock frequency has not increased since 2006 • Power? Heat?

Motivation: Intel View • Soon >billion transistors integrated • Clock frequency can still increase • Future applications will demand TIPS • Power? Heat?

Motivation “Future applications will demand more instructions per second” “Think platform beyond a single processor” “Exploit concurrency at multiple levels” “Power will be the limiter due to complexity and leakage” Distribute workload on multiple cores

Multicores!

Symmetric Multi-Core Processors Intel Haswell

Symmetric Multi-Core Processors AMD “Piledriver”

Symmetric Multi-Core Processors • Good • Growing computational power • Problematic • Growing die sizes • Unused resources • Some cores used much more than others • Many core parts frequently unused • Why not spread the load better? • Heterogeneous Architectures!

x86 – heterogeneous?

Intel Haswell Core Architecture Haswell

Intel Haswell Architecture • THUS, an x86 is definitely parallel and has heterogeneous (internal) cores. • The cores have many (hidden) complicating factors. One of these is out-of-order execution.

Out-of-order execution • Beginning with Pentium Pro, out-of-order execution has improved the micro-architecture design • Execution of an instruction is delayed if the input data is not yet available. • Haswell architecture has 192 instructions in the out-of-order window. • Instructions are split into micro-operations (μ-ops) • ADD EAX, EBX % Add content in EBX to EAX • simple operation which generates only 1 μ-op • ADD EAX, [mem1] % Add content in mem1 to EAX • operation which generates 2 μ-ops: 1) load mem1 into (unnamed) register, 2) add • ADD [mem1], EAX % Add content in EAX to mem1 • operation which generates 3 μ-ops: 1) load mem1 into (unnamed) register, 2) add, 3) write result back to memory

Intel Haswell • The x86 architecture “should” be a nice, easy introduction to more complex heterogeneous architectures later in the course…

Co-Processors • The original IBM PC included a socket for an Intel 8087 floating point co-processor (FPU) • 50-fold speed up of floating point operations • Intel kept the co-processor up to i486 • 486DX contained an optimized i487 block • Still separate pipeline (pipeline flush when starting and ending use) • Communication over an internal bus • Commodore Amiga was one of the earlier machines that used multiple processors • Motorola 680x0 main processor • Blitter (block image transferrer - moving data, fill operations, line drawing, performing boolean operations) • Copper (Co-Processor - change address for video RAM on the fly)

nVIDIA Tegra K1 ARM SoC • One of many multi-core processors for handheld devices • 4 (5?) ARM Cortex-A15 processors • Out-of-order design • 32-bit ARMv7 instruction set • Cache-coherent cores • 2,2 GHz • Several “dedicated” co-processors: • 4K Video Decoder • 4K Video Encoder • Audio Processor • 2x Image Processor • Fully programmable Kepler-family GPU with 192 simple cores.

Intel SCC (Single Chip Cloud Computer) • Prototype processor from Intel’s Tera-scale project • 24 «tiles» connected in a mesh • 1.3 billion transistors • Intel SCC Tile: • Two Intel P54C (Pentium) cores • In-order-execution design • IA32 architecture • NO SIMD units(!!) • No cache coherency • Only a limited number of chips were made, for research.

Intel MIC («Larrabee» / «Knights Ferry» / «Knights Corner») • Introduced in 2009 as a consumer GPU from Intel • Canceled in May 2010 due to “disappointing performance” • Launched in 2013 as a dedicated “accelerator” card (Intel Xeon Phi) • 50+ cores connected with a ring bus • Intel “Larrabee” core: • Based on the Intel P54C Pentium core • In-order-execution design • Multi-threaded (4-way) • 64-bit architecture • 512-bit SIMD unit (LRBni) • 256 kB L2 Cache • Cache coherency • Software rendering

General Purpose Computing on GPU • The • high arithmetic precision • extreme parallel nature • optimized, special-purpose instructions • available resources • … … of the GPU allow general, non-graphics related operations to be performed on the GPU • Generic computing workload is off-loaded from the CPU to the GPU • More generically: Heterogeneous multi-core processing

nVIDIA Kepler Architecture – GK110 • 7,08 billion transistors • 2880 “cores” • 384 bit memory bus (GDDR5) • 336,40 GB/sec memory bandwidth • 5121 GFLOPS single precision performance • PCI Express 3.0 Found in: Tesla K20c, K20x, K40c and K40x GeForce GTX 780, 780 Ti, Titan, Titan Black and Titan Z Quadro K5200 and K6000

nVIDIA GK110 Streaming Multiprocessor (SMX) • Fundamental thread block unit • 192 stream processors (SPs) (scalar ALU for threads) • 64 double-precision ALUs • 32 super function units (SFUs) (cos, sin, log, ...) • 65,536 x 32-bit local register files (RFs) • 16 / 48 kB level 1 cache • 16 / 48 kB shared memory • 48 kB Read-Only Data Cache • 1536 kB global level 2 cache

STI (Sony, Toshiba, IBM) Cell • Motivation for the Cell • Cheap processor • Energy efficient • For games and media processing • Short time-to-market • Conclusion • Use a multi-core chip • Design around an existing, power-efficient design • Add simple cores specific for game and media processing requirements

STI (Sony, Toshiba, IBM) Cell • Cell is a 9-core processor • Power Processing Element (PPE) • Conventional Power processor • Not supposed to perform all operations itself, acting like a controller • Running conventional OS • Synergistic Processing Elements (SPE) • Specialized co-processors for specific types of code, i.e., very high performance vector processors • Local stores • Can do general purpose operations • The PPE can start, stop, interrupt and schedule processes running on an SPE

Vector processing: x86, SSE, AVX

Types of Parallel Processing/Computing? • Bit-level parallelism • 4-bit → 8-bit → 16-bit → 32-bit → 64-bit → … • Instruction level parallelism • classic RISC pipeline (fetch, decode, execute, memory, write back)

Types of Parallel Processing/Computing? • Task parallelism • Different operations are performed concurrently • Task parallelism is achieved when the processors execute different threads (or processes) on the same or different data • Examples?

Types of Parallel Processing/Computing? • Data parallelism • Distribution of data across different parallel computing nodes • Data parallelism is achieved when each processor performs the same task on different pieces of the data • Examples? • When should we not use data parallelism? • Pseudocode: for each element a: perform the same (set of) instruction(s) on a; end
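The data-parallel pattern above can be sketched in plain C. This is only an illustration: the function name `brighten` and the pixel-brightening task are assumptions chosen for the example, not part of the course precode. Every iteration applies the same operation to a different element and no iteration depends on another, which is what makes such a loop a candidate for SIMD or multi-core execution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the course material):
 * add `delta` to every pixel, saturating at 255.
 * Each element is processed independently of all others. */
void brighten(uint8_t *pixels, size_t n, uint8_t delta)
{
    assert(pixels != NULL || n == 0);

    for (size_t i = 0; i < n; ++i) {
        /* Widen before adding so the sum cannot wrap around. */
        unsigned sum = (unsigned)pixels[i] + delta;
        pixels[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

A loop where iteration i needs the result of iteration i-1 (for example a running sum) does not fit this pattern directly, which is one case where plain data parallelism is a poor match.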

Flynn's taxonomy

Vector processors • A vector processor (or array processor) • CPU that implements an instruction set containing instructions that operate on one-dimensional arrays (vectors) • Example systems: • Cray-1 (1976) • 4096-bit registers (64 x 64-bit floats) • IBM • POWER with ViVA (Virtual Vector Architecture) 128-bit registers • Cell – SPE 128-bit registers • SUN • UltraSPARC with VIS 64-bit registers • NEC • SX-6/SX-9 (2008) - Earth simulator 1/2 with 4096-bit registers (used different ways) • up to 512 nodes of 8/16 cores (8192 cores) • each core • has 6 parallel instruction units • shares 72 4096-bit registers

Vector processors People use vector processing in many areas… • Scientific computing • Multimedia Processing (compression, graphics, image processing, …) • Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort) • Lossy Compression (JPEG, MPEG video and audio) • Lossless Compression (Zero removal, RLE, Differencing, LZW) • Cryptography (RSA, DES/IDEA, SHA/MD5) • Speech and handwriting recognition • Operating systems (memcpy, memset, parity, …) • Networking (checksum, …) • Databases (hash/join, data mining, updates) • …

Vector processors • Instruction sets? • MMX • SSEx (several extensions) • AVX • AltiVec • 3DNow! • VIS • MDMX • FMA • …

Vector Instruction sets • MMX • MMX is officially a meaningless initialism trademarked by Intel; unofficially, • MultiMedia eXtension • Multiple Math eXtension • Matrix Math eXtension • Introduced on the “Pentium with MMX Technology” in 1997. • SIMD (Single Instruction, Multiple Data) computation processes multiple data in parallel with a single instruction, resulting in significant performance improvement; MMX gives 2 x 32-bit computations at once. • MMX defined 8 “new” 64-bit integer registers (mm0 ~ mm7), which were aliases for the existing x87 FPU registers – reusing 64 (out of 80) bits in the floating point registers.

Vector Instruction sets • SSE • Streaming SIMD Extensions (SSE) • SSE – Pentium III (1999) • SSE2 – Pentium 4 (Willamette) (2000) • SSE3 – Pentium 4 (Prescott) (2004) • SSE4.1 – Penryn (2007) • SSE4.2 – Nehalem (2008) • SIMD - 4 x 32-bit simultaneous computations (128-bit) • SSE defines 8 new 128-bit registers (xmm0 ~ xmm7) for single-precision floating-point computations. Since each register is 128 bits long, it can hold four 32-bit floating-point numbers.
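As a rough illustration of "4 x 32-bit simultaneous computations", the sketch below adds two float arrays four elements at a time using the SSE intrinsics `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` from `<xmmintrin.h>`. The function name `add_floats` is an assumption made for this example. The code falls back to a scalar loop when the compiler does not define `__SSE__`, so it still builds on other targets.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical helper (not from the course material):
 * dst[i] = a[i] + b[i] for n floats. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    size_t i = 0;
#if defined(__SSE__)
    /* One 128-bit register holds four 32-bit floats, so each
     * _mm_add_ps performs four additions at once. The unaligned
     * load/store variants are used so that the pointers do not
     * need 16-byte alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop for the remaining elements (or for all of them
     * when SSE is not available). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

With n = 5, the first four sums come from a single SIMD addition and the fifth from the scalar tail loop.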

Vector Instruction sets • AVX • Advanced Vector Extensions (AVX) • AVX – Sandy Bridge (2011) • AVX2 – Haswell (2013) • SIMD - 8 x 32-bit simultaneous computations (256-bit) • AVX increases the width of the SIMD registers from 128-bit to 256-bit. It is now possible to store up to 8 x 32-bit floating point numbers. • AVX2 also introduced support for integer operations, vector shifts, gather support and three-operand fused multiply-accumulate (FMA3). • The next version of AVX, called AVX-512, will extend AVX to 512-bit and is scheduled to be launched in 2015 with the next generation of Xeon Phi MIC processor (“Knights Landing”).
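The wider 256-bit registers can be illustrated with a minimal sketch that scales a float array eight elements per instruction, using `_mm256_set1_ps`, `_mm256_loadu_ps`, `_mm256_mul_ps` and `_mm256_storeu_ps` from `<immintrin.h>`. The function name `scale_floats` is an assumption for this example. The AVX path is only compiled when the compiler defines `__AVX__` (for instance with `-mavx` on GCC or Clang); otherwise the scalar loop handles everything.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>   /* AVX intrinsics: __m256, _mm256_mul_ps, ... */
#endif

/* Hypothetical helper (not from the course material):
 * data[i] *= factor for n floats, in place. */
void scale_floats(float *data, size_t n, float factor)
{
    assert(data != NULL || n == 0);

    size_t i = 0;
#if defined(__AVX__)
    /* Broadcast the factor into all eight 32-bit lanes. */
    const __m256 vf = _mm256_set1_ps(factor);

    /* Each iteration multiplies eight floats with one instruction. */
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, vf));
    }
#endif
    /* Scalar loop for the tail, or for everything without AVX. */
    for (; i < n; ++i)
        data[i] *= factor;
}
```

Compared with a 128-bit SSE loop, the loop body has the same shape; only the register type, the intrinsic prefix and the step (8 instead of 4) change.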

The End: Summary • Heterogeneous multi-core processors are already everywhere • Challenge: programming • Need to know the capabilities of the system • Different abilities in different cores • Memory bandwidth • Memory sharing efficiency • Need new methods to program the different components
