Explore the power, bandwidth, and design trade-offs of GPUs compared to CPUs, and the challenges and solutions in computer architecture for better performance and efficiency. Study the benefits of parallel processing on GPUs and the impact of different memory types. Learn about data-processing techniques, the memory wall, and the evolution of chip design.
Why GPUs? Robert Strzodka
Overview • Computation / Bandwidth / Power • CPU – GPU Comparison • GPU Characteristics
Data Processing in General • Challenges: lack of parallelism, memory wall • [Diagram: Processor reading IN and writing OUT data through memory]
Old and New Wisdom in Computer Architecture • Old: Power is free, Transistors are expensive • New: “Power wall”, Power expensive, Transistors free (Can put more transistors on chip than can afford to turn on) • Old: Multiplies are slow, Memory access is fast • New: “Memory wall”, Multiplies fast, Memory slow (200 clocks to DRAM memory, 4 clocks for FP multiply) • Old: Increasing Instruction Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, …) • New: “ILP wall”, diminishing returns on more ILP HW (Explicit thread and data parallelism must be exploited) • New: Power Wall + Memory Wall + ILP Wall = Brick Wall slide courtesy of Christos Kozyrakis
Uniprocessor Performance (SPECint) • [Chart: from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; single-processor performance growth slows, leaving a ~3X gap versus the earlier trend] • Sea change in chip design: multiple “cores” or processors per chip slide courtesy of Christos Kozyrakis
Instruction-Stream-Based Processing • [Diagram: the processor fetches an instruction stream from memory; data moves between memory and processor through a cache]
Instruction- and Data-Streams • Addition of 2D arrays: C = A + B • Instruction stream processing data: for(y=0; y<HEIGHT; y++) for(x=0; x<WIDTH; x++) { C[y][x]= A[y][x]+B[y][x]; } • Data streams undergoing a kernel operation: inputStreams(A,B); outputStream(C); kernelProgram(OP_ADD); processStreams();
Data-Stream-Based Processing • [Diagram: data streams from memory through configured processor pipelines and back to memory; a configuration, not an instruction stream, drives the pipelines]
Architectures: Data – Processor Locality • Field Programmable Gate Array (FPGA) • Compute by configuring Boolean functions and local memory • Processor Array / Multi-core Processor • Assemble many (simple) processors and memories on one chip • Processor-in-Memory (PIM) • Insert processing elements directly into RAM chips • Stream Processor • Create data locality through a hierarchy of memories
Overview • Computation / Bandwidth / Power • CPU – GPU Comparison • GPU Characteristics
The GPU is a Fast, Parallel Array Processor • Input arrays: 1D, 2D (typical), 3D • Output arrays: 1D, 2D (typical), 3D (slice) • Vertex Processor (VP): kernel changes index regions of output arrays • Rasterizer: creates data streams from index regions (stream of array elements, order unknown) • Fragment Processor (FP): kernel changes each datum independently, reads more input arrays
Index Regions in Output Arrays • Quads and triangles: fastest option • Line segments: slower, try to pair lines into 2×h or w×2 quads • Point clouds: slowest, try to gather points into larger forms
High-Level Graphics Language for the Kernels • Float data types: half 16-bit (s10e5), float 32-bit (s23e8) • Vectors, structs and arrays: float4, float vec[6], float3x4, float arr[5][3], struct {} • Arithmetic and logic operators: +, -, *, /; &&, ||, ! • Trigonometric, exponential functions: sin, asin, exp, log, pow, … • User-defined functions: max3(float a, float b, float c) { return max(a,max(b,c)); } • Conditional statements, loops: if, for, while; dynamic branching in PS 3.0 • Streaming and random data access
Input and Output Arrays • CPU: input and output arrays may overlap • GPU: input and output arrays must not overlap
Native Memory Layout – Data Locality • CPU: 1D input, 1D output; higher dimensions with offsets • GPU: 1D, 2D, 3D input; 2D output; other dimensions with offsets • [Diagram: locality color-coded red (near), blue (far)]
Data-Flow: Gather and Scatter • CPU: arbitrary gather, arbitrary scatter • GPU: arbitrary gather, restricted scatter
Overview • Computation / Bandwidth / Power • CPU – GPU Comparison • GPU Characteristics
1) Computational Performance • [Chart: peak GFLOPS over time, GPU (e.g. ATI R520) vs. CPU] • Note: Sustained performance is usually much lower and depends heavily on the memory system! chart courtesy of John Owens
2) Memory Performance • Memory access types: cache, sequential, random • CPU: large cache, few processing elements, optimized for spatial and temporal data reuse • GPU: small cache, many processing elements, optimized for sequential (streaming) data access • [Chart: GeForce 7800 GTX vs. Pentium 4 bandwidth by access type] chart courtesy of Ian Buck
3) Configuration Overhead • [Chart: small workloads are configuration-limited, large workloads are computation-limited] chart courtesy of Ian Buck
Conclusions • Parallelism is now indispensable for further performance increases • Both memory-dominated and processing-element-dominated designs have pros and cons • Mapping algorithms to the appropriate architecture allows enormous speedups • Many of the GPU’s restrictions are crucial for its parallel efficiency (you cannot have your cake and eat it too)