
Why GPUs?

Explore the power, bandwidth, and design trade-offs of GPUs compared to CPUs, and the challenges and solutions in computer architecture for better performance and efficiency. Study the benefits of parallel processing on GPUs and the impact of different memory types. Learn about data processing techniques, the memory wall, and the evolution of chip design.


Presentation Transcript


  1. Why GPUs? Robert Strzodka

  2. Overview • Computation / Bandwidth / Power • CPU – GPU Comparison • GPU Characteristics

  3. Data Processing in General • Diagram: a Processor reading from IN memory and writing to OUT memory • Bottlenecks: lack of parallelism, memory wall

  4. Old and New Wisdom in Computer Architecture • Old: Power is free, transistors are expensive • New: “Power wall” – power is expensive, transistors are free (more transistors fit on a chip than can be afforded to turn on) • Old: Multiplies are slow, memory access is fast • New: “Memory wall” – multiplies are fast, memory is slow (200 clocks to DRAM memory, 4 clocks for an FP multiply) • Old: Increase instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, …) • New: “ILP wall” – diminishing returns on more ILP hardware (explicit thread and data parallelism must be exploited) • New: Power Wall + Memory Wall + ILP Wall = Brick Wall • slide courtesy of Christos Kozyrakis
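
As a rough illustration of the memory wall, the sketch below uses the slide's own cycle counts (200 clocks per DRAM access, 4 clocks per FP multiply, illustrative values rather than measurements) to estimate how much arithmetic could be issued while waiting on one memory access:

    /* memory_wall.c - rough memory-wall estimate using the slide's numbers.
       The cycle counts are illustrative values, not measurements. */
    #include <stdio.h>

    int main(void) {
        const double dram_latency_cycles = 200.0;  /* one access to DRAM     */
        const double fp_multiply_cycles  = 4.0;    /* one floating-point mul */

        /* Multiplies that fit into the time of a single DRAM access. */
        double muls_per_miss = dram_latency_cycles / fp_multiply_cycles;

        printf("~%.0f multiplies could issue during one DRAM access\n",
               muls_per_miss);
        printf("=> kernels need high arithmetic intensity or latency hiding\n"
               "   to keep the ALUs busy.\n");
        return 0;
    }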

  5. Uniprocessor Performance (SPECint) • Chart from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006 (the “3X” marks the gap to the earlier growth trend) • Sea change in chip design: multiple “cores” or processors per chip • slide courtesy of Christos Kozyrakis

  6. Instruction-Stream-Based Processing • Diagram: a Processor with a cache, driven by an instruction stream, fetching individual data items from memory and writing results back to memory

  7. Instruction- and Data-Streams • Addition of 2D arrays: C = A + B • Instruction stream processing: for(y=0; y<HEIGHT; y++) for(x=0; x<WIDTH; x++) { C[y][x] = A[y][x] + B[y][x]; } • Data streams undergoing a kernel operation: inputStreams(A,B); outputStream(C); kernelProgram(OP_ADD); processStreams();
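
For reference, a complete, runnable C version of the instruction-stream formulation above (array sizes and test data are arbitrary assumptions; the stream calls on the slide are pseudocode for a stream API, not a concrete library):

    /* add2d.c - C = A + B on 2D arrays, the instruction-stream formulation. */
    #include <stdio.h>

    #define WIDTH  4
    #define HEIGHT 3

    int main(void) {
        float A[HEIGHT][WIDTH], B[HEIGHT][WIDTH], C[HEIGHT][WIDTH];

        /* Fill the inputs with some test data. */
        for (int y = 0; y < HEIGHT; y++)
            for (int x = 0; x < WIDTH; x++) {
                A[y][x] = (float)(y * WIDTH + x);
                B[y][x] = 1.0f;
            }

        /* One instruction stream walks over all data items sequentially. */
        for (int y = 0; y < HEIGHT; y++)
            for (int x = 0; x < WIDTH; x++)
                C[y][x] = A[y][x] + B[y][x];

        printf("C[2][3] = %g\n", C[2][3]);  /* expect 11 + 1 = 12 */
        return 0;
    }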

  8. Data-Stream-Based Processing • Diagram: a Processor made of pipelines; data streams from memory through the pipelines and back to memory, steered by a pipeline configuration

  9. Architectures: Data – Processor Locality • Field Programmable Gate Array (FPGA) • Compute by configuring Boolean functions and local memory • Processor Array / Multi-core Processor • Assemble many (simple) processors and memories on one chip • Processor-in-Memory (PIM) • Insert processing elements directly into RAM chips • Stream Processor • Create data locality through a hierarchy of memories
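
On a conventional cache hierarchy, one software counterpart to "creating data locality through a hierarchy of memories" is loop blocking (tiling). The generic C sketch below is not from the talk; the problem and tile sizes are arbitrary assumptions:

    /* blocked_copy.c - tiling a 2D traversal so the working set fits in cache. */
    #include <stdio.h>

    #define N    512
    #define TILE 32            /* assumed tile edge; tune to the cache size */

    static float src[N][N], dst[N][N];

    int main(void) {
        for (int y = 0; y < N; y++)
            for (int x = 0; x < N; x++)
                src[y][x] = (float)(y * N + x);

        /* Transpose in TILE x TILE blocks: the rows read from src and the
           rows written to dst stay cache-resident while a block is processed. */
        for (int by = 0; by < N; by += TILE)
            for (int bx = 0; bx < N; bx += TILE)
                for (int y = by; y < by + TILE; y++)
                    for (int x = bx; x < bx + TILE; x++)
                        dst[x][y] = src[y][x];

        printf("dst[1][0] = %g (== src[0][1])\n", dst[1][0]);
        return 0;
    }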

  10. Overview • Computation / Bandwidth / Power • CPU – GPU Comparison • GPU Characteristics

  11. The GPU is a Fast, Parallel Array Processor • Input arrays: 1D, 2D (typical), 3D • Output arrays: 1D, 2D (typical), 3D (slice) • Vertex Processor (VP): kernel changes index regions of output arrays • Rasterizer: creates data streams from index regions (stream of array elements, order unknown) • Fragment Processor (FP): kernel changes each datum independently, reads more input arrays
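
To make these roles concrete, here is a plain C sketch that mimics the model on the CPU: an index region is expanded into a stream of (x, y) elements (the rasterizer's job) and a kernel is applied to each element independently (the fragment processor's job). Array sizes and the kernel body are illustrative assumptions, not GPU code:

    /* stream_model.c - CPU sketch of: index region -> rasterizer -> fragment kernel. */
    #include <stdio.h>

    #define W 8
    #define H 8

    /* Fragment-style kernel: computes one output element independently,
       reading only from input arrays. */
    static float kernel(const float *A, const float *B, int x, int y) {
        return A[y * W + x] * B[y * W + x];
    }

    int main(void) {
        static float A[W * H], B[W * H], C[W * H];
        for (int i = 0; i < W * H; i++) { A[i] = (float)i; B[i] = 2.0f; }

        /* "Rasterizer": enumerate the output index region (here the full array).
           The traversal order is an implementation detail - kernels must not
           rely on it. */
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                C[y * W + x] = kernel(A, B, x, y);   /* "fragment processor" */

        printf("C[63] = %g\n", C[63]);  /* expect 63 * 2 = 126 */
        return 0;
    }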

  12. Index Regions in Output Arrays • Quads and triangles: fastest option • Line segments: slower, try to pair lines into 2×h or w×2 quads • Point clouds: slowest, try to gather points into larger forms

  13. High Level Graphics Language for the Kernels • Float data types: half 16-bit (s10e5), float 32-bit (s23e8) • Vectors, structs and arrays: float4, float vec[6], float3x4, float arr[5][3], struct {} • Arithmetic and logic operators: +, -, *, /; &&, ||, ! • Trigonometric, exponential functions: sin, asin, exp, log, pow, … • User-defined functions: max3(float a, float b, float c) { return max(a, max(b,c)); } • Conditional statements, loops: if, for, while; dynamic branching in PS 3.0 • Streaming and random data access

  14. Input and Output Arrays • CPU: input and output arrays may overlap • GPU: input and output arrays must not overlap
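
Why the GPU restriction matters: with overlapping (in-place) arrays the result depends on the order in which elements are processed, which is exactly what a parallel array processor does not guarantee. A small C illustration (the smoothing kernel and data are made up for the example):

    /* overlap.c - overlapping vs. non-overlapping input/output arrays. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        float a[N] = {0, 4, 0, 4, 0, 4, 0, 4};
        float b[N];

        /* CPU-style in-place update: a[i-1] has already been overwritten,
           so the result depends on the sequential execution order. */
        for (int i = 1; i < N - 1; i++)
            a[i] = 0.5f * (a[i - 1] + a[i + 1]);

        /* GPU-style formulation: output b never aliases the input, so every
           element could be computed independently and in any order. */
        float in[N] = {0, 4, 0, 4, 0, 4, 0, 4};
        for (int i = 1; i < N - 1; i++)
            b[i] = 0.5f * (in[i - 1] + in[i + 1]);
        b[0] = in[0]; b[N - 1] = in[N - 1];

        printf("in-place  a[2] = %g\n", a[2]);   /* order-dependent: 2       */
        printf("separate  b[2] = %g\n", b[2]);   /* expect (4 + 4) / 2 = 4   */
        return 0;
    }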

  15. Native Memory Layout – Data Locality • CPU: 1D input, 1D output; higher dimensions via offsets • GPU: 1D, 2D, 3D input; 2D output; other dimensions via offsets • Diagram color-codes locality: red (near), blue (far)
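
The "higher dimensions with offsets" here is the usual flattening of a multi-dimensional index into a 1D address. A brief C illustration (array extents are arbitrary assumptions):

    /* flatten.c - addressing a 3D array stored in a flat 1D buffer. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NX 4
    #define NY 3
    #define NZ 2

    /* Row-major offset: consecutive x elements are adjacent in memory,
       so sweeping x innermost gives sequential, locality-friendly access. */
    static size_t offset(int x, int y, int z) {
        return (size_t)z * NY * NX + (size_t)y * NX + (size_t)x;
    }

    int main(void) {
        float *buf = malloc(sizeof(float) * NX * NY * NZ);
        if (!buf) return 1;

        for (int z = 0; z < NZ; z++)
            for (int y = 0; y < NY; y++)
                for (int x = 0; x < NX; x++)
                    buf[offset(x, y, z)] = 100.0f * z + 10.0f * y + x;

        printf("element (x=2, y=1, z=1) = %g at offset %zu\n",
               buf[offset(2, 1, 1)], offset(2, 1, 1));   /* 112 at 18 */
        free(buf);
        return 0;
    }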

  16. Data-Flow: Gather and Scatter • CPU: arbitrary gather, arbitrary scatter • GPU: arbitrary gather, restricted scatter
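
Gather and scatter in plain C terms (the index table below is invented for the example): a gather reads through an index array, while a scatter writes through one; the latter is what the fragment processor restricts:

    /* gather_scatter.c - gather (out[i] = in[idx[i]]) vs. scatter (out[idx[i]] = in[i]). */
    #include <stdio.h>

    #define N 5

    int main(void) {
        float in[N]  = {10, 11, 12, 13, 14};
        int   idx[N] = {4, 2, 0, 3, 1};       /* arbitrary permutation */
        float gathered[N], scattered[N];

        /* Gather: the write position is the loop index, the read is indirect.
           Each output element depends only on its own reads - GPU friendly. */
        for (int i = 0; i < N; i++)
            gathered[i] = in[idx[i]];

        /* Scatter: the write position itself is data-dependent. The fragment
           kernels described in this talk cannot choose where they write. */
        for (int i = 0; i < N; i++)
            scattered[idx[i]] = in[i];

        printf("gathered[0]  = %g (in[4])\n", gathered[0]);    /* 14 */
        printf("scattered[4] = %g (in[0])\n", scattered[4]);   /* 10 */
        return 0;
    }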

  17. Overview • Computation / Bandwidth / Power • CPU – GPU Comparison • GPU Characteristics

  18. 1) Computational Performance • GFLOPS chart (e.g. ATI R520), courtesy of John Owens • Note: sustained performance is usually much lower and depends heavily on the memory system!
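
A hedged back-of-the-envelope model of why sustained performance trails peak: if a kernel does only a few operations per element fetched, the memory system rather than the ALUs sets the rate. All numbers below are assumed round figures, not the specs of any particular chip:

    /* sustained.c - bandwidth-bound estimate of sustained vs. peak GFLOPS.
       All constants are assumed, round figures for illustration only. */
    #include <stdio.h>

    int main(void) {
        const double peak_gflops    = 300.0;   /* assumed peak arithmetic rate */
        const double bandwidth_gbs  = 40.0;    /* assumed memory bandwidth     */
        const double bytes_per_elem = 4.0;     /* one float read per element   */

        for (int flops_per_elem = 1; flops_per_elem <= 64; flops_per_elem *= 4) {
            /* Elements deliverable per second, times flops done on each. */
            double fed_gflops = bandwidth_gbs / bytes_per_elem * flops_per_elem;
            double sustained  = fed_gflops < peak_gflops ? fed_gflops : peak_gflops;
            printf("%2d flops/element -> sustained ~%6.1f GFLOPS (peak %.0f)\n",
                   flops_per_elem, sustained, peak_gflops);
        }
        return 0;
    }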

  19. 2) Memory Performance • Memory access types: cache, sequential, random • CPU (Pentium 4): large cache, few processing elements, optimized for spatial and temporal data reuse • GPU (GeForce 7800 GTX): small cache, many processing elements, optimized for sequential (streaming) data access • chart courtesy of Ian Buck
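
The access types can be felt even on a CPU with a toy experiment: summing an array sequentially versus through a shuffled index array. The sketch below only contrasts the two patterns; absolute timings depend on the machine:

    /* access_patterns.c - sequential vs. random traversal of the same data. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* ~16M floats, well beyond typical cache sizes */

    int main(void) {
        float *data = malloc(sizeof(float) * N);
        int   *perm = malloc(sizeof(int) * N);
        if (!data || !perm) return 1;

        for (int i = 0; i < N; i++) { data[i] = 1.0f; perm[i] = i; }
        for (int i = N - 1; i > 0; i--) {   /* crude shuffle, enough for a demo */
            int j = rand() % (i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }

        clock_t t0 = clock();
        double s1 = 0;
        for (int i = 0; i < N; i++) s1 += data[i];             /* sequential */
        clock_t t1 = clock();
        double s2 = 0;
        for (int i = 0; i < N; i++) s2 += data[perm[i]];       /* random     */
        clock_t t2 = clock();

        printf("sequential: %.3fs   random: %.3fs   (sums %.0f / %.0f)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
        free(data); free(perm);
        return 0;
    }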

  20. 3) Configuration Overhead • Chart (courtesy of Ian Buck): small workloads are configuration-limited, large workloads are computation-limited
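
A simple break-even model for this slide: every kernel launch pays a fixed configuration cost plus a much smaller cost per element, so small arrays are configuration-limited and large arrays computation-limited. The constants are invented for illustration:

    /* config_overhead.c - fixed setup cost vs. per-element cost. */
    #include <stdio.h>

    int main(void) {
        const double setup_us    = 50.0;     /* assumed per-launch overhead    */
        const double per_elem_us = 0.001;    /* assumed cost per array element */

        for (int n = 1000; n <= 100000000; n *= 10) {
            double total = setup_us + per_elem_us * (double)n;
            printf("n = %9d: %12.1f us total, %5.1f%% spent on configuration\n",
                   n, total, 100.0 * setup_us / total);
        }
        return 0;
    }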

  21. Conclusions • Parallelism is now indispensable for further performance increases • Both memory-dominated and processing-element-dominated designs have pros and cons • Mapping algorithms to the appropriate architecture allows enormous speedups • Many of the GPU’s restrictions are crucial for parallel efficiency (eat the cake or have it)
