
Trial Lecture The Use of GPUs for High-Performance Computing 12. October 2010 Magnus Jahre


Presentation Transcript


  1. Trial Lecture The Use of GPUs for High-Performance Computing 12. October 2010 Magnus Jahre

  2. Graphics Processors (GPUs) • Modern computers are graphics intensive • Advanced 3D graphics require a significant amount of computation • Solution: Add a Graphics Processor (GPU) • Graphics Card (Source: nvidia.com)

  3. High-Performance Computing (HPC) • Efficient use of computers for computationally intensive problems in science or engineering • Examples: computational computer architecture, molecular dynamics simulation, weather forecasting, climate modeling • [Figure: applications placed by processing demand and communication demand (third dimension: main memory capacity); office applications sit low on both axes, HPC applications high] • General-Purpose Programming on GPUs (GPGPU)

  4. Outline • GPU Evolution • GPU Programming • GPU Architecture • Achieving High GPU Performance • Future Trends • Conclusions

  5. GPU Evolution

  6. First GPUs: Fixed Hardware • Fixed-function graphics pipeline: Vertex Processing → Rasterization → Fragment Processing → Framebuffer Operations • Pipeline inputs and buffers: Vertex Data, Texture Maps, Depth Buffer, Color Buffer [Blythe 2008]

  7. Programmable Shaders • Motivation: More flexible graphics processing • Same pipeline (Vertex Processing → Rasterization → Fragment Processing → Framebuffer Operations), but the vertex and fragment processing stages are now programmable

  8. GPGPU with Programmable Shaders • Use a graphics library to gain access to the GPU • Encode data as color values • The effect of the fixed-function stages (rasterization, framebuffer operations) must be accounted for

  9. Functional Unit Utilization • The pipeline has dedicated hardware units for Vertex Processing and Fragment Processing • [Figure: the fixed pipeline with the vertex and fragment processing stages highlighted]

  10. Functional Unit Utilization • A vertex-intensive shader keeps the vertex units busy while the fragment units idle; a fragment-intensive shader does the opposite • A unified shader lets the same units handle both workloads

  11. Unified Shader Architecture • A thread scheduler dispatches work to a large array of identical streaming processors (SPs) • Exploit parallelism: data parallelism and task parallelism • Data parallel processing (SIMD/SIMT) • Hide memory latencies • High bandwidth: interconnect, on-chip memory or cache, off-chip DRAM • The architecture naturally supports GPGPU

  12. GPU Programming

  13. GPGPU Tool Support • Tools over time: Sh, GPU++, Accelerator, PeakStream, CUDA, OpenCL • Hardware eras: programmable shaders, then unified shaders • [Chart: number of GPU papers at the Supercomputing conference per year, 2000-2010]

  14. Compute Unified Device Architecture (CUDA) • Most code is normal C++ code • Code to run on the GPU is organized in kernels • The CPU sets up and manages the computation

__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    int N = 512;
    // ...
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    // ...
}

  15. Thread/Data Organization • Hierarchical thread organization: Grid → Block → Thread • A block can have a maximum of 512 threads • 1D, 2D and 3D mappings are possible • [Figure: a grid as a 1D array of blocks and as a 2D array of blocks]
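
A minimal sketch of how this hierarchy is used in practice (the kernels, data layout and launch parameters below are illustrative assumptions, not taken from the lecture): a kernel typically combines blockIdx, blockDim and threadIdx into a global element index.

__global__ void scale_1d(float* data, int n) {
    // Global index = block offset + thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the last block may be partially used
        data[i] *= 2.0f;
}

__global__ void scale_2d(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;                 // row-major 2D layout
}

// Possible launch: 16x16-thread blocks (256 threads, within the 512-thread limit)
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale_2d<<<grid, block>>>(data_d, width, height);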

  16. Vector Addition Example • Vectors A, B and C move between CPU main memory and GPU global memory; the SPs operate on them, possibly staging data in local memory • A collection of concurrently processed threads is called a warp
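
The slides elide the host-side setup; below is a minimal sketch of the standard allocation/copy pattern (the array names a_d, b_d, c_d follow the earlier CUDA slide, everything else is an illustrative assumption, not the lecture's exact code).

#include <cuda_runtime.h>
#include <cstdlib>

__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    int N = 512;
    size_t bytes = N * sizeof(float);

    // Host vectors in CPU main memory
    float* a_h = (float*)malloc(bytes);
    float* b_h = (float*)malloc(bytes);
    float* c_h = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) { a_h[i] = (float)i; b_h[i] = 2.0f * i; }

    // Device vectors in GPU global memory
    float *a_d, *b_d, *c_d;
    cudaMalloc(&a_d, bytes); cudaMalloc(&b_d, bytes); cudaMalloc(&c_d, bytes);

    // Copy inputs to the GPU, launch one block of N threads, copy the result back
    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, bytes, cudaMemcpyHostToDevice);
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    cudaMemcpy(c_h, c_d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a_h); free(b_h); free(c_h);
    return 0;
}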

  17. Terminology: Warp

  18. Vector Addition Profile • Only 11% of GPU time is used to add vectors • The arithmetic intensity of the problem is too low • Overlapping data copy and computation could help • Hardware: NVIDIA NVS 3100M
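
A minimal sketch of the suggested overlap, assuming pinned host memory, two CUDA streams and a bounds-checked kernel (chunk size, kernel and names are illustrative, not from the lecture):

__global__ void vector_add_n(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

void add_overlapped(const float* a_h, const float* b_h, float* c_h,
                    float* a_d, float* b_d, float* c_d, int N) {
    // Host buffers must be pinned (cudaMallocHost) for async copies to overlap.
    const int CHUNK = 1 << 20;
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int offset = 0, s = 0; offset < N; offset += CHUNK, s ^= 1) {
        int n = (N - offset < CHUNK) ? (N - offset) : CHUNK;
        size_t bytes = n * sizeof(float);
        // Copies and the kernel for this chunk are queued in one stream; work
        // queued in the other stream can proceed concurrently, overlapping
        // data transfer with computation.
        cudaMemcpyAsync(a_d + offset, a_h + offset, bytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(b_d + offset, b_h + offset, bytes, cudaMemcpyHostToDevice, streams[s]);
        vector_add_n<<<(n + 255) / 256, 256, 0, streams[s]>>>(a_d + offset, b_d + offset, c_d + offset, n);
        cudaMemcpyAsync(c_h + offset, c_d + offset, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}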

  19. Will GPUs Save the World? • Careful optimization of both CPU and GPU code reduces the performance difference between GPUs and CPUs substantially [Lee et al., ISCA 2010] • GPGPU has provided nice speedups for problems that fit the architecture • Metric challenge: The practitioner needs performance per developer hour

  20. GPU Architecture

  21. NVIDIA Tesla Architecture Figure reproduced from [Lindholm et al.; 2008]

  22. Control Flow • The threads in a warp execute the same instruction • Branching is efficient if all threads in a warp branch in the same direction • Divergent branches within a warp cause serial execution of both paths • [Figure: an IF statement splits a warp into condition-true and condition-false threads, which execute one after the other]
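
A minimal sketch of the difference in kernel code (a warp size of 32 is assumed; the kernels and operations are illustrative):

__global__ void divergent(float* x) {
    int i = threadIdx.x;
    // Even and odd threads sit in the same warp, so the warp executes
    // both paths one after the other.
    if (i % 2 == 0)
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}

__global__ void convergent(float* x) {
    int i = threadIdx.x;
    // The condition is uniform within each group of 32 threads, so every
    // warp takes exactly one path and no serialization occurs.
    if ((i / 32) % 2 == 0)
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}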

  23. Modern DRAM Interfaces • 3D organization (banks × rows × columns) maximizes bandwidth • An access brings a row into the row buffer; repeated requests to the row buffer are very efficient • [Figure: a DRAM bank addressed by row address and column address through the row buffer]

  24. Access Coalescing • Global memory accesses from all threads in a half-warp are combined into a single memory transaction • All memory elements in a segment are accessed • Segment size can be halved if only the lower or upper half is used • [Figure: threads 0-7 accessing consecutive addresses 128-156 are served by one transaction; addresses 112-124 form a separate transaction segment] • Assumes Compute Capability 1.2 or higher
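
A minimal sketch contrasting an access pattern that coalesces with one that does not (the structure layout is an illustrative assumption):

__global__ void coalesced_copy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];        // thread k touches word k: the half-warp's accesses
                               // fall in one segment and coalesce into one transaction
}

struct Particle { float x, y, z, w; };

__global__ void strided_read(const Particle* p, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = p[i].x;       // thread k touches every fourth word: the accesses
                               // span several segments and need several transactions
}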

  25. Bank Conflicts • Memory banks can service requests independently • Bank conflict: more than one thread accesses a bank concurrently • Strided access patterns can cause bank conflicts; stride-two accesses give a 2-way bank conflict • [Figure: threads 0-7 mapped onto banks 0-7; with stride two, pairs of threads hit the same bank]
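
A minimal sketch of the stride-two case in on-chip shared memory (16 banks of 4-byte words are assumed, as on this hardware generation; names are illustrative):

// Launch with 256 threads per block.
__global__ void stride_two_demo(const float* in, float* out) {
    __shared__ float buf[512];
    int t = threadIdx.x;
    buf[2 * t] = in[blockIdx.x * 256 + t];   // stride-two store: threads t and t+8 of a
    buf[2 * t + 1] = 0.0f;                   // half-warp map to the same bank (2-way conflict)
    __syncthreads();

    float ok = buf[t];                       // stride-one load: each thread of a half-warp
                                             // hits a different bank, so no conflict
    out[blockIdx.x * 256 + t] = ok + buf[2 * t];
}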

  26. NVIDIA Fermi • Next generation computing chip from NVIDIA • Aims to alleviate important bottlenecks • Improved double precision floating point support • Cache hierarchy • Concurrent kernel execution • More problems can be solved efficiently on a GPU Figure reproduced from [NVIDIA; 2010]

  27. Achieving High GPU Performance

  28. Which problems fit the GPU model? • Fine-grained data parallelism available • Sufficient arithmetic intensity • Sufficiently regular data access patterns • It's all about organizing data: optimized memory system use enables high performance

  29. Increase Computational Intensity • Memory types: • On-chip shared memory: small and fast • Off-chip global memory: large and slow • Technique: tiling (illustrated for matrix multiplication A × B = C) • Choose the tile size such that a tile fits in shared memory • Increases locality by reducing reuse distance; the staged tiles are reused many times
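
A minimal sketch of the tiling technique for A × B = C, following the standard shared-memory pattern rather than the lecture's exact code (tile size and names are assumptions):

#define TILE 16

// C = A * B for square n x n matrices in row-major order; n is assumed to be a
// multiple of TILE. Launch with dim3 block(TILE, TILE), dim3 grid(n/TILE, n/TILE).
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Stage one tile of A and one tile of B in fast on-chip shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Each staged element is reused TILE times from shared memory instead of
        // being re-read from slow global memory, raising arithmetic intensity.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}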

  30. Memory Layout • Exploit coalescing to achieve high bandwidth • Linear access is necessary • [Figure: for A × B = C with row-major storage, accesses that walk along a row are coalesced, while accesses that walk down a column are not] • Solution: tiling

  31. Avoid Branching Inside Warps • Example: tree reduction of 16 elements, assuming 2 threads per warp (W1-W4) • [Figure: with one thread-to-element assignment, active and inactive threads mix within every warp, so all iterations diverge; with the other assignment, only one iteration diverges]
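
A minimal sketch of the two thread assignments for a block-wide sum reduction (256 threads per block and a 32-thread warp are assumed; this follows the standard pattern rather than the slide's 2-threads-per-warp illustration):

__global__ void reduce_divergent(const float* in, float* out) {
    __shared__ float s[256];
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * 256 + t];
    __syncthreads();

    // Interleaved pairing: the active threads (t % (2*stride) == 0) are spread
    // across all warps, so every iteration contains divergent warps.
    for (int stride = 1; stride < 256; stride *= 2) {
        if (t % (2 * stride) == 0)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];
}

__global__ void reduce_contiguous(const float* in, float* out) {
    __shared__ float s[256];
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * 256 + t];
    __syncthreads();

    // Contiguous pairing: active threads are packed at the low indices, so whole
    // warps retire together and divergence only appears once stride < warp size.
    for (int stride = 128; stride > 0; stride /= 2) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];
}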

  32. Automation • Thread resource usage must be balanced with the number of concurrent threads [Ryoo et al., PPoPP08] • Avoid saturation • Sweet spot will vary between devices • Sweet spot varies with problem sizes • Auto-tuning 3D FFT [Nukada et al.; SC2009] • Balance resource consumption vs. parallelism with kernel radix and ordering • Best number of thread blocks chosen automatically • Inserts padding to avoid shared memory bank conflicts

  33. Case Study: Molecular Dynamics Simulation with NAMD • Simulate the interaction of atoms due to the laws of atomic physics and quantum chemistry [Phillips; SC2009]

  34. Key Performance Enablers • Careful division of labor between GPU and CPU • GPU: Short range non-bonded forces • CPU: Long-range electrostatic forces and coordinate updates • Overlap CPU and GPU execution through asynchronous kernel execution • Use event recording to track progress in asynchronously executing streams [Phillips et al., SC2008]
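
The asynchronous kernel execution and event recording mentioned here map onto the CUDA stream/event API; the sketch below shows that pattern under stated assumptions (the kernel, the CPU routine and all names are illustrative placeholders, not NAMD code):

#include <cuda_runtime.h>

// Sketch of one time step: overlap GPU force computation with CPU work using a
// stream and an event.
void timestep(float* coords_d, float* forces_d, float* forces_h, size_t bytes) {
    (void)coords_d;  // used only by the hypothetical kernel below
    cudaStream_t stream;
    cudaEvent_t forces_done;
    cudaStreamCreate(&stream);
    cudaEventCreate(&forces_done);

    // GPU: short-range non-bonded forces, launched asynchronously into the stream.
    // compute_nonbonded<<<grid, block, 0, stream>>>(coords_d, forces_d);  // hypothetical kernel
    cudaEventRecord(forces_done, stream);          // progress marker in the stream

    // CPU: long-range electrostatics and coordinate updates run in the meantime.
    // compute_long_range_on_cpu(...);             // hypothetical CPU routine

    // The CPU can poll the event to track progress instead of blocking.
    while (cudaEventQuery(forces_done) == cudaErrorNotReady) {
        /* do more CPU work */
    }

    // Copy the GPU results back and wait for the stream to drain.
    cudaMemcpyAsync(forces_h, forces_d, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaEventDestroy(forces_done);
    cudaStreamDestroy(stream);
}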

  35. CPU/GPU Cooperation in NAMD • [Timeline figure: the GPU computes remote and then local forces (f) while the CPU handles coordinate (x) and force (f) communication and the local update; the two overlap in time] [Phillips et al., SC2008]

  36. Challenges • Completely restructuring legacy software systems is prohibitive • Batch processing software is unaware of GPUs • Interoperability issues with pinning main memory pages for DMA [Phillips et al., SC2008]

  37. Future Trends

  38. Accelerator Integration • The industry is moving towards integrating CPUs and GPUs on the same chip • AMD Fusion [Brookwood; 2010] • Intel Sandy Bridge (fixed function GPU) • Are other accelerators appropriate? • Single-chip Heterogeneous Computing: Does the future include Custom Logic, FPGAs, and GPUs? [Chung et al.; MICRO 2010] • AMD Fusion figure reproduced from [Brookwood; 2010]

  39. Vector Addition Revisited • Start-up and shut-down data transfers are the main bottleneck • Fusion eliminates these overheads by storing values in the on-chip cache • Using accelerators becomes more feasible

  40. Memory System Scalability • Current CPU bottlenecks: • Number of pins on a chip grows slowly • Off-chip bandwidth grows slowly • Integration only helps if there is sufficient on-chip cooperation to avoid significant increase in bandwidth demand • Conflicting requirements: • GPU: High bandwidth, not latency sensitive • CPU: High bandwidth, can be latency sensitive

  41. Conclusions

  42. Conclusions • GPUs can offer a significant speedup for problems that fit the model • Tool support and flexible architectures increase the number of problems that fit the model • CPU/GPU on-chip integration can reduce GPU start-up overheads

  43. Thank You Visit our website: http://research.idi.ntnu.no/multicore/

  44. References • Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU; Lee et al.; ISCA; 2010 • Programming Massively Parallel Processors; Kirk and Hwu; Morgan Kaufmann; 2010 • NVIDIA’s Next Generation CUDA Compute Architecture: Fermi; White Paper; NVIDIA; 2010 • AMD Fusion Family of APUs: Enabling a Superior Immersive PC Experience; White Paper; AMD; 2010 • Multi-core Programming with OpenCL: Performance and Portability; Fagerlund; Master Thesis; NTNU; 2010 • Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures; Yuan et al.; MICRO; 2009 • Auto-Tuning 3-D FFT Library for CUDA GPUs; Nukada and Matsuoka; SC; 2009 • Programming Graphics Processing Units (GPUs); Bakke; Master Thesis; NTNU; 2009 • Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters; Phillips et al.; SC; 2008 • Rise of the Graphics Processor; Blythe; Proceedings of the IEEE; 2008 • NVIDIA Tesla: A Unified Graphics and Computing Architecture; Lindholm et al.; IEEE Micro; 2008 • Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA; Ryoo et al.; PPoPP; 2008

  45. Extra Slides

  46. Complexity-Effective Memory Access Scheduling • The on-chip interconnect may interleave requests from different thread processors, destroying row locality • Solution: an order-preserving interconnect arbitration policy combined with in-order scheduling • [Figure: for a queue of requests to rows A and B, out-of-order scheduling groups same-row requests and needs a single row switch, while in-order scheduling of the interleaved queue needs a row switch between most requests] • Goal: the performance of out-of-order scheduling with less complex in-order scheduling hardware [Yuan et al., MICRO 2009]
