330 likes | 439 Views
Mike Metzger mmetzg3@uic.edu MS - ECE. GP GPU Applications and Simulations. GPU History. Term coined in 1999 by NVIDIA with the release of the GeForce 256 Real “first” GPU was the 1985 Commodore Amiga: graphics coprocessor with a primitive instruction set
E N D
Mike Metzger mmetzg3@uic.edu MS - ECE GP GPU Applicationsand Simulations
GPU History • Term coined in 1999 by NVIDIA with the release of the GeForce 256 • Real “first” GPU was the 1985 Commodore Amiga: graphics coprocessor with a primitive instruction set • Modern day: GPUs are commercially available at low cost and prolific in computing systems • Primarily used (obviously) for displaying and rapidly altering video data which is inherently data-parallel • Architecture has become increasingly parallel over the last decade • GPUs are structured as SIMD machines to exploit high levels of DLP
GPU Capabilities • Large matrix/vector operations • Protein folding (molecular dynamics) modeling • FFT (signal processing) • Physics simulations • Sequence matching • Speech recognition • Database manipulation • Sort/search algorthims • Medical imaging
GPGPU Origins • C. Thompson, S. Hahn, M. Oskin: “Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis.” International Symposium on Microarchitecture (MICRO), Turkey, Nov. 2002 • Use the parallel architecture of GPUs to exploit data-level parallelism in common processes • Wrote framework for testing various data-heavy operations in C++, implemented through OpenGL programming interface • Run tests on Arithmetic, exponential, factorial, and multiplicative operation on large (10k-10 million member) vectors • Uses NVIDIA's GeForce 4 with 128 MB of VRAM (18 specialized cores), compare results to 1.5 GHz Pentium IV with 1 GB of RAM • No modifications to GPU or CPU hardware, test programs compiled with Microsoft Visual C++ 6
GeForce4 Architecture • Provides ISA for vertex programming – registers hold and process quad-valued FP numbers • Input and output attribute registers hold various graphical data • No access to main memory – 96 constant registers used instead (video memory can be filled pre-runtime by CPU) • 21 instructions available – mostly operate on all 4 input components • Vertex programs have an instruction limit of 128 [Thompson, MICRO 2002]
Programming Framework • C++ framework for general-purpose programs using vector operations implemented through OpenGL (GPU API) • Abstract the GPU functionality using C++ data types operated on by GPU assembly programs • DVector: vector class, allocated a buffer of video memory • DProgram: contains GPU assembly program written via array of strings • DFunction: contains a DProgram, input/output DVectors, bindings for constant registers ; executes the Dprogram and converts vectors to quad-value format, reduce CPU usage in scalar computation using quad-floats • Dsemaphore: object to stall CPU when waiting for GPU results [Thompson, MICRO 2002]
2002 results [Thompson, MICRO 2002]
2002 Results • Arithmetic vector operation – GPU ~6.4 times faster for large vectors • CPU run time doubles with doubled program complexity, GPU only triples with 12x program size • Matrix multiplication: GPU ~3.2 times faster • Boolean SAT: GPU ~2 times faster at large input sizes • Proved that GPUs can be used for general-purpose computation and will result in significant speedup for applications with DLP [Thompson, MICRO 2002]
Moving Forward – Stream Processing • I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan: “Brook for GPUs: Stream Computing on Graphics Hardware.” Special Interest Group on Graphics and Interactive Techniques (SIGGRAPH), 2004. • In 2004, GPGPUs were becoming a legitimate tool, but there was no universal tool for programming a GPU to be used this way • Brook was an attempt by Stanford to create and share a stream programming model for GPGPU computing • Extension of C language to include data-parallel constructs: streams and kernels, allowing for SIMD-type operation • Stream: collection of data that can be opened in parallel • Kernel: special function built to operate upon streams, called with input and output stream(s) • Brook compiler maps this language to existing GPU APIs (specifically DirectX and OpenGL)
Brook Goals & Methods • Purpose: extend C to include data-parallel constructs for using a GPU as a stream processor • Uses streams (collection of data) and kernels (functions that operate on streams) to express DLP native to various applications • Improves arithmetic intensity by containing program computation within kernels • Implementation is void of explicit graphics constructs and thus capable of being used on any architecture and API (NVIDIA/ATI and DirectX/OpenGL) • Abstracts the GPU computing a higher level to remove need for knowledge of DirectX or OpenGL [Buck, SIGGRAPH, 2004]
Brook Implementation • Map kernels to Cg shaders, streams represented as floating point textures • Brook Runtime (BRT) library allows for input/output streams to be rendered to a display • Streams can be mapped to multiple textures to allow larger sizes than available on GPU architecture (2048x2048 or 4096x4096) • Use fragment processor to execute kernels over the streams present in textures: non-stream arguments passed via constant registers, apply shader compiler to create GPU assembly, map process to fragment shaders [Buck, SIGGRAPH, 2004]
Brook Performance Results (2004) • Compares optimized reference implementation to Brook DirectX and OpenGL variants • Normalized by CPU performance (black) [Buck, SIGGRAPH, 2004]
Brook Results (cont) • SAXPY: vector scaling and addition (y = ax + y) • SGEMV: matrix-vector product, scaled vector addition (y = nAx + my) • Segment: nonlinear diffusion-based region-growing algorithm, primarily used in medical image processing • FFT: fast Fourier transform, used in graphical post-processing • GPGPU performance increases with limited data reuse (SAXPY vs FFT) and increased arithmetic intensity (Segment vs SGEMV) • Brook implementations within 80% of hand-coded (optimized) versions • Important factor: read/write bandwidth – resulted in NVIDIA performing worse than ATI due to this difference (1.2 Gfloats/s vs 4.5 Gfloats/s) • Made available as open source code for GPGPU software developers [Buck, SIGGRAPH, 2004]
Moving forward – CUDA (2007) • CUDA: Compute Unified Device Architecture • Parallel computing architecture from NVIDIA, released in 2007 and compatible with GeForce 8 series and beyond (2006+) • As GPGPUs became popular, there was a need for a universal tool to access the virtual instruction set and parallel architecture of commercial GPUs • CUDA provides an API for software developers to usewith a public SDK • Modern GPU: GTX 690 has 3072 CUDA cores, 4096 MB of device memory, 6.9 billion transistors
Modern GPGPU uses • Arithmetic: matrix and vector operations • Modeling molecular dynamics (protein folding, etc) • FFT (signal processing, graphical post-processing) • Physics simulations and engines (ex: modern games) • Speech recognition • Medical imaging • Instruction Set Simulator • Parity-Check Decoding
ISA Simulator • S. Raghav, M. Ruggiero, D. Atienza, C. Pinto, A. Marongiu and L. Benini: “Scalable instruction set simulator for thousand-core architectures running on GPGPUs”, Proceedings of High Performance Computing and Simulation (HPCS), pp.459-466, June/July 2010. • Improve current standards of processor simulation by exploiting parallelism available in GPGPUs • Accurate sequential simulators already exist (Cotson, m5, mpi-sim), much harder to efficiently simulate more complex environments • Two fields: high-performance (x86) and embedded (ARM) • CUDA threads simulate one or more cores, global memory provides a context structure and control logic for each simulated CPU • Written in C++ and CUDA, simulates both instruction sets on NVIDIA GTX 295 with Intel i7 running Linux • GTX 295: 2 GTX200 GPUs with 30 Streaming Multiprocessors (SM) including 240 stream processors, 938 MB VRAM
Instruction Set Simulator - ARM • Supports all non-Thumb ARM instructions • Functional blocks for Fetch, Decode, and Execute placed on CUDA model • Texture memory used to hold LUTs of instructions (working like a cache) – 16KB available • SPMD simulation allows CUDA threads to run concurrently, MIMD task-based applications sometimes become serialized if branches are data-dependent • 16 GP registers, status & auxiliary registers • Large matrix holds execution context for each processor [Raghav, HPCS, 2010]
Instruction Set Simulator - x86 • Simulator must support Intel IA-32 ISA • Context is held in 8 32-bit GP registers, 6 segment regs and various control registers • CISC architecture with complex decoding logic leads to some serialization: threads may branch to different functions/kernels depending on the parsed operation • Task-based parallel applications incur performance hit when branches are data dependent • CUDA concurrency is compromised by variable length instructions [Raghav, HPCS, 2010]
ISS Testing & Results • Best Case (BC): application has SIMD DLP, same kernel is running on different data subsets, all cores fetch same instructions • Worst Case (WC): application has task-level parallelism (MIMD), cores may operate on different data sets, cores diverge in instruction retrieval due to data dependent branches • Single Kernel (SK): entire ISS run in one CUDA kernel, components simulated in successive steps of one function • Multiple Kernels (MK): system components modeled in separate CUDA kernels, requires many memory tranfers of device state when kernels swap & launch • ARM performance dependent upon kernel swaps (SK vs MK) • X86 performance dependent upon application type (BC vs WC) [Raghav, HPCS, 2010]
Simulation Results SK MK WC [Raghav, HPCS, 2010] BC
Simulation Results SK MK WC [Raghav, HPCS, 2010] BC
ISS Testing – Real Workloads • Test the ISS using real workloads to see if theoretical speedup is possible with real applications • Matrix Multiplication, IDCT, FFT • Use parallelization scheme like OpenMP to distribute workload • Static loop parallelization: identical # of consecutive iterations are assigned to parallel threads • Processor ID determines which dataset to use (HW2/3 scheme B) • Stack-allocated variables determine lower and upper bounds of functional loops • Simulation speedup – speedup relative to serial simulation of varying number of cores • Application speedup – speedup of parallel simulation over a single simulated core [Raghav, HPCS, 2010]
Speedup Results • Takeaway: architecture is scalable to and beyond 1000 cores • ~500-1000x speedup for best case scenarios (near ideal 1024) [Raghav, HPCS, 2010]
Parallel Nonbinary LDPC Decoding • G. Wang, H. Shen, B. Yin, M. Wu, Y. Sun, J. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU”, 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012. • Low-Density Parity-Check Codes (LDPC) are error-correcting codes over a Galois (or finite) field • Finite Field: commutative ring in abstract algebra containing multiplicative inverse for every non-zero element • Current implementations of LDPC decoding algorithms have poor flexibility & scalability • Complexity of LDPC decoding algorithms increases greatly going from binary to nonbinary codes (with q>2 for GF(q)) • Goal: create a highly parallel and flexible decoder supporting different code types, variable code lengths, and the ability to run of various devices • Use OpenCL to employ a SIMT model to exploit LDPC decoding's inherent DLP
LDPC Decoding – Nonbinary LDPC • Parity check matrix H (spare q-ary MxN matrix) with elements defined in a Galois field GF(q) • Can be represented by a Tanner graph: each row of H → check node, each column of H → variable node • M(n) is the set of check nodes for variable node nN(m) is the set of variable nodes for check node m • Row weight of a check node = dc • Belief Propagation (BP) algorithm is one of the best decoding algorithms, this implementation uses the Min-Max approximation algorithm to exploit DLP [Wang, ASILOMAR, 2012]
Implementation – Complexity Analysis • Computation kernels of nonbinary (q>2) LDPC becomes more complex for check node processing (O(dc*q2) vs O(dc)) • CNP and VNP take up 91.64% and 6.43% of serial runtime respectively [Wang, ASILOMAR, 2012]
Implementation – Algorithm Mapping • Develop work flow of decoding process • Computation is all done on GPU to keep intermediate messages in device memory • Use 5 OpenCL kernels to exploit DLP, distribute effectively • Work items (q) become CUDA threads, work groups (M) become CUDA thread blocks: all have the same computation path and memory access patterns [Wang, ASILOMAR, 2012]
Implementation – Nonbinary Arithmetic & Efficient Data Structures • Addition and subtraction of nonbinary elements achieved through XOR operations • Use LUT with expq & logq for multiplication & division:a*b = expq[(logq[a]+logq[b]+q-1)%(q-1)]a/b = expq[(logq[a]+logq[b]+q-1)%(q-1)] • Expq and logq are used frequently → keep in local memory • Compress H horizontally & vertically to create more efficient structure [Wang, ASILOMAR, 2012]
Implementation – Accelerating Forward-Backward Algorithm in CNP • Original algorithm shown has O(qdc), revised has O(dc*q2) • Forwarded messages vector Fi(a) stored in local memory, updated by q work items in parallel for each stage (i) • Use a barrier function after each stage for synchronization • Requires 2*sizeof(cl_float)*q*dc of local memory • 1.5KB for (3,6)-regular GF(32)(used in this implementation) [Wang, ASILOMAR, 2012]
Implementation – Coalescing GlobalMemory Access • Rm,n(a) and Qm,n(a) are complex 3D structures located in global memory • Arrange in [N,q,M] format rather than [M,N,q] so that q work items always access data stored contiguously • Enables coalesced memory access → ~4-5x speedup [Wang, ASILOMAR, 2012]
Nonbinary LDPC Decoding - Results • Run on 2 CPUs and 1 GPU:Intel i7-640LM (dual core, 2.93 GHz)AMD Phenom II X9-940 (quad core, 2.9 GHz)NVIDIA GTX470 (448 stream processors, 1.215 GHz, 1280MB device memory) • 2.47 speedup for OpenCL over serial C on Intel i76.67 speedup for OpenCL over serial C on AMD Phenom II • GPU has 69.92 speedup over Intel i7 and 33.46 over AMD Phenom IIWorse case speedups: 38.48 and 18.41 [Wang, ASILOMAR, 2012]
Nonbinary LDPC Decoding - Results • GPU algorithm had 693.5 Kbps throughput and 1260 Kbps throughput with early termination • Nonbinary decoders have complexity of 2q2 ~ 3q2 higher thanbinary decoders • With q=32 in the samples run, this results in a 2000~3000x increase in complexity • Due to massive parallelization in the decoding algorithm and the GPU, the gap between binary and nonbinary implementation is reduced to 50x • This type of LDPC decoding (with short codewords and high GF(q) values) is most common in LDPC research & application, although better speedups & throughput values are found in this implementation with longer codewords & lower GF(q) values [Wang, ASILOMAR, 2012]
Citations • Chris J. Thompson, Sahngyun Hahn, Mark Oskin: “Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis.” International Symposium on Microarchitecture (MICRO), Turkey, Nov. 2002. • Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan: “Brook for GPUs: Stream Computing on Graphics Hardware.” Special Interest Group for Graphics and Interactive Techniques (SIGGRAPH), Los Angeles, Aug. 2004. • Shivani Raghav, Martino Ruggiero, David Atienza, Christian Pinto, Andrea Marongiu, Luca Benini: “Scalable Instruction Set Simulator for Thousand-core Architectures Running on GPGPUs.” Proceedings of High Performance Computing and Simulation (HPCS), pp. 459-466, France, June/July 2010. • Guohui Wang, Hao Shen, Bei Yin, Michael Wu, Yang Sun, Joseph R. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU.” 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012.