The Fundamentals of GPU Technology and CUDA Programming Nicholas Lykins Kentucky State University May 7, 2012
Outline • Introduction • Why pursue GPU accelerated computing? • Performance figures • Historical background • Graphics rendering pipeline • History of GPU technology • NVIDIA and GPU implementations • Alternative GPU processing frameworks • CUDA • Background and available libraries • Terminology • Architectural design • Syntax • Hands-on CUDA sample demonstration • Line by line illustration of code execution • Animated execution pipeline for sample application • Conclusion and future outlook
Thesis Guidelines • Initial goal: Demonstrate the potential for GPU technology to further enhance the data processing capabilities of the scientific community. • Objectives • Deliver an account of the history of GPU technology • Provide an overview of NVIDIA’s CUDA framework • Demonstrate the motivation for scientists to pursue GPU acceleration and apply it to their own scientific disciplines
High-Performance Computing • Multi-Core Processing • GPU Acceleration • ….How are they different? • Hardware differences: CPU vs. GPU
Hardware Review • CPU (SISD: Single Instruction, Single Data) • Control unit, arithmetic and logic unit, internal registers, internal data bus • Speed limitations • One instruction operates on one data element at a time • GPU (SIMD: Single Instruction, Multiple Data) • Many processing cores and onboard memory • Parallel execution across all cores • One instruction operates on many data elements at once
Performance Trends • GPU processing time is measurably faster than comparable CPU processing time when working with large-scale input data.
GPU Technology – Pipeline Overview • Graphics rendering pipeline • Entire process through which an image is generated by a graphics processing device • Vertex calculations • Color generation • Shadows and lighting • Shaders • Specialized program executed as a function of graphics processing hardware to produce a particular aspect of the resulting image
Traditional Pipelining Process • Traditional pipelining process • System collects data to be graphically represented • Modeling transformations within the world space • Vertices are “shaded” according to various properties • Lighting, materials, textures • Viewing transformation is performed – reorienting the graphical object with respect to the human eye • Clipping is performed, eliminating constructed content outside the frustum
Traditional Pipelining, Continued • The three-dimensional scene is then rendered onto a two-dimensional viewing plane, or screen space • Rasterization takes place, in which the continuous geometric representation of objects is translated into a set of discrete fragments for a particular display • Color, transparency, and depth • Fragments are stored within the frame buffer, where Z-buffering and alpha blending determine how the final pixels appear on the screen.
Graphics Processing APIs (Application Programming Interfaces) • OpenGL • OpenGL 1.0 first developed by Silicon Graphics in 1992. • An early middle layer for interpreting between the operating system and the underlying graphics hardware. • An industry-wide standard for graphics development, with each vendor crafting its hardware architecture with those standards in mind. • Cross-platform compatibility • DirectX • Developed by Microsoft employees Craig Eisler, Alex St. John, and Eric Engstrom in 1995, to give programmers low-level hardware access within Windows’ restricted memory space. • Set of related APIs (Direct3D, DirectDraw, DirectSound) that enable multimedia development. • Each vendor provides a device driver that enables compatibility for its own hardware across all Windows systems. • Restricted to Windows only.
GeForce 256 • Announced in August 1999, it was marketed by NVIDIA as the world’s first GPU. • Integration of all graphics processing actions onto a single chip. • Implemented with a fixed-function rendering pipeline
Programmable Pipeline • OpenGL 2.0 • Programmable shaders • Programmers could write unique instructions for accessing hardware functionality • Programmability enabled by dedicated shading languages • ARB • Low-level, assembly-based language for directly interfacing with hardware elements • Unintuitive and difficult to use effectively • GLSL (OpenGL Shading Language) • High-level language derived from C • High-level code is translated into corresponding low-level instructions interpreted as ARB language • Cg • High-level shader language designed by NVIDIA • Compiles into assembly-based and GLSL code for interpretation by OpenGL and DirectX
G80 Architecture • Released in November of 2006, first implemented within the GeForce 8800. • First architecture to implement the CUDA framework, and first instance of a unified graphics rendering pipeline • Vertex and fragment shaders integrated as one hardware component • Programmability given over individual processing elements on the device • Scalability based on targeted consumer market • Proportions of processing cores, memory, etc.
G80 Architecture, Continued • GeForce 8800 GTX • Each “tile” represents a separate multiprocessor • Eight streaming cores per multiprocessor, 16 multiprocessors per card • Shared L1 cache per pair of tiles • Texture handling units attached to each tile • Data can be recirculated during rendering • Output data from one core becomes input data for another • Six discrete memory partitions, each 64-bit, totaling a 384-bit interface. • Bit interface and memory size vary based on the specific G80 device.
Fermi Architecture • Second-generation CUDA compute architecture, introduced in 2009, with the first Fermi-based cards shipping in early 2010. • Remained NVIDIA’s flagship architecture until the Kepler architecture was released in March 2012. • Rebranding of streaming processor cores as CUDA cores. • Overall superior design in terms of performance and computational precision
Fermi Architecture, Continued • Core count increased from 240 (the GT200’s total) to 512. • 32 cores per multiprocessor, totaling 16 streaming multiprocessors • Memory interface similar to the G80’s, hosting six 64-bit memory partitions totaling a 384-bit memory interface. • 64 KB of on-chip memory per streaming multiprocessor (shared memory plus L1 cache)
Fermi Architecture, Continued • Unified memory address space spanning the thread-local, block-shared, and global memory layers. • Enables a C++-compatible read and write mechanism via pointer handling. • Configurable shared memory: 48 KB shared memory with 16 KB of L1 cache, vs. 48 KB of L1 cache with 16 KB of shared memory (see the sketch below) • L2 cache common across all streaming multiprocessors
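As an illustration of the configurable split (a minimal sketch, not from the thesis; the kernel name myKernel is hypothetical), the CUDA runtime call cudaFuncSetCacheConfig lets a program request the larger shared-memory configuration for a particular kernel:

#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

int main(void)
{
    // Ask the runtime to favor 48 KB of shared memory (16 KB L1) for this kernel.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    float *devData;
    cudaMalloc(&devData, 256 * sizeof(float));
    myKernel<<<1, 256>>>(devData);
    cudaDeviceSynchronize();   // wait for the kernel to finish
    cudaFree(devData);
    return 0;
}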
Fermi Architecture, Continued • Added CUDA compatibility with the implementation of PTX (Parallel Thread Execution) 2.0 • Low-level equivalent of assembly language • A virtual instruction set, bridging between calls issued by the CPU and hardware instructions interpretable by the GPU’s onboard hardware. • CUDA passes high-level CUDA code to the compiler. • The compiler translates it into corresponding low-level PTX code. • Hardware instructions are then generated from that low-level code and executed by the GPU itself.
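To make the PTX stage concrete (an illustrative sketch; the file and kernel names are hypothetical, not from the thesis), nvcc can be asked to stop after generating the virtual instructions:

// add.cu – a trivial kernel used only to illustrate PTX generation
__global__ void addOne(float *x)
{
    x[threadIdx.x] += 1.0f;   // each thread increments one element
}

// Compile to PTX only, without producing a GPU binary:
//   nvcc -ptx add.cu -o add.ptx
// The resulting add.ptx file contains the low-level virtual instructions that
// the driver later translates into machine code for the specific GPU in use.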
AMD-ATI • Rival GPU manufacturer – develops its own proprietary line of graphics cards • Significant architectural differences with NVIDIA products • Evergreen chipset – ATI Radeon HD 5870 – comparison • NVIDIA’s GF100-based GTX 480 – up to 512 cores on the chip (480 enabled), 3 billion transistors • Radeon HD 5870 – 20 parallel engines × 16 cores × 5 processing elements, totaling 1,600 work units; 2.15 billion transistors
Parallel Computing Frameworks • OpenCL • Parallel computing framework similar to CUDA • Initially introduced by Apple, with development of its standards currently handled by the Khronos Group • Emphasis on portability and cross-platform implementations • Flagship parallel computing API of AMD • Targets CPUs, GPUs, and other processors across platforms, including Apple systems • Adopted by Intel, AMD, NVIDIA, ARM Holdings • CTM (Close to Metal) • Released in 2006 by AMD as a low-level API providing hardware access, similar to NVIDIA’s PTX instruction set. • Discontinued in 2008 and superseded by OpenCL for principal usage
CUDA • Programming framework by NVIDIA for performing GPGPU (General-Purpose GPU) computing • Potential for applying the parallel processing capabilities of GPU hardware to traditional software applications • NVIDIA Libraries • Ready-made libraries for implementing complex computational functions • cuFFT (NVIDIA CUDA Fast Fourier Transform), cuBLAS (NVIDIA CUDA Basic Linear Algebra Subroutines), and cuSPARSE (NVIDIA CUDA Sparse Matrix library)
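To illustrate how such a library is used (a minimal sketch, not drawn from the thesis; the data values are arbitrary), a single cuBLAS call can perform a SAXPY operation, y = alpha * x + y, on the GPU without writing a kernel by hand:

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

int main(void)
{
    const int n = 4;
    float hx[] = {1, 2, 3, 4};
    float hy[] = {10, 20, 30, 40};
    float alpha = 2.0f;

    // Allocate device memory and copy the input vectors over.
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    // The library call replaces a hand-written kernel: y = alpha * x + y.
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
    cublasDestroy(handle);

    // Copy the result back and print it (expected: 12, 24, 36, 48).
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%.1f\n", hy[i]);

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}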
Terminology - What is CUDA? • Hardware or software? ……or both. • Development framework that correlates between hardware elements on the GPU and the algorithms responsible for accessing and manipulating those elements • Expands on its original definition as a C-compatible compiler with special extensions for recognizing CUDA code
Scalability Model • Resource allocation is dynamically handled by the framework. • Scalable to different hardware devices without the need to recode an application.
Encapsulation and Abstraction • CUDA is designed as a high-level API, hiding low-level hardware details from the user. • Three major abstractions sit between the architecture and the programmer: a hierarchy of thread groups, shared memories, and barrier synchronization • Computational features implemented as functions. Input data passed as parameters. • High-level functionality allows for a low learning curve. • Allows applications to run on any GPU card with a compatible architecture. • Backward compatible with older architectures.
Threading Framework • Resource allocation handled through threading • A thread represents a single work unit or operation • Lowest level of resource allocation for CUDA • Hierarchical structure • Threads, blocks, and grids; from lowest to highest • Paralleled to multiple layers of nested execution loops
Threading Framework, Continued • Visual representation of thread hierarchy • Multiple threads embedded in blocks, multiple blocks embedded in grids • Intuitive scheme for understanding the allocation mechanism
Threading Framework, Continued • Threading syntax • Recognized by the framework for handling thread usage within an application. • Each variable provides for tracking and monitoring of individual thread activity. • These variables identify threads; resource assignment for an application is handled separately.
Threading Framework, Continued • Keywords • threadIdx.x/y/z – Index of the current thread within its block, three-dimensional. • blockIdx.x/y/z – Index of the current block within the grid, three-dimensional. • blockDim.x/y/z – Total number of threads allocated along a single dimension of a block, three-dimensional. • gridDim.x/y/z – Block count per dimension, three-dimensional • tid – Conventional variable name (not a built-in) for the identifying marker computed for each individual thread; a unique value for each allocated thread
Threading Framework, Continued • Flexibility for managing threads within an application. • Example: int tid = threadIdx.x + blockIdx.x * blockDim.x • Current block number, multiplied by the number of threads per block, added to the thread’s index within its block. • Thread IDs are managed by mapping this equation on a per-thread basis. • The equation is evaluated by all threads simultaneously. • Parallel mapping of the equation across all threads as opposed to one thread at a time.
Sample Thread Allocation • blockDim.x = 4 • blockIdx.x = {0, 1, 2, 3…} • threadIdx.x = {0, 1, 2, 3}…{0, 1, 2, 3}… • idx/tid = blockDim.x * blockIdx.x + threadIdx.x • Problem size of ten operations covered by three blocks of four threads, so two threads go to waste (see the kernel sketch below).
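A minimal kernel sketch of this mapping (function and array names are illustrative, not from the thesis); the bounds check lets the two surplus threads simply do nothing:

// Each thread computes its global index and, if it falls inside the
// problem size n, handles exactly one element.
__global__ void addVectors(const float *a, const float *b, float *c, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;   // global thread ID
    if (tid < n)                                       // surplus threads stay idle
        c[tid] = a[tid] + b[tid];
}

// Launch: three blocks of four threads cover a problem size of n = 10, e.g.:
//   addVectors<<<3, 4>>>(devA, devB, devC, 10);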
Thread Incrementation • The scheme above handles one operation per thread, but not problems larger than the allocated thread count, where each thread must increment its ID to reach further elements. • There is a right and a wrong way to increment, to avoid overlapping with IDs allocated to other threads. • Increment based on grid dimensions, not on block and thread counts • Example: tid += blockDim.x * gridDim.x • Thread ID incremented by the total number of threads in the grid (threads per block times blocks per grid), as sketched below.
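A sketch of the resulting pattern, commonly called a grid-stride loop (function and variable names are illustrative, not from the thesis):

// Every thread starts at its global index and strides forward by the total
// number of threads in the grid, so inputs larger than the launched thread
// count are covered without any two threads touching the same element.
__global__ void addVectorsLarge(const float *a, const float *b, float *c, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;   // advance by one full grid of threads
    }
}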
Compute Capability • Indicates structural limitations of hardware architectures • Determines various technical thresholds such as block and thread ceilings, etc. • Revision 1.x – Pre-Fermi architectures • Revision 2.x – Fermi architecture
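A program can query the compute capability of the installed device at run time (a minimal sketch using the CUDA runtime API; not taken from the thesis):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    // prop.major/prop.minor give the revision, e.g. 1.x (pre-Fermi) or 2.x (Fermi).
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}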
Serial vs. Parallel Distinction • Host memory vs. device memory • Each platform has a separate memory space • The host can read and write host memory only; the device can read and write device memory only • Synchronization needed between CPU and GPU activity • GPU only handles the computationally intensive calculations – the CPU still executes the serial code
Serial vs. Parallel Execution Model • Application pipeline • Represents CPU and GPU activity • Illustrates behavior of application, and invocation of GPU computations
Memory Architecture – Conceptual Overview • Three address spaces • Local memory • Unique to each thread • Shared memory • Shared among threads within a particular block • Global memory • Accessible by threads and blocks across a given grid
Memory Architecture – Hardware Level • More accurate representation of hardware-level interaction between address spaces • Two new spaces: constant memory and texture memory • Constant memory is read-only and globally accessible. • Texture memory is a subset of global memory, useful in graphics rendering • Two-dimensionality • Surface memory • Similar functionality to texture memory but different technical elements
Memory Allocation • Three basic steps of the allocation process • 1. Declare host and device memory allocations • 2. Copy input data from host memory to device memory • 3. Transfer processed data back to host upon completion • Bare memory requirements for successfully executing a GPU application • More sophisticated memory functions exist, but are geared towards more complex functionality and better performance
Memory Handling Syntax • CUDA-specific keywords for dynamically allocating memory • cudaMalloc – Allocates a dynamic reference to a location in GPU memory. Identical in function to malloc in C. • cudaMemcpy – Transfers data from CPU memory to GPU memory. Also responsible for the reverse transfer. • cudaFree – Deallocates a reference to a GPU memory location. Identical to free in C. • Basic syntax needed for handling memory allocation (see the sketch below) • Additional features available for more sophisticated applications
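A minimal sketch combining the three allocation steps with these calls (array names and sizes are illustrative, not from the thesis):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 10;
    float host[n], result[n];
    for (int i = 0; i < n; i++) host[i] = (float)i;

    // Step 1: declare and allocate device memory.
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));

    // Step 2: copy input data from host memory to device memory.
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // ... kernel launches operating on dev would go here ...

    // Step 3: transfer processed data back to the host upon completion.
    cudaMemcpy(result, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev);   // release the device allocation
    for (int i = 0; i < n; i++) printf("%.1f\n", result[i]);
    return 0;
}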
Kernels • Kernel – Executes processing instructions for data loaded onto the GPU • Executes an operation N times across N threads simultaneously • Structured similarly to a normal function, but with its own unique additions • Kernel syntax • Declaration: __global__ void example1(A, B, C) • Launch: example1<<<M, N>>>(A, B, C) • __global__ – Declaration specifier identifying a function as a GPU kernel. • void example1 – Return type and kernel name • <<<M, N>>> – M indicates the number of blocks to set aside for executing the kernel; N represents the number of threads to be allocated per block. • (A, B, C) – Argument list to be passed to the kernel (a declaration-and-launch sketch follows)
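A sketch of the declaration/launch pairing (kernel and variable names are illustrative, not from the thesis):

// Declaration: each thread scales one element of the input.
__global__ void scale(float *data, float factor, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n)
        data[tid] *= factor;
}

// Launch from host code: <<<blocks, threads per block>>> plus the argument list.
// Here 8 blocks of 128 threads (1,024 threads total) process n elements:
//   scale<<<8, 128>>>(devData, 2.0f, n);
//   cudaDeviceSynchronize();   // wait for the kernel before reading results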
Warps • During kernel execution, threads are organized into warps. • A warp is a grouping of 32 threads, all executed in parallel with one another. • Threads in a warp start at the same program address, but each has its own instruction counter and register state. • Allows parallel execution, but independent pacing of each thread in terms of completion. • Handling of threads in a warp is managed by a warp scheduler. • Two warp schedulers available per streaming multiprocessor on Fermi • Warp execution is optimized if there is no data dependence between threads. • Otherwise, dependent threads remain disabled until the required data is received from completed operations
Thread Synchronization • Separation of threads between warps can cause data to get “tangled”. • Because warps execute out of order relative to one another, completed data may not coalesce back in memory as expected. • Problem avoided by using __syncthreads() • Acts as a barrier, halting each thread until every thread in the block has reached the same point. • Ensures correct results in computations where threads depend on one another’s data, at the cost of making early-finishing threads wait (see the sketch below)
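A sketch of a typical use, staging data in shared memory before neighboring threads read it (the kernel name and the reversal operation are illustrative, not from the thesis):

// Reverse an array of up to 256 elements within a single block.
__global__ void reverseBlock(float *data, int n)
{
    __shared__ float buffer[256];     // one slot per thread in the block
    int tid = threadIdx.x;

    if (tid < n)
        buffer[tid] = data[tid];      // each thread stages one element
    __syncthreads();                  // barrier: wait until every thread in the
                                      // block has written its slot
    if (tid < n)
        data[tid] = buffer[n - 1 - tid];   // now safe to read another thread's slot
}

// Launch with one block, e.g.: reverseBlock<<<1, 256>>>(devData, 256);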
Sample Execution • Animated visualization, indicating the relation between CPU and GPU elements • Sample code obtained from: Sanders, Jason and Kandrot, Edward. CUDA By Example: An Introduction to General-Purpose GPU Programming. Boston: Pearson Education, Inc., 2011. • Highlights the activities needed to facilitate completion of a GPU-based data processing application. • Code Animation Link
Conclusion • Major topics covered: • Performance benefits of GPU accelerated applications. • Historical account of GPU technology and graphics processing. • Hands-on demonstration of CUDA, including syntax, architecture, and implementation.
Future Outlook • Promising future, with positive projected market demand for GPU technology • Growing market share for NVIDIA products • Gaming applications, scientific computing, and video editing and engineering purposes • Release of the Kepler architecture – March 2012 • Indicates further increases in performance and optimized resource consumption • Currently little documentation released in terms of technical specifications • GPU technology is sure to continue penetrating the professional market as its capabilities continue to rise.
Bibliography • 1. Meyers, Michael. Mike Meyers' CompTIA A+ Guide to Managing and Troubleshooting PCs. s.l.: McGraw-Hill Osborne Media, 2010. • 2. MAC. Hardware Canucks. [Online] November 14, 2011. [Cited: February 21, 2012.] http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/48210-intel-sandy-bridge-e-core-i7-3960x-cpu-review-3.html. • 3. Intel Corporation. Intel AVX. [Online] [Cited: February 21, 2012.] http://software.intel.com/en-us/avx/. • 4. Gupta, Shubham and Babu, M. Rajasekhara. Performance Analysis of GPU compared to Single-core and Multi-core CPU for Natural Language Applications. International Journal of Advanced Computer Science and Applications, Vol. 2, No. 5, 2011, p. 4. • 5. IAP 2009 CUDA @ MIT / 6.963. [Online] January 2009. [Cited: February 7, 2012.] https://sites.google.com/site/cudaiap2009/. • 6. Palacios, Jonathan and Triska, Josh. A Comparison of Modern GPU and CPU Architectures: And the Common Convergence of Both. [Online] March 15, 2011. [Cited: February 21, 2012.] http://web.engr.oregonstate.edu/~palacijo/cs570/final.pdf. • 7. NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. nvidia.com. [Online] 2009. [Cited: February 21, 2012.] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. • 8. NVIDIA. NVIDIA CUDA C Programming Guide 4.1. [Online] November 18, 2011. [Cited: February 17, 2012.] http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf.