GPU Programming
Ryan Holsman & Robert Mosher
CUDA Overview • Very high level • Gives access to the GPU via C code • You pass in the number of elements the kernel should process as part of the launch, and the CUDA driver handles the rest.
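A minimal sketch of that workflow, assuming a hypothetical element-wise kernel (all names here are illustrative, not from the slides): the launch configuration encodes how many threads cover the N elements, and the driver schedules the blocks.

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel: each thread handles one element.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the grid may be larger than n
        data[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Derive the grid size from the element count; the driver does the rest.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    addOne<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```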
CUDA Memory Model • Device code can: – R/W per-thread registers – R/W per-thread local memory – R/W per-block shared memory – R/W per-grid global memory – Read-only per-grid constant memory • Host code can: – R/W per-grid global and constant memories
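The memory spaces above map onto CUDA qualifiers roughly as follows (an illustrative sketch, not code from the slides):

```cuda
__constant__ float scale;      // per-grid constant: read-only on device, written by host

__global__ void memorySpaces(const float *in, float *out)  // in/out: per-grid global, R/W
{
    __shared__ float tile[256];                      // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in per-thread registers

    tile[threadIdx.x] = in[i];   // global -> shared
    __syncthreads();             // wait for the whole block before reading the tile

    out[i] = tile[threadIdx.x] * scale;  // read constant, write global
}

// Host side would set the constant with:
//   cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));
```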
Recommended Algorithms • Embarrassingly Parallel • Large Matrix Operations • Image and Video Processing • Bioinformatics • Massive Dataset and High Throughput Problems
Interoperability of Code • Programming in CUDA allows for some hardware abstraction • CUDA API allows for detection of GPU configuration and graphics memory • Just-In-Time Compiling • Automatic Program Scalability
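The detection of GPU configuration and graphics memory mentioned above is done through the CUDA runtime API; a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Enumerate the installed GPUs before choosing launch parameters,
// so the same binary can scale across different hardware.
int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, %d multiprocessors, %zu MB global memory\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}
```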
Tesla GPGPUs “The NVIDIA Tesla™ C2050 and C2070 Computing Processors fuel the transition to parallel computing and bring the performance of a small cluster to the desktop.” • Target the workstation and cluster market • Offer ECC Memory Protection • 1.03 Tflops per card • “make supercomputing available to everyone.”
Visual Super Computing • Workstation graphics cards: Quadro or FireGL • Most Quadro cards use the same processing cores as their non-Quadro equivalents but apply an extra set of driver algorithms that put data accuracy over processing speed. • Quadro cards usually provide more memory for working with larger data sets.
Proprietary Physics Engine • Designed as a middleware engine by Ageia • Purchased by Nvidia in February 2008 • PhysX originally ran on its own PPU card • The PPU was a multi-core MIPS-architecture device • Nvidia ported the PhysX software to run on any CUDA-enabled card • GPU PhysX requires 32 or more CUDA cores and 256 MB or more of video RAM
PhysX Software • Autodesk Suite (3ds Max, Maya, Softimage) • DarkBASIC Pro • DX Studio • Microsoft Robotics Studio • OGRE – Open source rendering engine • The Physics Abstraction Layer
Super Computers • Most supercomputers currently cost $50-$250 per core, depending on architecture • An Nvidia GTX 580 has 512 computational units for around $500 • GPUs must do scientific calculations using simple linear operations and are much less complex • GPUs have become great for designing artificial neural networks, given the appropriate methodology
The HPU4Science • Uses a master/worker design • 3 workers, each with 3-4 high-performance GPUs • Built on a $40,000 budget • Master is a $7,000 server-level system
Master Specs • One thread per worker • Currently supports 12 workers and 16 threads • Dual Xeon 5000 CPU • Single GTX 295 (Dual GPU single card)
Workers • Core i7 processors • 12 GB triple-channel memory • 4 × GTX 580 (1.5 GB each) • Worker 2 has 3 GTX 480s • Worker 3 has 3 GTX 285s • Workers 4, 5, and 6 are identical to worker 1 • Testing with a C1060 Tesla in progress
CUDA Based Clusters • A standard cluster of computers, but also equipped with CUDA hardware • Some implementations use MPI between nodes, and each node delegates work to its GPUs • Others use a master/worker system in which a master computer delegates work to the workers; in some implementations the master spawns kernels directly onto the workers' GPUs
• CUDA-based clusters have a hard time keeping all elements of the cluster processing data at all times. • This is solved by overloading the processors: pushing more work items at the cluster than there are processors.
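A hybrid MPI/CUDA node, as described above, might look like this sketch (assumed structure; the scatter/gather steps are elided as comments because the slides do not specify them):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernel each node runs on its slice of the data.
__global__ void process(float *slice, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) slice[i] *= 2.0f;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 16;               // elements per node (illustrative)
    float *h = new float[n], *d;
    // ... rank 0 would MPI_Scatter the full dataset to the nodes here ...

    // Each node delegates its slice to the local GPU.
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    process<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    // ... results would be collected back with MPI_Gather ...
    cudaFree(d);
    delete[] h;
    MPI_Finalize();
    return 0;
}
```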
Architecture Differences • A CPU is designed to process a single task as fast as possible, whereas a GPU is designed to process as many tasks as possible over a large volume of data. • GPU designs devote transistors to more processing units, while CPUs developed more control logic and cache. • Floating-point precision on a GPU tops out at 64 bits, whereas current CPUs additionally offer 80-bit extended precision and 128-bit SIMD registers.
Consumer Level Comparison • Intel i7-970 • Six Processing Cores at 3.2 GHz • ~70 Gflops • ~$600 • NVIDIA GTX 580 • 512 CUDA Cores at 1.5 GHz • ~1500 Gflops • ~$500
Energy Efficiency • Intel i7-970 • 175 Watts • 0.4 Gflops/W • NVIDIA GTX 580 • 200 Watts • 7.5 Gflops/W
Professional Level Comparison • GPGPU Configuration • Intel 5600 Series Dual CPU 1U GPU Server • Low performance CPUs • 2X NVIDIA C2050 CUDA Cards • Compute Performance: 2.06 TFlops • Total Cost: $5,237
Professional Level Comparison • For comparable performance, a CPU cluster • 12X Six-Core Xeon X5690 3.46 GHz Dual-CPU Racks • Compute Performance: 1.99 TFlops • Total Cost: $57,254
Development Costs • GPU programming requires more hardware insight • GPU programmer salaries around $97,000 • GPU debugging is more difficult • Developing code for GPU clusters is much more complicated than for CPU clusters alone
Cost Analysis • Standard developer pay starts around $60k • An MPI programmer ranges from $75k-130k • A CUDA programmer ranges from $85k-150k
Costs of Developing • Higher investment in employee salary • Choosing lower-level CPUs when planning to use GPUs for processing can cut costs significantly • Xeon processors are usually five times the price of a consumer-level processor, and most server blades hold multiple processors
• More GPUs can be added to a system (4 is now common on consumer-level motherboards and 8 is available at the server level) • Energy efficiency and thermal management are much better with a GPU setup.