GPU Programming Ryan Holsman & Robert Mosher


Presentation Transcript


  1. GPU Programming by Ryan Holsman & Robert Mosher

  2. CUDA Overview • Very high level • Gives access to the GPU via C code • You specify how many elements the kernel should process, and the CUDA driver handles the rest.
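As a minimal sketch of that idea (the kernel name `scale` and the launch sizes here are illustrative, not from the slides), the host passes the element count and a launch configuration, and the driver maps blocks and threads onto the hardware:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread scales one element of the array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;          // guard: the last block may have extra threads
}

int main() {
    const int n = 1 << 20;                 // one million elements
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // The launch configuration tells the driver how many threads to run;
    // scheduling onto the GPU's multiprocessors is handled for us.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```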

  3. CUDA Approach

  4. CUDA Memory Model • Device code can: – R/W per-thread registers – R/W per-thread local memory – R/W per-block shared memory – R/W per-grid global memory – Read-only per-grid constant memory • Host code can: – R/W per-grid global and constant memories
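A short kernel can exercise most of these spaces at once. This block-sum reduction is an illustrative sketch (the names `scale_factor` and `blockSum` are ours): indices live in per-thread registers, the tile in per-block shared memory, input and output in per-grid global memory, and a read-only coefficient in per-grid constant memory, which the host writes with `cudaMemcpyToSymbol`:

```cuda
#include <cuda_runtime.h>

__constant__ float scale_factor;           // per-grid constant memory: host writes, device reads

// Launch with 256 threads per block to match the shared-memory tile.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];            // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i is held in per-thread registers
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // read per-grid global memory
    __syncthreads();

    // Tree reduction within the block, entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0] * scale_factor;    // write per-grid global memory
}
```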

  5. Recommended Algorithms • Embarrassingly Parallel • Large Matrix Operations • Image and Video Processing • Bioinformatics • Massive Dataset and High Throughput Problems

  6. Interoperability of Code • Programming in CUDA allows for some hardware abstraction • CUDA API allows for detection of GPU configuration and graphics memory • Just-In-Time Compiling • Automatic Program Scalability
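The detection the slide mentions is done through the CUDA runtime's device-query calls; a minimal sketch of enumerating GPUs and their configuration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);            // how many CUDA devices are present?

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d); // query one device's configuration
        printf("Device %d: %s\n", d, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);
    }
    return 0;
}
```

Because CUDA code is distributed as PTX and just-in-time compiled by the driver, the same binary can run across these differing configurations.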

  7. Tesla GPGPUs “The NVIDIA Tesla™ C2050 and C2070 Computing Processors fuel the transition to parallel computing and bring the performance of a small cluster to the desktop.” • Target the workstation and cluster market • Offer ECC Memory Protection • 1.03 Tflops per card • “make supercomputing available to everyone.”

  8. Visual Super Computing • Workstation graphics cards: Quadro (NVIDIA) or FireGL (AMD) • Most Quadro cards use the same processing cores as their non-Quadro equivalents, but apply an extra set of algorithms that favor data accuracy over processing speed. • Quadro cards usually provide more memory for working with larger data sets.

  9. Proprietary Physics Engine • PhysX was designed as a middleware engine by Ageia • Purchased by NVIDIA in February 2008 • PhysX originally ran on its own PPU card • The PPU was a multi-core, MIPS-architecture-based device • NVIDIA ported the PhysX software to run on any CUDA-enabled card • GPU PhysX requires 32 or more CUDA cores and 256 MB or more of video memory

  10. PhysX Software • Autodesk Suite (3ds Max, Maya, Softimage) • DarkBASIC Pro • DX Studio • Microsoft Robotics Studio • OGRE – Open source rendering engine • The Physics Abstraction Layer

  11. GPU Clusters

  12. Super Computers • Most supercomputers currently cost $50-$250 per core, depending on architecture • An NVIDIA GTX 580 has 512 computational units for around $500 • GPUs do scientific calculations using simple linear operations, with much less complex cores. • GPUs have become great for building artificial neural networks, with the appropriate methodology.

  13. The HPU4Science Cluster • Uses a master/worker design • 3 workers, each with 3-4 high-performance GPUs • Built on a $40,000 budget • The master is a $7,000 server-grade system

  14. Master Specs • One thread per worker • Currently supports 12 workers and 16 threads • Dual Xeon 5000-series CPUs • Single GTX 295 (dual-GPU single card)

  15. Workers • Core i7 processors • 12 GB triple-channel memory • 4 × GTX 580 (1.5 GB each) • Worker 2 has 3 GTX 480s • Worker 3 has 3 GTX 285s • Workers 4, 5, and 6 are identical to worker 1 • Testing with a Tesla C1060 in progress

  16. CUDA Based Clusters • A standard cluster of computers, but also equipped with CUDA hardware • Some implementations use MPI between nodes, and each node delegates work to its GPUs • Others use a master/worker system, where a master computer delegates work to the workers; in some implementations the master launches kernels directly on the workers' GPUs
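One common shape for the MPI variant can be sketched as follows, assuming one MPI rank per node and at least one GPU per node (the kernel `work` and the data sizes are placeholders): MPI moves data between nodes, and each rank binds to a local GPU to crunch its slice.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];   // placeholder computation
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank binds to one of its node's GPUs and processes its
    // share of the problem; MPI handles inter-node communication.
    int gpus = 0;
    cudaGetDeviceCount(&gpus);
    cudaSetDevice(rank % gpus);

    const int n = 1 << 20;                    // this rank's slice of the data
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    work<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    MPI_Finalize();
    return 0;
}
```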

  17. CUDA-based clusters have a hard time keeping all elements of the cluster processing data at all times. • This is solved by overloading the processors: queuing more work than there are processors, so there is always something ready to run.
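On a single GPU, the same overloading idea can be sketched with a grid-stride loop (an illustrative pattern, not from the slides): the launch queues several times more blocks than the card has multiprocessors, so while some blocks stall on memory the scheduler swaps in ready ones.

```cuda
#include <cuda_runtime.h>

__global__ void bump(float *x, int n) {
    // Grid-stride loop: a fixed number of blocks walks the whole array,
    // so the launch can oversubscribe the hardware by any factor.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] += 1.0f;
}

int main() {
    int device = 0, sms = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, device);

    const int n = 1 << 24;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Queue several times more blocks than multiprocessors to keep
    // every SM busy while other blocks wait on memory.
    int oversubscribe = 8;
    bump<<<sms * oversubscribe, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```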

  18. Architecture Differences • A CPU is designed to process a single task as fast as possible, whereas a GPU is designed to process as many tasks as possible across a large amount of data. • GPU designs increase the number of processing units, while CPU designs invest in control logic and larger caches. • Floating-point precision on a GPU tops out at 64 bits, while current CPUs offer 128-bit-wide floating-point registers.

  19. Stream Processor vs CPU Architecture

  20. GPU Pipelines

  21. Consumer Level Comparison • Intel i7-970: six processing cores at 3.2 GHz, ~70 Gflops, ~$600 • NVIDIA GTX 580: 512 CUDA cores at 1.5 GHz, ~1500 Gflops, ~$500

  22. Energy Efficiency • Intel i7-970: 175 W, ~0.4 Gflops/W • NVIDIA GTX 580: 200 W, ~7.5 Gflops/W

  23. Professional Level Comparison • GPGPU configuration: Intel 5600-series dual-CPU 1U GPU server with low-end CPUs • 2× NVIDIA C2050 CUDA cards • Compute performance: 2.06 Tflops • Total cost: $5,237

  24. Professional Level Comparison • For comparable performance, a CPU cluster: 12× six-core Xeon X5690 3.46 GHz dual-CPU racks • Compute performance: 1.99 Tflops • Total cost: $57,254

  25. Development Costs • GPU programming requires more hardware insight • GPU programmer salaries run around $97,000 • GPU debugging is more difficult • Developing code for GPU clusters is much more complicated than for CPU clusters alone

  26. Cost Analysis • Standard developer pay starts around $60k • An MPI programmer ranges from $75k-130k • A CUDA programmer ranges from $85k-150k

  27. Costs of Developing • Higher investment in employee salary • Choosing lower-end CPUs when planning to use GPUs for processing can cut costs significantly • Xeon processors are usually 5 times the price of a consumer-level processor, and most server blades hold multiple processors

  28. More GPUs can be added to a system (4 is now common on consumer-level motherboards, and 8 is available at server level) • Energy efficiency and thermal management are much better with a GPU setup.

  29. Works Cited • Cui, X. A., & Potok, T. E. (2010). Graphics Processing Unit Enhanced Parallel Document Flocking Clustering. Office of Scientific and Technical Information, U.S. Dept. of Energy. • Gerts, D., Fredette, N., & Winberly, H. (2010). Using GPU Programming for Inverse Spectroscopy. Office of Scientific and Technical Information, U.S. Dept. of Energy. • Mueller, F. A., Potok, T. E., & Cui, X. A. (2009). A Programming Model for Massive Data Parallelism with Data Dependencies. Office of Scientific and Technical Information, U.S. Dept. of Energy. • Mueller, F. A., Potok, T. E., & Cui, X. A. (2009). GPU-Accelerated Text Mining. Office of Scientific and Technical Information, U.S. Dept. of Energy.
