1 / 36

CS179: GPU Programming

CS179: GPU Programming. Lecture 1: Introduction. Today. Course summary Administrative details Brief history of GPU computing Introduction to CUDA. Course Summary. GPU Programming What: GPU: Graphics processing unit -- highly parallel APIs for accelerated hardware Why:

tosca
Download Presentation

CS179: GPU Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS179: GPU Programming Lecture 1: Introduction

  2. Today • Course summary • Administrative details • Brief history of GPU computing • Introduction to CUDA

  3. Course Summary • GPU Programming • What: • GPU: Graphics processing unit -- highly parallel • APIs for accelerated hardware • Why: • Parallel processing

  4. Course Summary:Why GPU?

  5. Course Summary:Why GPU? • How many cores, exactly? • GeForce 8800 Ultra (2007) - 128 • GeForce GTX 260 (2008) - 192 • GeForce GTX 295 (2009) - 480* • GeForce GTX 480 (2010) - 480 • GeForce GTX 590 (2011) - 1024* • GeForce GTX 690 (2012) - 3072* • GeForce GTX Titan Z (2014) - 5760* * indicates these are cards shipped with 2 GPUs in them, effectively doubling the cores

  6. Course Summary:Why GPU?

  7. Course Summary:Why GPU?

  8. Course Summary:Why GPU?

  9. Course Summary:Why GPU? • What kinds of speedups do we get?

  10. Course Summary:Overview • What will you learn? • CUDA • Parallelizing problems • Optimizing GPU code • CUDA libraries • What will we not cover? • OpenGL • C/C++

  11. Administrative:Course Details • CS179: GPU Programming • Website: http://courses.cms.caltech.edu/cs179/ • Course Instructors/TA’s: • Connor DeFanti (cdefanti@caltech.edu) • Kevin Yuh (kyuh@caltech.edu) • Overseeing Instructor: • Al Barr (barr@cs.caltech.edu) • Class time: • MWF 5:00-5:55PM

  12. Administrative:Assignments • Homework: • 8 assignments • Each worth 10% of your grade (100 pts. each) • Final Project: • 2 weeks for a custom final project • Details are up to you! • 20% of your grade (200 pts.)

  13. Administrative:Assignments • Assignments will be due Wednesday, 5PM • Extensions may be granted… • Talk to TA’s beforehand! • Office Hours: located in 104 ANB • Connor: Tuesday, 8-10PM • Kevin: Monday, 8-10PM

  14. Administrative:Assignments • Doing the assignments: • CUDA-capable machine required! • Must have NVIDIA GPU • Setting up environment can be tricky • Three options: • DIY with your own setup • Use provided instructions with given environment • Use lab machines

  15. Administrative:Assignments • Submitting assignments: • Due date: Wednesday 5PM • Submit assignment as .tar/.zip, or similar • Include README file! • Name, compilation instructions, answers to conceptual questions on sets, etc. • Submit all assignments to cdefanti@caltech.edu • Receiving graded assignments: • Assignments should get back 1 week after submission • We will email you back with grade and comments

  16. GPU History:Early Days • Before GPUs: • All graphics run on the CPU • Each pixel drawn in series • Super slow! (CS171, anyone?) • Early GPUs: • 1980s: Blitters (fixed image sprites) allowed fast image memory transfer • 1990s: Introduction of DirectX and OpenGL • Brought fixed function pipeline for rendering

  17. GPU History:Early Days • Fixed Function Pipeline: • “Fixed” OpenGL states • Phong or Gouraud shading? • Render as wireframe or solid? • Very limiting, made early games look similar

  18. GPU History:Shaders • Early 2000’s: shaders introduced • Allow for much more interesting shading models

  19. GPU History:Shaders • Shaders: expanded world of rendering greatly • Vertex shaders: apply operations per-vertex • Fragment shaders: apply operations per-pixel • Geometry shaders: apply operations to add new geometry

  20. GPU History:Shaders • These are great when dealing with graphics data… • Vertices, faces, pixels, etc. • What about general purpose? • Can trick GPU • DirectX “compute” shader may be an option • Anything slicker?

  21. GPU History:CUDA • 2007: NVIDIA introduces CUDA • C-style programming API for GPU • Easier to do GPGPU • Easier memory handling • Better tools, libraries, etc.

  22. GPU History:CUDA • New advantages on the table: • Scattered reads • Shared memory • Faster memory transfer to/from the GPU

  23. GPU History:Other APIs • Plenty of other API’s exist for GPGPU • OpenCL/WebCL • DirectX Compute Shader • Other

  24. Using the GPU • Highly parallelizable parts of computational problems

  25. A simple problem… • Add two arrays • A[] + B[] -> C[] • On the CPU: (allocate memory for C) For (i from 1 to array length) C[i] <- A[i] + B[i] • Operates sequentially… can we do better?

  26. A simple problem… • On the CPU (multi-threaded): (allocate memory for C) Create # of threads equal to number of cores on processor (around 2, 4, perhaps 8) (Allocate portions of A, B, C to each thread...) ... In each thread, For (i from beginning region of thread) C[i] <- A[i] + B[i] //lots of waiting involved for memory reads, writes, ... Wait for threads to synchronize... • Slightly faster – 2-8x (slightly more with other tricks)

  27. A simple problem… • How many threads? How does performance scale? • Context switching: High penalty on the CPU!

  28. A simple problem… • On the GPU: (allocate memory for A, B, C on GPU) Create the “kernel” – each thread will perform one (or a few) additions Specify the following kernel operation: For (all i‘s assigned to this thread) C[i] <- A[i] + B[i] Start ~20000 (!) threads Wait for threads to synchronize... • Speedup: Very high! (e.g. 10x, 100x)

  29. GPU: Strengths Revealed • Parallelism • Low context switch penalty! • We can “cover up” performance loss by creating more threads!

  30. GPU Computing: Step by Step • Setup inputs on the host (CPU-accessible memory) • Allocate memory for inputs on the GPU • Copy inputs from host to GPU • Allocate memory for outputs on the host • Allocate memory for outputs on the GPU • Start GPU kernel • Copy output from GPU to host • (Copying can be asynchronous)

  31. GPU: Internals • Blocks: Groups of threads • Can cooperate via shared memory • Can synchronize with each other • Max size: 512, 1024 threads (hardware-dependent) • Warps: Subgroups of threads within block • Execute “in-step” • Size: 32 threads

  32. GPU: Internals Block SIMD processing unit Warp

  33. The Kernel • Our “parallel” function • Simple implementation (won’t work for lots of values)

  34. Indexing • Can get a block ID and thread ID within the block: • Unique thread ID!

  35. Calling the Kernel

  36. Calling the Kernel (2)

More Related