GPU Computing with Matlab® @ CBI Laboratory
Overview
• GPU History & Hardware
  - GPU History
  - CPU vs. GPU Hardware
  - Parallelism Design Points
• GPU Software Infrastructure (CUDA)
• Matlab Parallel Computing Toolbox, GPU Computing
• GPU nodes @ CBI Lab
• Examples
• Additional Features
GPU History
• 3D object model: e.g. a circle of radius R, centered at (x,y,z), color = blue
• Light source at (x,y,z)
• 2-dimensional screen
• Goal: for each pixel (X,Y) on the screen, answer the question "what is my (R,G,B) value?"
• Much parallelism is available, and the screen refresh rate << processor clock rate
• GPU model: assembly-line concept. High latency BUT high throughput.
GPU History
• Each stage from the 3D model to the 2D screen is a MATRIX MULTIPLICATION:
  - 3D to 3D: e.g. Rotation
  - 3D to 3D: e.g. Translation, Rotation, Scaling
  - 3D to 2D (screen): e.g. Perspective Projection
• Many independent computations: streams of triangles & vertices. Each triangle has 3 vertices: (x1,y1,z1), (x2,y2,z2), (x3,y3,z3).
• The more calculators, the more points we can move around in the same amount of time.
• Why must we be limited to performing a single type of function? The answer involves the start of General Purpose GPU Computing: allow the programmer to create custom functions (a.k.a. kernels) that run in parallel.
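The matrix-multiplication stages above can be sketched in Matlab as follows. This is only an illustration of the idea, not the actual graphics pipeline: the triangle coordinates, the rotation angle theta, and the screen distance d are assumed values chosen for the example.

```matlab
% Three vertices of one triangle, stored as columns (x; y; z).
% The coordinates are illustrative values, not from the slides.
V = [1 0 0;
     0 1 0;
     5 5 5];

% Rotation about the z-axis by 90 degrees (assumed angle).
theta = pi/2;
Rz = [cos(theta) -sin(theta) 0;
      sin(theta)  cos(theta) 0;
      0           0          1];

% One matrix multiplication rotates every vertex.
% Each column is processed independently of the others.
Vrot = Rz * V;

% Simple perspective projection onto a screen at distance d (assumed):
% screen coordinates are x*d/z and y*d/z for each vertex.
d = 1;
screenXY = d * Vrot(1:2, :) ./ Vrot([3 3], :);
```

Because every column of V is transformed independently, adding more vertices only adds more columns, which is the kind of work that spreads well across many simple compute units.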
GPU vs. CPU
Different goals: Fast Food Restaurant vs. Anywhere there are long lines of people waiting

GPU: Higher Latency, Exceptionally High Throughput (long lines of people waiting)
• An individual may need to wait a long time in line, but many more people go through the system during the course of a day.
• Workers are always kept busy, even if the current person, say, forgets a document and needs to wait for someone to deliver it, since there are many people waiting in line.
• More workers, smaller desks per worker.
• Use as much of the building space as possible to add workers.

CPU: Lower Latency, Good Throughput (fast food restaurant)
• An individual waits as little as possible in line.
• Workers are always kept busy by having large local caches of supplies both at the store and at the work counters.
• Subdivide 1 task into smaller tasks and increase the speed of each smaller task. (ILP & Pipelining)
• Try to find parallelism within 1 task. (Out-of-order execution)
• Try to predict what people may order to get a head start. (Branch Prediction)
• Optimizing for minimum wait time for a single user uses up resources (workers + space where you could have put more workers).
Parallelism Design Points
• Key: focus on dependency analysis
• How much of your program is independent determines the potential parallelism (Amdahl's Law), for a fixed amount of work in the parallel section.
• Gustafson's Law: do more work within the parallel sections.
• Data transfer vs. compute (Arithmetic Intensity)
  - The cost of moving the data from CPU to GPU needs to be taken into account.
  - The GPU may provide a large benefit when compute >> data I/O.
  - Going to the store to get 100 items with 10 workers: ideally you make only 1 trip for all 100 items.
  - Even if all 10 workers go to get their items in parallel, there is not much benefit if you make 10 round trips.
• Resource contention
  - Data transfer bandwidth
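Amdahl's Law and Gustafson's Law, named above, can be compared with a few lines of Matlab. The parallel fraction p and the worker count n below are assumed values for illustration only.

```matlab
% p: fraction of the run time that can be parallelized (assumed value).
% n: number of parallel workers (assumed value).
p = 0.9;
n = 10;

% Amdahl's Law: fixed problem size, only the parallel part speeds up.
amdahlSpeedup = 1 / ((1 - p) + p / n);

% Gustafson's Law: the parallel part grows with the number of workers.
gustafsonSpeedup = (1 - p) + p * n;

% Upper bound on the Amdahl speedup as n grows without limit.
amdahlLimit = 1 / (1 - p);

fprintf('Amdahl: %.2f  Gustafson: %.2f  Amdahl limit: %.2f\n', ...
        amdahlSpeedup, gustafsonSpeedup, amdahlLimit);
```

With these values the fixed-size speedup is about 5.26 and can never exceed 10, while the scaled speedup is 9.1, which is why doing more work inside the parallel sections pays off.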
Parallelism Design Points
• Resource limits (memory, disk)
• Hardware limits
  - Memory cache line sizes, memory alignment issues, disk block sizes, cache sizes, # queues, etc.
• Physical data organization (e.g. Row Major vs. Column Major)
• Conditional (if-else) minimization
  - Ideally you would hope to have 0 if statements in your functions. This is not always feasible for algorithm correctness.
• Synchronization
  - Algorithm correctness often requires some type of synchronization.
• Many more variables affect function-level, program-level, and system-level parallelism.
• A function may be highly parallelizable, but overall system parallelism may involve looking at different levels of parallelism to achieve a good solution.
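Two of the points above, physical data organization and conditional minimization, can be illustrated with a small Matlab sketch. The arrays are assumed example data, and the mask shown is just one common way to avoid an if-else, not the only one.

```matlab
% Matlab stores arrays in column-major order: the elements of a column
% are adjacent in memory, so linear indexing walks down columns first.
A = [1 2; 3 4];
columnOrder = A(:)';      % order of the elements in memory

% Branch-free alternative to an if-else inside a loop:
% keep positive values and set the rest to zero with a logical mask.
x = [-2 -1 0 1 2];
y = x .* (x > 0);
```

Here columnOrder is [1 3 2 4], and y keeps only the positive entries of x without a single if statement.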
GPU Hardware Fermi Architecture[16] Many resources are available at www.nvidia.com
GPU Software Infrastructure
CUDA: Compute Unified Device Architecture
Software stack, from the application down to the hardware:
• Applications (e.g. Matlab)
• CUDA C/C++
• NVCC Compiler + Utilities (nvprof, Visual Profiler)
• PTX: Parallel Thread eXecution assembly language (virtual machine)
• CUBIN (CUDA binary)
• CUDA Libraries
• CUDA Runtime API
• CUDA Driver
• Operating System (Linux, Windows, etc.)
• GPU card(s) & system board with CPU, buses (PCIe), ...
GPU Software Infrastructure
CUDA: Compute Unified Device Architecture
Software model: an abstraction of the hardware
Software to Hardware Mapping
• Streams -> GPU1, GPU2, …
  - Compute & data transfer queues (order guaranteed within a single stream)
• Grids -> GPU1, GPU2, …
  - Run the same kernel (a.k.a. function)
• Blocks -> SM (Streaming Multiprocessor)
  - Group of cooperating threads
  - 32 compute cores per SM in the Fermi Architecture
  - Blocks should be viewed as self-contained work units
• Warps -> SM (Streaming Multiprocessor)
  - Groups of 32 threads
  - The basic unit of execution: 32 threads running the same instruction at the same time
• Threads -> Compute Core
  - Execution context (keeps track of a core's state)
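The relationship between threads, warps, and blocks above can be worked through with simple arithmetic. The element count and the block size below are assumed launch parameters for illustration; only the warp size of 32 comes from the slide.

```matlab
% Assumed launch parameters for illustration (not from the slides).
numElements     = 1000000;  % one thread per data element
threadsPerBlock = 256;      % chosen as a multiple of the warp size
warpSize        = 32;       % threads per warp, as stated above

% Blocks needed so that every element gets a thread (round up).
numBlocks = ceil(numElements / threadsPerBlock);

% Each block is executed as a set of warps.
warpsPerBlock = threadsPerBlock / warpSize;

% The last block is only partly used, so a kernel normally checks
% that its thread index is within the data before doing any work.
idleThreads = numBlocks * threadsPerBlock - numElements;
```

With these numbers the grid has 3907 blocks of 8 warps each, and 192 threads in the last block have no element to process.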
Matlab Parallel Computing Toolbox, GPU Computing
• gpuDevice(#)
• gpuDeviceCount()
• reset(gpuDevice(#))
• wait()
• bsxfun()
• gpuArray()
• gather()
• arrayfun()
• existsOnGPU()
• parallel.gpu.CUDAKernel()
• feval
• setConstantMemory
• Many GPU-enabled built-in functions: e.g. fft, fft2, …
• Try running >> methods('gpuArray') to see the list of supported functions.
Matlab Parallel Computing Toolbox: with each release, more and more functions are enabled for transparent GPU support.
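A minimal round trip using a few of the functions listed above might look like the sketch below. It assumes the Parallel Computing Toolbox is installed; the input data is random and purely illustrative, and the CPU branch is a fallback added for this example so the script still runs on a machine without a supported GPU.

```matlab
% Minimal gpuArray round trip; requires the Parallel Computing Toolbox.
x = rand(1, 1024);                  % illustrative data on the host

if gpuDeviceCount > 0
    xg = gpuArray(x);               % copy host -> GPU
    yg = abs(fft(xg));              % fft and abs run on the GPU
    y  = gather(yg);                % copy GPU -> host
else
    y  = abs(fft(x));               % same computation on the CPU
end
```

The GPU code is the same as the CPU code apart from the gpuArray() and gather() calls, which is what makes the built-in function support convenient. Keeping the data on the GPU across several operations avoids repeated transfers.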
GPU Nodes @ CBI Lab
• Nvidia M2070: Fermi Architecture, 448 CUDA cores = 14 multiprocessors @ 32 CUDA cores per multiprocessor
• 2 modes: Interactive & Batch
• Interactive: use for development
    $ ssh -Y username@cheetah.cbi.utsa.edu
    $ qlogin -q gpu.q -l gpuonly
    $ matlab &
• Batch mode: for production runs. Job script:
    #!/bin/bash
    #$ -q gpu.q
    #$ -l gpuonly
  [Source: http://www.cbi.utsa.edu/faq/sge/gpu]
• The Matlab GUI is also available from a Windows system, using Putty + X11 forwarding with Xming: http://cbi.utsa.edu/faq/xforwarding
• On a GPU node: start Matlab with matlab &, watch the GPU with nvidia-smi and the CPU with top, and query the GPU from the Matlab prompt with >> gpuDevice(#)
Built-in function support for GPU
Quickly solving sets of linear equations has applications throughout science & engineering.
•  4x +  y - 2z = 0
•  2x - 3y + 3z = 9
• -6x - 2y +  z = 0
In matrix form, A*x = b:
• A = [4 1 -2; 2 -3 3; -6 -2 1];
• b = [0; 9; 0];
• What is x?
• x = A\b; gives x = [0.75; -2; 0.5];
Check: each left-hand side should match the right-hand side if the solution is correct.
•  4*0.75 + (-2) - (2*0.5) = 0
•  2*0.75 + (-3*-2) + (3*0.5) = 9
• -6*0.75 + (-2*-2) + 0.5 = 0
The \ operator is one of many functions that work on gpuArray data types.
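The example above can be run as the following sketch, which solves the system on the CPU, checks the residual, and repeats the solve on the GPU when one is available. It assumes the Parallel Computing Toolbox is installed; the variable names xGpu and residual are introduced here for illustration.

```matlab
A = [4 1 -2; 2 -3 3; -6 -2 1];
b = [0; 9; 0];

% Solve on the CPU and check that A*x reproduces b.
x = A \ b;
residual = norm(A*x - b);           % should be close to zero

% Same solve on the GPU: \ accepts gpuArray inputs.
if gpuDeviceCount > 0
    xGpu = gather(gpuArray(A) \ gpuArray(b));
else
    xGpu = x;                       % no GPU: reuse the CPU result
end

disp(x');
```

For a 3x3 system the transfer cost outweighs any GPU benefit; the GPU only becomes worthwhile for much larger matrices, where compute >> data I/O.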
Many Additional Features
• Using Matlab with the GPU in Batch mode via a Job Script
• Calling .cu and .ptx code directly from Matlab
• Using the GPU from C/C++ code directly with the MEX interface
  - Allows incorporating custom GPU code into Matlab, as well as using Nvidia Nsight and Nvidia Visual Profiler for custom GPU algorithm development.
Demo An example Matlab code running on a GPU system.
Appendix
• Many applications are being enabled for GPU acceleration, e.g. NAMD for Molecular Dynamics using GPU
  - http://www.nvidia.com/object/gpu-applications.html
  - http://www.nvidia.com/content/tesla/pdf/gpu-accelerated-applications-for-hpc.pdf
• C/C++/Fortran library: AccelerEyes ArrayFire
  - https://developer.nvidia.com/accelereyes-arrayfire
  - http://www.accelereyes.com/examples/case_studies
Appendix
CUDA Internals: Valgrind + KCachegrind: libcudart.so visualization
References
[1] http://www.mathworks.com/help/distcomp/release-notes.html
[2] http://www.mathworks.com/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html
[3] http://www.mathworks.com/help/distcomp/examples/illustrating-three-approaches-to-gpu-computing-the-mandelbrot-set.html
[4] http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html
[5] http://www.nvidia.com/docs/IO/105880/DS-Tesla-M-Class-Aug11.pdf
[6] http://en.wikipedia.org/wiki/Nvidia_Tesla#cite_note-11
[7] http://en.wikipedia.org/wiki/Rasterisation
[8] http://en.wikipedia.org/wiki/Perspective_projection#Perspective_projection
[9] http://en.wikipedia.org/wiki/GPGPU
[10] http://www.cbi.utsa.edu/faq/sge/gpu
[11] http://medim.sth.kth.se/6l2872/F/F11c.pdf (FFT registration)
[12] http://medim.sth.kth.se/6l2872/F/F11c.pdf
[13] http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[14] http://www.nvidia.com/docs/IO/122874/K20-and-K20X-application-performance-technical-brief.pdf
[15] http://en.wikipedia.org/wiki/Nvidia_Tesla
[16] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[17] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
[18] https://www.udacity.com/wiki/cs344/Lesson_1_-_The_GPU_Programming_Model#latency-vs-bandwidth
[19] https://www.udacity.com/wiki/cs344
[20] http://www.computingbook.org/FullText.pdf
[21] http://en.wikipedia.org/wiki/Dynamic_random-access_memory
[22] http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2009/lec08-cache.html
[23] http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/computer-architecture-2012/lec03-fastest.html
[24] http://en.wikipedia.org/wiki/Gustafson%27s_law
[25] http://archive.hpcwire.com/hpc/705814.html
[26] http://www.johngustafson.net/pubs/pub13/amdahl.pdf
[27] http://spartan.cis.temple.edu/shi/public_html/docs/amdahl/amdahl.html
[28] http://software.intel.com/en-us/articles/amdahls-law-gustafsons-trend-and-the-performance-limits-of-parallel-applications
Acknowledgements • This project received computational, research & development, software design/development support from the Computational System Biology Core/Computational Biology Initiative, funded by the National Institute on Minority Health and Health Disparities (G12MD007591) from the National Institutes of Health. URL: http://www.cbi.utsa.edu
Contact Us http://cbi.utsa.edu