GPU acceleration in Matlab
Jan Kamenický
UTIA Friday seminar, 9.11.2012
GPU acceleration
• CPU
  • fast
  • general-purpose
• GPU
  • highly parallel
  • handles specific tasks with large amounts of data
  • memory transfers needed
GPU acceleration in Matlab
• Built-in functions
  • many Matlab functions support GPU acceleration natively
• arrayfun
  • specific element-wise processing
• CUDA kernels
  • write ".cu" files
  • compile to ".ptx" (parallel thread execution)
  • run using feval
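As a rough illustration of the arrayfun route, the sketch below applies an element-wise function to data held on the GPU. The handle name scaleAndClip and the input values are made up for illustration, and the sketch assumes a release in which arrayfun accepts anonymous function handles for gpuArray inputs. Because the handle is applied to one element at a time, its body should stick to scalar arithmetic.

```matlab
% Hypothetical element-wise operation: scale a value and clip it to [0, 1].
scaleAndClip = @(v, gain) min(max(v .* gain, 0), 1);

x = gpuArray(single([-0.5 0.1 0.4 0.9]));   % transfer the input to the GPU
y = arrayfun(scaleAndClip, x, single(2));   % applied element-wise on the GPU
y = gather(y);                              % copy the result back to the workspace
```

The scalar gain is expanded across all elements of x, so no temporary array of the same size has to be created for it.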
Prerequisites
• Matlab 2010b or newer
• Parallel Computing Toolbox (check with ver)
Prerequisites
    >> ver
    -------------------------------------------------------------------------------------
    MATLAB Version 7.13.0.564 (R2011b)
    MATLAB License Number: XXXXXX
    Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
    Java VM Version: Java 1.6.0_17-b04 with Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM mixed mode
    -------------------------------------------------------------------------------------
    MATLAB                                    Version 7.13    (R2011b)
    Simulink                                  Version 7.8     (R2011b)
    Computer Vision System Toolbox            Version 4.1     (R2011b)
    Curve Fitting Toolbox                     Version 3.2     (R2011b)
    DSP System Toolbox                        Version 8.1     (R2011b)
    Data Acquisition Toolbox                  Version 3.0     (R2011b)
    Filter Design HDL Coder                   Version 2.9     (R2011b)
    Fixed-Point Toolbox                       Version 3.4     (R2011b)
    Global Optimization Toolbox               Version 3.2     (R2011b)
    Image Acquisition Toolbox                 Version 4.2     (R2011b)
    Image Processing Toolbox                  Version 7.3     (R2011b)
    MATLAB Compiler                           Version 4.16    (R2011b)
    MATLAB Distributed Computing Server       Version 5.2     (R2011b)
    Neural Network Toolbox                    Version 7.0.2   (R2011b)
    Optimization Toolbox                      Version 6.1     (R2011b)
    Parallel Computing Toolbox                Version 5.2     (R2011b)
    Partial Differential Equation Toolbox     Version 1.0.19  (R2011b)
    Signal Processing Toolbox                 Version 6.16    (R2011b)
    Simulink 3D Animation                     Version 6.0     (R2011b)
    Statistics Toolbox                        Version 7.6     (R2011b)
    Symbolic Math Toolbox                     Version 5.7     (R2011b)
    Wavelet Toolbox                           Version 4.8     (R2011b)
Prerequisites
• Matlab 2010b or newer
• Parallel Computing Toolbox (check with ver)
• NVIDIA GPU with CUDA compute capability 1.3 or higher (check with gpuDevice)
Prerequisites
    >> gpuDevice
    ans =
      parallel.gpu.CUDADevice handle
      Package: parallel.gpu
      Properties:
                          Name: 'GeForce GTX 285'
                         Index: 1
             ComputeCapability: '1.3'
                SupportsDouble: 1
                 DriverVersion: 5
            MaxThreadsPerBlock: 512
              MaxShmemPerBlock: 16384
            MaxThreadBlockSize: [512 512 64]
                   MaxGridSize: [65535 65535]
                     SIMDWidth: 32
                   TotalMemory: 2.1475e+009
                    FreeMemory: 1.9656e+009
           MultiprocessorCount: 30
                  ClockRateKHz: 1476000
                   ComputeMode: 'Default'
          GPUOverlapsTransfers: 1
        KernelExecutionTimeout: 1
              CanMapHostMemory: 1
               DeviceSupported: 1
                DeviceSelected: 1
      Methods, Events, Superclasses
Basic usage
• Send data to GPU
  • either allocate there or transfer from workspace
• Run Matlab functions
  • GPU acceleration is used automatically
• Retrieve the output data
GPUArray class
    parallel.gpu.GPUArray
• main data class for GPU computations
• stored in the GPU memory
• create directly using static methods
• copy from existing data
    gpuArray(img)
GPUArray class
• Supported data types: (u)int8, (u)int16, (u)int32, (u)int64, single, double, logical
  • determine the type using classUnderlying(gpuVar)
• Retrieve the data using
    workspaceVar = gather(gpuVar)
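The round trip described on the two GPUArray slides can be sketched as follows. The test image and all variable names are invented for illustration. The direct allocation uses the zeros(..., 'gpuArray') form of newer releases, which is an assumption about the reader's version; at the time of the talk the corresponding static method would have been along the lines of parallel.gpu.GPUArray.zeros.

```matlab
% Copy existing workspace data to the GPU (img is a made-up test image).
img  = uint8(magic(4));
gImg = gpuArray(img);

% Allocate directly in GPU memory, without creating a host copy first.
% Newer-release syntax; older releases used a static method of the
% parallel.gpu.GPUArray class for the same purpose.
gZero = zeros(4, 'single', 'gpuArray');

% The class of a GPU variable is the GPU array class itself, so the
% element type has to be queried separately.
typeImg  = classUnderlying(gImg);    % 'uint8'
typeZero = classUnderlying(gZero);   % 'single'

% gather copies the data back; the result is an ordinary array again.
back = gather(gImg);
```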
GPU accelerated Matlab functions (2012b)
    methods('parallel.gpu.GPUArray')
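Simple example
• Solve system of linear equations (Ax = b)
    A = gpuArray(A);   % transfer the matrix to the GPU
    b = gpuArray(b);   % transfer the right-hand side
    x = A\b;           % mldivide runs on the GPU because its inputs live there
    x = gather(x);     % copy the solution back to the workspace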
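Simple example
• Compute convolution using FFT
    img = gpuArray(img);
    msk = padarray(msk,size(img)-size(msk),0,'post');   % zero-pad the mask to the image size
    msk = gpuArray(msk);
    I = fft2(img);
    % msk is already padded, so the size arguments are redundant here
    % and M = fft2(msk); gives the same result
    M = fft2(msk,size(img,1),size(img,2));
    res = real(ifft2(I.*M));   % circular convolution of img and msk
    res = gather(res);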
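Padding the mask only up to the image size yields a circular convolution, so the result wraps around at the image borders. If a linear convolution is wanted, one option is to pad both transforms to the full output size. The sketch below does that on made-up data and compares the result with conv2 on the CPU; the variable names and sizes are illustrative only.

```matlab
img = rand(8, 8);            % made-up test image
msk = ones(3, 3) / 9;        % 3x3 averaging mask

outSize = size(img) + size(msk) - 1;   % size of the full linear convolution

% fft2 zero-pads its input to the requested size before transforming.
I = fft2(gpuArray(img), outSize(1), outSize(2));
M = fft2(gpuArray(msk), outSize(1), outSize(2));
resGpu = gather(real(ifft2(I .* M)));

resCpu = conv2(img, msk);    % reference: full linear convolution on the CPU
maxErr = max(abs(resGpu(:) - resCpu(:)));
```

In double precision maxErr should be at the level of rounding error; a noticeably larger value would point to a padding or size mismatch.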
Profiling
• Before optimizing (trying to use the GPU), locate promising parts of the code, such as
  • custom code consuming the majority of time
  • built-in functions that support GPUArray (consuming the majority of time)
  • large input/output data, simple data types
• Test the speed afterwards
• GPU code cannot be profiled
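Since the profiler does not see inside GPU code, one simple way to test the speed afterwards is wall-clock timing with tic and toc. The sketch below times the linear solve on the CPU and on the GPU; the problem size and variable names are arbitrary choices for illustration. The GPU timing deliberately includes both transfers, and calling gather before toc makes sure the computation has actually finished.

```matlab
n = 1000;                    % made-up problem size
A = rand(n);
b = rand(n, 1);

tic;
xCpu = A \ b;
tCpu = toc;

tic;
gA = gpuArray(A);            % host-to-device transfers are part of the cost
gb = gpuArray(b);
xGpu = gather(gA \ gb);      % gather waits for the result and copies it back
tGpu = toc;

relDiff = norm(xCpu - xGpu) / norm(xCpu);
fprintf('CPU %.3f s, GPU incl. transfers %.3f s, rel. diff %.2e\n', ...
        tCpu, tGpu, relDiff);
```

The first GPU call in a session also pays a one-time initialization cost, so it is worth running the GPU part once before taking the measurement that counts.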