320 likes | 693 Views
Zheng Xia School of Electrical and Computer Engineering University of Ottawa, Ottawa, Canada zxia010@uottawa.ca. Parallel Solving massive linear equations with CUDA. Introduction. Specify the main issue on the problem Introduction of CLapack and CULA CPU solving routine
E N D
Zheng Xia School of Electrical and Computer Engineering University of Ottawa, Ottawa, Canada zxia010@uottawa.ca Parallel Solving massive linear equations with CUDA
Introduction • Specify the main issue on the problem • Introduction of CLapack and CULA • CPU solving routine • Design and Implement parallel solving routine • Naïve • Coalesced access • Conclusion CPU VS. GPU: Mythbusters Demo GPU versus CPU
FEM The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems It encompasses all the methods for connecting many simple element equations over many small subdomains, named finite elements, to approximate a more complex equation over a larger domain e.g. Approximation of a circle with large number of lines In this case, the deformable object will be decomposed into many tetrahedrawith spring-damper model between every nodes. FEM mesh created by an analyst prior to finding a solution to a magnetic problem using FEM software.
Main issue: Linear equations Main issue: : the force on each node : displacement for each node (it will be three-dimensional which includes ) : the top-level matrix (global matrix). As the linear equations go through the equation above, the unknown displacement/force can be simply represented by the where a is the inverse of the matrix K
How to get • Condition: non-singular matrix (full rank, e.g. square matrix) • General: Coppersmith-Winograd: • Inefficient for the high-dimension matrix • Better to avoid [1] [1].http://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm
(C)Lapack (C)Lapack • Provides many routines for solving systems of linear equations such as: • Linear least squares • Eigenvalue problems • Singular value decomposition • by applying LU, QR, Cholesky and Schurdecomposition • A free software library for numerical linear algebra • CLAPACK's goal is to provide LAPACK for someone who does not have access to a Fortran compiler
LU decomposition dgesv() • DGESV computes the solution to a real system of linear equations • A * X = B, • where A is an N-by-N matrix and X and B are N-by-NRHS matrices. • The LU decomposition with partial pivoting and row interchanges is used to factor A as • A = P * L * U, • where P is a permutation matrix, L is unit lower triangular, and U is upper triangular. Limitation: Fits for small n Constrains by the full rank Hard to be parallel processed
QR decomposition dgels() • solve overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A. • It is assumed that A has full rank. • If m>= n: find the least squares solution of an overdetermined system • minimize || B - Ax || • if m< n: find the minimum norm solution of an underdetermined system • A x = B
CULA • What is CULA • CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.
Y=Ab • Another question: • Options: • CUBLAS Library • Simple dgemv on CPU • Parallel dgemv on GPU(with and without coalesced read)
CUBLAS Library Intro The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA®CUDA™ runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Unit (GPU). Basic step to use the CUBLAS library: • Allocate matrices and vectors in the GPU • Fill the data • Call the sequence of desired CUBLAS functions • Upload the results from the GPU memory space back to the host. The CUBLAS library also provides helper functions for writing and retrieving data from the GPU.
CUBLAS<t>gemv() cublasSgemv() • This function performs the matrix-vector multiplication • y = α op ( A ) x + β y • where A is a m × n matrix stored in column-major format, x and y are vectors, and α and β are scalars. Also, for matrix A Limitation: attaches to a single GPU and does not auto-parallelize across multiple GPUs
Parallel Sgemv on GPU • Method • For each threads in blocks run a loop with: • in this case, • Kernel implementation • unsigned intid = blockIdx.x * blockDim.x + threadIdx.x; • for(unsigned int k = 0;k< Row; k++) • { • if(i < size) c[i] += a[id* Row + k] * b[k]; • }
Experimental configuration • Test Platform • Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz • GeForce GTX 295 (only use single GPU) • CUDA 5.5 • Data • rand() in C single/double precision • Matrix size: ( ) • Timing function • CPU: clock_t • GPU: CUDA event • *Time compare excludes transfer between system and GPU memory
Cont'd • CLapack • dgesv_ (N, NRHS, A, LDA, IPIV, B, LDB, INFO) • dgels_ (const char *trans, const int *M, const int *N, const int *nrhs, double *A, const int *lda, double *b, const int *ldb, double *work, const int * lwork, int *info) • CUBLAS Library • cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)
Cont'd • Simple Dgemv (CPU) • void simple_sgemv(float *A, float *B, float *C,unsigned int size) • Simple Dgemv(GPU) • __global__ void CUParaSgemv(float *a, float *b, float *c,unsigned int size
Performance CLapack
Cont'd CPU VS. GPU(and CUBLAS) Not coalesced access!
Coalesced Access threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read. MartixA: 0 1 2 3 4 5 6 7 8 9 a b Thread 0: 0 4 8 Thread 1: 1 5 9 Thread 2: 2 6 a Thread 3: 3 7 b Thread 0: 0 1 2 Thread 1: 3 4 5 Thread 2: 6 7 8 Thread 3: 9 a b
Cont’d Naïve VS. Coalesced Access
Analysis N = 1200
Cont’d N = 3000
Conclusion • The parallel compute method on GPU have a significant effect when small data has been involved in solving routine. • Data(Matrix) pre-processing also accounts for a large proportion of time.
Future work • Multicore compute on GPU • Reduce the pre-process time on data • LU, QR in CUDA version • Very Large Matrix K decomposition
References • LU Matrix Decomposition in parallel with CUDA: • http://www.noctua-blog.com/index.php/2011/04/21/lu-matrix-decomposition-in-parallel-with-cuda/ • QR Decomposition on GPUs • http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Kerr.pdf • The QR Algorithm for Finding Eigenvectors • http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Eric-Mikida-Fall-2011.pdf
Appendix A:Algorithm LU decomposition Algorithm: Pivot Pivot
QR Decomposition Methods • Gram‐Schmidt Algorithm • Givens rotation • Householder reflection
Appendix B:Other methods on GPU • CUFFT • Fast Fourier Transform (FFT) library is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and It supports the following features: • 1D, 2D, and 3D transforms of complex and real-valued data • Batch execution for doing multiple transforms of any dimension in parallel • 2D and 3D transform sizes in the range [2, 16384] in any dimension • 1D transform sizes up to 8 million elements • In place and out of place transforms for real and complex data • Double precision transforms on compatible hardware (GT200 and later GPUs) • Support for streamed execution, enabling simultaneous computation together with data movement