Numerical Linear Algebra: An Introduction
Levels of multiplication
• vector-vector: a[i]*b[i]
• matrix-vector: A[i,j]*b[j]
• matrix-matrix: A[i,k]*B[k,j]
Matrix-Matrix
for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    C[i,j] = 0.0;
    for (k=0; k<n; k++)
      C[i,j] = C[i,j] + A[i,k]*B[k,j];
  }
Note: O(n^3) work
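Not from the slides: a minimal runnable C version of the same triple loop, using a flat row-major array layout (the function name matmul is just illustrative).

/* Unblocked matrix-matrix multiply: C = A*B, all n-by-n,
   stored row-major in flat arrays (element (i,j) at index i*n + j). */
void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}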
Block matrix
• Serial version - same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication
• Let n be a power of two and generate a recursive algorithm; it terminates with an explicit formula for the elementary 2×2 multiplications. Allows for parallelism. Can get O(n^2.8) work.
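As a rough sketch (an assumption of this note, not taken from the slides): the plain 8-multiply recursion on quadrants, for n a power of two and a row-major layout with leading dimension lda. Strassen's variant replaces the 8 recursive block products with 7, which is where the roughly O(n^2.8) bound comes from; the function name rmm and the base-case threshold are illustrative.

/* Recursive block multiply: C = C + A*B for n-by-n blocks (n a power of two).
   Blocks live inside a larger row-major array with leading dimension lda,
   so element (i,j) of a block sits at index i*lda + j.  The caller zeroes C. */
void rmm(int n, int lda, const double *A, const double *B, double *C)
{
    if (n <= 32) {                          /* small block: plain triple loop */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*lda + j] += A[i*lda + k] * B[k*lda + j];
        return;
    }
    int h = n / 2;                          /* split each matrix into 2x2 quadrants */
    const double *A11 = A,         *A12 = A + h,
                 *A21 = A + h*lda, *A22 = A + h*lda + h;
    const double *B11 = B,         *B12 = B + h,
                 *B21 = B + h*lda, *B22 = B + h*lda + h;
    double       *C11 = C,         *C12 = C + h,
                 *C21 = C + h*lda, *C22 = C + h*lda + h;

    /* the eight elementary block products; Strassen trades these for seven */
    rmm(h, lda, A11, B11, C11);  rmm(h, lda, A12, B21, C11);
    rmm(h, lda, A11, B12, C12);  rmm(h, lda, A12, B22, C12);
    rmm(h, lda, A21, B11, C21);  rmm(h, lda, A22, B21, C21);
    rmm(h, lda, A21, B12, C22);  rmm(h, lda, A22, B22, C22);
}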
Pipeline method
• Pump data left to right and top to bottom; each process repeats:
recv(&A, P[i,j-1]);
recv(&B, P[i-1,j]);
C = C + A*B;
send(&A, P[i,j+1]);
send(&B, P[i+1,j]);
Pipeline method
[Diagram: a 4×4 grid of processes holding C[0,0] … C[3,3]; rows A[0,*] … A[3,*] stream in from the left and columns B[*,0] … B[*,3] stream in from the top.]
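A hedged MPI sketch (not from the slides) of the work done by one interior process P[i,j] in this pipeline. Boundary handling, initial staging of the A rows and B columns, and process-grid setup are omitted; the function name, message tags, and neighbor ranks (west, east, north, south) are assumptions of the sketch.

#include <mpi.h>

/* One pipeline step per iteration: receive an A block from the left
   neighbor and a B block from the neighbor above, accumulate C = C + A*B
   on the local b-by-b blocks, then pass A to the right and B downward. */
void pipeline_steps(int nsteps, int b,
                    int west, int east, int north, int south,
                    double *a, double *bblk, double *c)
{
    for (int step = 0; step < nsteps; step++) {
        MPI_Recv(a,    b*b, MPI_DOUBLE, west,  0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(bblk, b*b, MPI_DOUBLE, north, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 0; i < b; i++)         /* C = C + A*B on the local blocks */
            for (int j = 0; j < b; j++)
                for (int k = 0; k < b; k++)
                    c[i*b + j] += a[i*b + k] * bblk[k*b + j];

        MPI_Send(a,    b*b, MPI_DOUBLE, east,  0, MPI_COMM_WORLD);
        MPI_Send(bblk, b*b, MPI_DOUBLE, south, 1, MPI_COMM_WORLD);
    }
}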
Pipeline method
• A similar method works for matrix-vector multiplication, but you lose some of the cache reuse.
A sense of speed – vector ops
[Benchmark tables for vector operations not reproduced.]
Observations
• Simple do loops are not effective
• Cache and the memory hierarchy are the bottlenecks
• For better performance:
  • combine loops
  • minimize memory transfer
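As a small illustration of the "combine loops" point (an assumed example, not from the slides): fusing two passes over the same vectors means each element makes one trip through the memory hierarchy instead of two.

/* Unfused: y = y + a*x, then s = dot(x,y); x and y are streamed from memory twice. */
double axpy_then_dot(int n, double a, const double *x, double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) y[i] += a * x[i];
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* Fused: same result, roughly half the memory traffic. */
double axpy_dot_fused(int n, double a, const double *x, double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        y[i] += a * x[i];
        s += x[i] * y[i];
    }
    return s;
}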
LINPACK
• A library of subroutines for solving linear algebra problems
• Example – LU decomposition and system solve (dgefa and dgesl, respectively)
• In turn, built on the BLAS
• See netlib.org
BLAS (Basic Linear Algebra Subprograms)
• A library of subroutines designed to provide efficient computation of commonly used linear algebra operations, like dot products, matrix-vector multiplies, and matrix-matrix multiplies.
• The naming convention is not unlike other libraries: the first letter indicates precision, the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM.
• The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up is usually in Level 3.
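For example, with the C interface to the BLAS (CBLAS), a Level 1 and a Level 3 call might look like the sketch below; the matrix shapes, the row-major choice, and the wrapper name are illustrative assumptions.

#include <cblas.h>

/* DAXPY (Level 1): y = a*x + y on vectors of length n.
   DGEMM (Level 3): C = 1.0*A*B + 0.0*C with A m-by-k, B k-by-n, C m-by-n,
   all stored row-major. */
void blas_examples(int n, int m, int k, double a,
                   const double *x, double *y,
                   const double *A, const double *B, double *C)
{
    cblas_daxpy(n, a, x, 1, y, 1);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 0.0, C, n);
}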
BLAS
• Level 1 – vector-vector routines [routine table not reproduced]
• Level 2 – matrix-vector routines [routine table not reproduced]
• Level 3 – matrix-matrix routines [routine table not reproduced]
How efficient is the BLAS?
                    load/store      float ops   refs/ops
Level 1  SAXPY      3N              2N          3/2
Level 2  SGEMV      MN + N + 2M     2MN         1/2
Level 3  SGEMM      2MN + MK + KN   2MNK        2/N
Matrix-vector
read x(1:n) into fast memory
read y(1:n) into fast memory
for i = 1:n
  read row i of A into fast memory
  for j = 1:n
    y(i) = y(i) + A(i,j)*x(j)
write y(1:n) back to slow memory
Matrix-vector
• m = # slow memory refs = n^2 + 3n
• f = # arithmetic ops = 2n^2
• q = f/m ≈ 2
• Matrix-vector multiply is limited by slow memory
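A minimal runnable C version of the loop above (row-major A; the function name is illustrative):

/* Matrix-vector multiply: y = y + A*x with A n-by-n, row-major.
   Roughly 2*n^2 flops against about n^2 + 3n slow-memory references. */
void matvec(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int j = 0; j < n; j++)
            yi += A[i*n + j] * x[j];
        y[i] = yi;
    }
}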
Matrix-matrix
Matrix Multiply - unblocked
for i = 1 to n
  read row i of A into fast memory
  for j = 1 to n
    read C(i,j) into fast memory
    read column j of B into fast memory
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory
Matrix Multiply - unblocked
Number of slow memory references on unblocked matrix multiply:
m = n^3      (read each column of B n times)
  + n^2      (read each row of A once)
  + 2*n^2    (read and write each element of C once)
  = n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ≈ 2 for large n: no improvement over matrix-vector multiply
Matrix Multiply - blocked
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize
for i = 1 to N
  for j = 1 to N
    read block C(i,j) into fast memory
    for k = 1 to N
      read block A(i,k) into fast memory
      read block B(k,j) into fast memory
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    write block C(i,j) back to slow memory
Matrix Multiply - blocked
Number of slow memory references on blocked matrix multiply:
m = N*n^2    (read each block of B N^3 times: N^3 * n/N * n/N)
  + N*n^2    (read each block of A N^3 times)
  + 2*n^2    (read and write each block of C once)
  = (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ≈ n/N = b for large n
Matrix Multiply - blocked
So we can improve performance by increasing the blocksize b.
This can be much faster than matrix-vector multiply (q ≈ 2).
Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large:
3*b^2 <= M, so q ≈ b <= sqrt(M/3)
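A minimal C sketch of the blocked algorithm above, assuming (for simplicity) that n is a multiple of the block size b and that b is chosen so three b-by-b blocks fit in cache; the function name is illustrative.

/* Blocked matrix multiply: C = C + A*B, all n-by-n, row-major flat arrays.
   n is assumed to be a multiple of the block size b. */
void blocked_matmul(int n, int b, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                /* multiply one b-by-b block pair into the C(ii,jj) block */
                for (int i = ii; i < ii + b; i++)
                    for (int j = jj; j < jj + b; j++) {
                        double sum = C[i*n + j];
                        for (int k = kk; k < kk + b; k++)
                            sum += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = sum;
                    }
}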
More on BLAS
• Industry standard interface (evolving)
• Vendors and others supply optimized implementations
• History
  • BLAS1 (1970s):
    • vector operations: dot product, saxpy
    • m = 2*n, f = 2*n, q ≈ 1 or less
  • BLAS2 (mid 1980s):
    • matrix-vector operations
    • m = n^2, f = 2*n^2, q ≈ 2, less overhead
    • somewhat faster than BLAS1
More on BLAS
• BLAS3 (late 1980s):
  • matrix-matrix operations: matrix-matrix multiply, etc.
  • m >= 4*n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
• Good algorithms use BLAS3 when possible (e.g. LAPACK)
• www.netlib.org/blas, www.netlib.org/lapack
BLAS on an IBM RS6000/590 (peak speed = 266 Mflops)
[Performance plot not reproduced: BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n-vectors), compared against the peak speed.]