Dense Matrix Algorithms CS 524 – High-Performance Computing
Definitions
• p = number of processors (0 to p-1)
• n = dimension of array/matrix (0 to n-1)
• q = number of blocks along one dimension (0 to q-1)
• tc = computation time for one flop
• ts = communication startup time
• tw = communication transfer time per word
• Interconnection network: crossbar switch with bi-directional links
Uniform Striped Partitioning (figure)
Checkerboard Partitioning (figure)
Matrix Transpose (MT)
• AT(i, j) = A(j, i) for all i and j
• Sequential run-time
  do i = 0, n-1
    do j = 0, n-1
      B(i, j) = A(j, i)
    end do
  end do
• Run time is (n² - n)/2, or roughly n²/2, element swaps when the transpose is done in place (only the elements off the main diagonal need to move)
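To make the swap count concrete, here is a minimal sequential in-place transpose sketch in C; the function name, row-major layout, and double element type are illustrative assumptions rather than anything prescribed by the slides.

#include <stddef.h>

/* In-place transpose of an n x n matrix stored row-major in a[]: only the
   (n*n - n)/2 elements above the diagonal are swapped with their mirror
   images, matching the operation count quoted above. */
void transpose_inplace(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++) {
            double tmp = a[i * n + j];
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = tmp;
        }
}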
MT – Checkerboard Partitioning (1) (figure)
MT – Checkerboard Partitioning (2) (figure)
MT – Striped Partitioning (figure)
Matrix-Vector Multiplication (MVM)
• MVM: y = Ax
  do i = 0, n-1
    y(i) = 0
    do j = 0, n-1
      y(i) = y(i) + A(i, j)*x(j)
    end do
  end do
• Sequential algorithm requires n² multiplications and n² additions
• Assuming one flop takes tc time, the sequential run time is 2tcn²
Row-wise Striping – p = n (1) (figure)
Row-wise Striping – p = n (2)
• Data partitioning: Pi has row i of A and element i of x
• Communication: each processor broadcasts its element of x (an all-to-all broadcast of single elements)
• Computation: each processor performs n multiplications and n additions
• Parallel run time: Tp = 2ntc + p(ts + tw) = 2ntc + n(ts + tw)
• Algorithm is cost-optimal, as both the parallel and serial costs are O(n²)
Row-wise Striping – p < n
• Data partitioning: each processor has n/p rows of A and the corresponding n/p elements of x
• Communication: each processor broadcasts its n/p elements of x (all-to-all broadcast)
• Computation: each processor performs n²/p multiplications and n²/p additions
• Parallel run time: Tp = 2tcn²/p + p[ts + (n/p)tw]
• Algorithm is cost-optimal for p = O(n)
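A minimal sketch of the row-wise striped MVM in C with MPI, assuming p divides n and that each rank already holds its n/p rows of A (row-major) and its n/p entries of x; the function and buffer names are illustrative. MPI_Allgather performs the all-to-all broadcast of the x pieces, after which each rank computes its own n/p entries of y. With p = n each rank holds a single row and a single x element, which recovers the case on the previous slide.

#include <mpi.h>
#include <stdlib.h>

/* Row-wise striped MVM: each rank owns nloc = n/p rows of A (row-major)
   and the matching nloc entries of x, and produces its nloc entries of y. */
void mvm_rowstriped(const double *Aloc, const double *xloc, double *yloc,
                    int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;

    /* All-to-all broadcast of the x pieces: every rank assembles the full x. */
    double *x = malloc((size_t)n * sizeof(double));
    MPI_Allgather(xloc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

    /* Local multiply: nloc rows times the full vector. */
    for (int i = 0; i < nloc; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += Aloc[i * n + j] * x[j];
        yloc[i] = s;
    }
    free(x);
}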
Checkerboard Partitioning – p = n² (1) (figure)
Checkerboard Partitioning – p = n² (2)
• Data partitioning: each processor has one element of A; only processors in the last column also have one element of x
• Communication
  • One element of x sent from each last-column processor to the diagonal processor in its row
  • Broadcast from the diagonal processor to all processors in its column
  • Global sum of the partial y values from all processors in a row to the last processor in that row
• Computation: one multiplication and one addition per processor
• Parallel run time: Tp = 2tc + 3(ts + tw)
• Algorithm is cost-optimal, as both the serial and parallel costs are O(n²)
• For a bus network, the communication time becomes 3n(ts + tw); the system is then not cost-optimal, as the cost is O(n³)
Checkerboard Partitioning – p < n²
• Data partitioning: each processor has an (n/√p) x (n/√p) block of A; processors in the last column also have n/√p elements of x
• Communication
  • n/√p elements of x sent from each last-column processor to the diagonal processor in its row
  • Broadcast from the diagonal processor to all processors in its column
  • Global sum of the partial y values from all processors in a row to the last processor in that row
• Computation: n²/p multiplications and n²/p additions per processor
• Parallel run time: Tp = 2tcn²/p + 3(ts + tw n/√p)
• Algorithm is cost-optimal only if p = O(n²)
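A sketch of the checkerboard MVM in C with MPI, following the three communication phases above; it assumes p is a perfect square, √p divides n, x lives on the last process column, and y is produced there. The communicator handling and all buffer names are illustrative choices, not part of the slides.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* Checkerboard MVM: p = q*q ranks, each holding a b x b block of A
   (b = n/q); x is distributed along the last process column (b entries per
   rank there), and y is produced there.  xloc/yloc are only referenced on
   last-column ranks. */
void mvm_checkerboard(const double *Aloc, const double *xloc, double *yloc,
                      int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int b = n / q;                         /* assumes q divides n */
    int myrow = rank / q, mycol = rank % q;

    MPI_Comm rowcomm, colcomm;
    MPI_Comm_split(comm, myrow, mycol, &rowcomm);   /* rank in rowcomm = mycol */
    MPI_Comm_split(comm, mycol, myrow, &colcomm);   /* rank in colcomm = myrow */

    double *xbuf = malloc((size_t)b * sizeof(double));
    double *ypart = calloc((size_t)b, sizeof(double));

    /* 1. Move the x segment from the last column to the diagonal process. */
    if (mycol == q - 1 && myrow == q - 1) {
        for (int i = 0; i < b; i++) xbuf[i] = xloc[i];
    } else if (mycol == q - 1) {
        MPI_Send(xloc, b, MPI_DOUBLE, myrow, 0, rowcomm);
    } else if (mycol == myrow) {
        MPI_Recv(xbuf, b, MPI_DOUBLE, q - 1, 0, rowcomm, MPI_STATUS_IGNORE);
    }

    /* 2. Broadcast the segment down each process column from the diagonal. */
    MPI_Bcast(xbuf, b, MPI_DOUBLE, mycol, colcomm);

    /* 3. Local block multiply. */
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++)
            ypart[i] += Aloc[i * b + j] * xbuf[j];

    /* 4. Sum the partial results across each row into the last column. */
    MPI_Reduce(ypart, yloc, b, MPI_DOUBLE, MPI_SUM, q - 1, rowcomm);

    free(xbuf);
    free(ypart);
    MPI_Comm_free(&rowcomm);
    MPI_Comm_free(&colcomm);
}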
Matrix-Matrix Multiplication (MMM)
• C = A x B, where A, B, and C are n x n square matrices
• Block matrix multiplication: algebraic operations on sub-matrices (blocks) of the matrices. This view of MMM aids parallelization.
  do i = 0, q-1
    do j = 0, q-1
      do k = 0, q-1
        Ci,j = Ci,j + Ai,k x Bk,j
      end do
    end do
  end do
• Number of multiplications and additions = n³ each. Sequential run time = 2tcn³
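A sequential block MMM sketch in C matching the triple loop above, with block size b = n/q (assumes q divides n), row-major storage, and C zero-initialized by the caller; the names are illustrative. It performs the same 2n³ flops as the element-wise triple loop, only reorganized block by block.

/* Sequential block MMM: C += A * B, blocked into q x q blocks of size b. */
void block_mmm(const double *A, const double *B, double *C, int n, int q)
{
    int b = n / q;
    for (int bi = 0; bi < q; bi++)
        for (int bj = 0; bj < q; bj++)
            for (int bk = 0; bk < q; bk++)
                /* C[bi][bj] += A[bi][bk] * B[bk][bj] on b x b blocks */
                for (int i = bi * b; i < (bi + 1) * b; i++)
                    for (int k = bk * b; k < (bk + 1) * b; k++) {
                        double aik = A[i * n + k];
                        for (int j = bj * b; j < (bj + 1) * b; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}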
Checkerboard Partitioning – q = √p
• Data partitioning: Pi,j has blocks Ai,j and Bi,j of A and B, each of dimension (n/√p) x (n/√p)
• Communication: each processor broadcasts its block Ai,j to all processors in its row; each processor broadcasts its block Bi,j to all processors in its column
• Computation: each processor performs n · (n/√p) · (n/√p) = n³/p multiplications and the same number of additions
• Parallel run time: Tp = 2tcn³/p + 2√p[ts + (n²/p)tw]
• Algorithm is cost-optimal only if p = O(n²)
Cannon’s Algorithm (1)
• Memory-efficient version of the checkerboard-partitioned block MMM
• At any time, each processor holds only one block of A and one block of B
• Blocks are cycled after each computation step so that after √p steps the multiplication is complete for every Ci,j
• Initial distribution of the matrices is the same as for checkerboard partitioning
• Communication
  • Initial alignment: block Ai,j is moved left by i steps (with wraparound); block Bi,j is moved up by j steps (with wraparound)
  • Subsequent √p − 1 steps: block Ai,j is moved left by one step and block Bi,j is moved up by one step (both with wraparound)
• After √p computation and communication steps, the multiplication is complete for every Ci,j
Cannon’s Algorithm (2) (figure)
Cannon’s Algorithm (3) (figure)
Cannon’s Algorithm (4)
• Communication
  • √p point-to-point communications of size n²/p along rows
  • √p point-to-point communications of size n²/p along columns
• Computation: over the √p steps, each processor performs n³/p multiplications and n³/p additions
• Parallel run time: Tp = 2tcn³/p + 2√p[ts + (n²/p)tw]
• Algorithm is cost-optimal if p = O(n²)
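A sketch of Cannon's algorithm in C with MPI, assuming p = q² ranks each holding one b x b block (b = n/√p) of A, B, and a zero-initialized C block; the buffer names, the Cartesian-communicator setup, and the row-major block layout are illustrative assumptions. The initial alignment and the per-step rolls use MPI_Sendrecv_replace, so each rank keeps only one block of A and one block of B at a time.

#include <mpi.h>
#include <math.h>

/* Cannon's algorithm: on return Aloc/Bloc are left in their final
   (shifted) positions; Cloc accumulates the local result block. */
void cannon_mmm(double *Aloc, double *Bloc, double *Cloc, int b, MPI_Comm comm)
{
    int p, rank, coords[2];
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);

    /* q x q periodic process grid; reorder = 0 keeps the caller's mapping. */
    int dims[2] = {q, q}, periods[2] = {1, 1};
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    int left, right, up, down, src, dst;
    MPI_Cart_shift(grid, 1, 1, &left, &right);   /* row neighbours    */
    MPI_Cart_shift(grid, 0, 1, &up, &down);      /* column neighbours */

    /* Initial alignment: A(i,j) moves left by i, B(i,j) moves up by j. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(Aloc, b * b, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(Bloc, b * b, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    for (int step = 0; step < q; step++) {
        /* Local multiply-accumulate: Cloc += Aloc * Bloc. */
        for (int i = 0; i < b; i++)
            for (int k = 0; k < b; k++) {
                double aik = Aloc[i * b + k];
                for (int j = 0; j < b; j++)
                    Cloc[i * b + j] += aik * Bloc[k * b + j];
            }
        /* Roll A one block left and B one block up (with wraparound). */
        MPI_Sendrecv_replace(Aloc, b * b, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(Bloc, b * b, MPI_DOUBLE, up, 0, down, 0, grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}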
Fox’s Algorithm (1)
• Another memory-efficient version of the checkerboard-partitioned block MMM
• Initial distribution of the matrices is the same as for checkerboard partitioning
• At any time, each processor holds only one block of A and one block of B
• Steps (repeated √p times)
  1. Broadcast Ai,i to all processors in row i
  2. Multiply the block of A just received with the resident block of B
  3. Send the block of B up one step (with wraparound)
  4. Select block Ai,(j+1) mod √p, where Ai,j is the block broadcast in the previous step, broadcast it to all processors in the row, and go to step 2
Fox’s Algorithm (2) (figure)
Fox’s Algorithm (3)
• Communication
  • √p broadcasts of size n²/p along rows
  • √p point-to-point communications of size n²/p along columns
• Computation: each processor performs n³/p multiplications and n³/p additions
• Parallel run time: Tp = 2tcn³/p + 2√p[ts + (n²/p)tw]
• Algorithm is cost-optimal if p = O(n²)
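A sketch of Fox's broadcast-multiply-roll scheme in C with MPI under the same assumptions as the Cannon sketch (p = q² ranks, one b x b block each, C zeroed by the caller); the names are illustrative. Step 4 of the slide corresponds to advancing the broadcast root by one column per iteration.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Fox's algorithm: broadcast a block of A along each process row,
   multiply with the resident block of B, then roll B upward. */
void fox_mmm(const double *Aloc, double *Bloc, double *Cloc, int b, MPI_Comm comm)
{
    int p, rank, coords[2];
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);

    int dims[2] = {q, q}, periods[2] = {1, 1};
    MPI_Comm grid, rowcomm;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    int myrow = coords[0], mycol = coords[1];

    /* Row communicator for the A broadcasts; column neighbours for rolling B. */
    MPI_Comm_split(grid, myrow, mycol, &rowcomm);
    int up, down;
    MPI_Cart_shift(grid, 0, 1, &up, &down);

    double *Abuf = malloc((size_t)b * b * sizeof(double));

    for (int step = 0; step < q; step++) {
        /* Steps 1 and 4: in row i, the owner of A(i, (i+step) mod q) broadcasts it. */
        int root = (myrow + step) % q;
        if (mycol == root)
            memcpy(Abuf, Aloc, (size_t)b * b * sizeof(double));
        MPI_Bcast(Abuf, b * b, MPI_DOUBLE, root, rowcomm);

        /* Step 2: multiply the received A block with the resident B block. */
        for (int i = 0; i < b; i++)
            for (int k = 0; k < b; k++) {
                double aik = Abuf[i * b + k];
                for (int j = 0; j < b; j++)
                    Cloc[i * b + j] += aik * Bloc[k * b + j];
            }

        /* Step 3: roll B up one process row (with wraparound). */
        MPI_Sendrecv_replace(Bloc, b * b, MPI_DOUBLE, up, 0, down, 0, grid, MPI_STATUS_IGNORE);
    }
    free(Abuf);
    MPI_Comm_free(&rowcomm);
    MPI_Comm_free(&grid);
}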
Solving a System of Linear Equations
• System of linear equations, Ax = b
  • A is a dense n x n matrix of coefficients
  • b is an n x 1 vector of right-hand-side values
  • x is an n x 1 vector of unknowns
• Solving for x is usually done in two stages
  • First, Ax = b is reduced to Ux = y, where U is a unit upper triangular matrix [U(i, j) = 0 if i > j, and U(i, i) = 1 for 0 ≤ i < n]. This stage is called Gaussian elimination.
  • Second, the unknowns are solved for in reverse order, starting from x(n-1). This stage is called back-substitution.
Gaussian Elimination (1)
  do k = 0, n-1
    do j = k+1, n-1
      A(k, j) = A(k, j)/A(k, k)
    end do
    y(k) = b(k)/A(k, k)
    A(k, k) = 1
    do i = k+1, n-1
      do j = k+1, n-1
        A(i, j) = A(i, j) - A(i, k)*A(k, j)
      end do
      b(i) = b(i) - A(i, k)*y(k)
      A(i, k) = 0
    end do
  end do
Gaussian Elimination (2)
• Computations
  • Approximately n²/2 divisions
  • Approximately n³/3 - n²/2 multiplications + subtractions
• Approximate sequential run time: Ts = 2tcn³/3
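The counts follow from summing the per-iteration work of the outer loop; a brief derivation (standard bookkeeping, consistent with the totals above):

\[
\text{divisions: } \sum_{k=0}^{n-1}(n-k-1)=\frac{n(n-1)}{2}\approx\frac{n^2}{2},
\qquad
\text{updates: } \sum_{k=0}^{n-1}(n-k-1)^2=\frac{n(n-1)(2n-1)}{6}\approx\frac{n^3}{3}-\frac{n^2}{2}
\]

Each update is one multiplication and one subtraction (two flops), so the elimination performs about 2n³/3 flops and Ts ≈ (2/3)tcn³.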
Striped Partitioning – p = n (1)
• Data partitioning: each processor Pi has one row of matrix A
• Communication during iteration k (outermost loop)
  • Broadcast of the active part of the kth row (size n - k - 1) from Pk to processors Pk+1 through Pn-1
• Computation during iteration k (outermost loop)
  • n - k - 1 divisions at processor Pk
  • n - k - 1 multiplications and n - k - 1 subtractions at each processor Pi (k < i < n)
• Parallel run time: Tp = (3/2)n(n-1)tc + nts + 0.5n(n-1)tw
• Algorithm is inefficient even though both the serial and parallel costs are O(n³): in iteration k, processors P0 through Pk-1 are idle, which motivates the pipelined version (a code sketch follows the figures below)
Striped Partitioning – p = n (2) (figure)
Striped Partitioning – p = n (3) (figure)
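A sketch of the synchronous row-striped elimination with p = n in C with MPI, where rank k owns row k of A and the value b(k); there is no pivoting, and the names are illustrative. The broadcast carries the active part of the pivot row together with y(k), matching the communication pattern described above.

#include <mpi.h>
#include <stdlib.h>

/* Synchronous row-striped Gaussian elimination with p = n ranks. */
void ge_striped(double *Arow, double b, double *y_out, int n, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    double *buf = malloc(((size_t)n + 1) * sizeof(double));  /* row part + y(k) */

    for (int k = 0; k < n; k++) {
        if (rank == k) {
            /* Division step on the pivot row. */
            for (int j = k + 1; j < n; j++) Arow[j] /= Arow[k];
            *y_out = b / Arow[k];
            Arow[k] = 1.0;
            for (int j = k + 1; j < n; j++) buf[j] = Arow[j];
            buf[n] = *y_out;
        }
        /* Broadcast the active part of row k (size n-k-1) plus y(k). */
        MPI_Bcast(buf + k + 1, n - k, MPI_DOUBLE, k, comm);

        if (rank > k) {
            /* Elimination step on the locally owned row. */
            double mult = Arow[k];
            for (int j = k + 1; j < n; j++) Arow[j] -= mult * buf[j];
            b -= mult * buf[n];
            Arow[k] = 0.0;
        }
    }
    free(buf);
}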
Pipelined Version (Striped Partitioning)
• In the non-pipelined (synchronous) version, the outer loop over k is executed strictly in order
  • While Pk performs the division step, all other processors are idle
  • During the elimination step, only processors Pk+1 through Pn-1 are active; the rest are idle
• In the pipelined version, the division step, the communication, and the elimination step of different iterations are overlapped
• At any moment, each processor communicates if it has data to communicate, computes if it has computations to do, or waits if it can do neither
• Cost-optimal on linear array, mesh, and hypercube interconnection networks, which have directly connected processors (a pipelined sketch follows the figures below)
Pipelined Version (2) (figure)
Pipelined Version (3) (figure)
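A minimal sketch of the pipelined idea for the p = n row-striped case, written in C with MPI as a logical chain of ranks: each rank forwards an incoming pivot row to its successor before using it, so communication overlaps with the elimination of later rows. This is one possible realization under illustrative names, not the slides' exact formulation.

#include <mpi.h>
#include <stdlib.h>

/* Pipelined row-striped Gaussian elimination with p = n ranks on a chain:
   rank k owns row k of A and the value b(k); no pivoting. */
void ge_striped_pipelined(double *Arow, double b, double *y_out, int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);                       /* assumes p == n */
    double *buf = malloc(((size_t)n + 1) * sizeof(double));

    /* Receive pivot rows 0..rank-1, forward each immediately, then eliminate. */
    for (int k = 0; k < rank; k++) {
        MPI_Recv(buf + k + 1, n - k, MPI_DOUBLE, rank - 1, k, comm, MPI_STATUS_IGNORE);
        if (rank < p - 1)
            MPI_Send(buf + k + 1, n - k, MPI_DOUBLE, rank + 1, k, comm);
        double mult = Arow[k];
        for (int j = k + 1; j < n; j++) Arow[j] -= mult * buf[j];
        b -= mult * buf[n];
        Arow[k] = 0.0;
    }

    /* Division step on the locally owned row, then pass it down the chain. */
    for (int j = rank + 1; j < n; j++) Arow[j] /= Arow[rank];
    *y_out = b / Arow[rank];
    Arow[rank] = 1.0;
    if (rank < p - 1) {
        for (int j = rank + 1; j < n; j++) buf[j] = Arow[j];
        buf[n] = *y_out;
        MPI_Send(buf + rank + 1, n - rank, MPI_DOUBLE, rank + 1, rank, comm);
    }
    free(buf);
}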
Striped Partitioning – p < n (1) (figure)
Striped Partitioning – p < n (2) (figure)
Checkerboard Partitioning – p = n² (1) (figure)
Checkerboard Partitioning – p = n² (2)
• Data partitioning: Pi,j has element A(i, j) of matrix A
• Communication during iteration k (outermost loop)
  • Broadcast of A(k, k) to processors (k, k+1) through (k, n-1) in the kth row
  • Broadcast of the updated pivot-row element A(k, j) along the jth column, for k < j < n
  • Broadcast of the multiplier A(i, k) along the ith row, for k < i < n
• Computation during iteration k (outermost loop)
  • One division at each processor Pk,j (k < j < n)
  • One multiplication and one subtraction at each processor Pi,j (k < i, j < n)
• Parallel run time: Tp ≈ 3ntc + 3n(ts + tw)
• Algorithm is cost-optimal, since the serial and parallel costs are both O(n³)
Back-Substitution
• Solution of Ux = y, where U is a unit upper triangular matrix
  do k = n-1, 0, -1
    x(k) = y(k)
    do i = k-1, 0, -1
      y(i) = y(i) - x(k)*U(i, k)
    end do
  end do
• Computation: approximately n²/2 multiplications + subtractions
• Parallel algorithm is similar to that for the Gaussian elimination stage
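A matching back-substitution sketch in C with MPI for the p = n row-striped layout, where rank i owns row i of U and the value y(i) left by the elimination stage; the names are illustrative. The owner of row k broadcasts x(k), and every earlier row updates its y value, mirroring the column-sweep loop above.

#include <mpi.h>

/* Row-striped back-substitution with p = n ranks: on return, x_out holds
   x(i) on rank i. */
void backsub_striped(const double *Urow, double y, double *x_out, int n, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    for (int k = n - 1; k >= 0; k--) {
        double xk = 0.0;
        if (rank == k) xk = y;                 /* y(k) has absorbed all later terms */
        MPI_Bcast(&xk, 1, MPI_DOUBLE, k, comm);
        if (rank == k) *x_out = xk;
        else if (rank < k) y -= xk * Urow[k];  /* y(i) = y(i) - x(k)*U(i, k) */
    }
}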