CS 290H Administrivia: May 14, 2008

CS 290H Administrivia: May 14, 2008 • Course project progress reports due next Wed 21 May. • Reading in Saad (second edition): Sections 13.1-13.5

Conjugate gradient iteration x0 = 0, r0 = b, d0 = r0 for k = 1, 2, 3, . . . αk = (rTk-1rk-1) / (dTk-1Adk-1) step length xk = xk-1 + αk dk-1 approx solution rk = rk-1 – αk Adk-1 residual βk = (rTk rk) / (rTk-1rk-1) improvement dk = rk + βk dk-1 search direction • One matrix-vector multiplication per iteration • Two vector dot products per iteration • Four n-vectors of working storage

Preconditioned conjugate gradient iteration x0 = 0, r0 = b, d0 = B-1r0, y0 = B-1r0 for k = 1, 2, 3, . . . αk = (yTk-1rk-1) / (dTk-1Adk-1) step length xk = xk-1 + αk dk-1 approx solution rk = rk-1 – αk Adk-1 residual yk = B-1rk preconditioning solve βk = (yTk rk) / (yTk-1rk-1) improvement dk = yk + βk dk-1 search direction • One matrix-vector multiplication per iteration • One solve with preconditioner per iteration

x A RT R Incomplete Cholesky factorization (IC, ILU) • Compute factors of A by Gaussian elimination, but ignore fill • Preconditioner B = RTR  A, not formed explicitly • Compute B-1z by triangular solves (in time nnz(A)) • Total storage is O(nnz(A)), static data structure • Either symmetric (IC) or nonsymmetric (ILU)

1 4 1 4 3 2 3 2 Incomplete Cholesky and ILU: Variants • Allow one or more “levels of fill” • unpredictable storage requirements • Allow fill whose magnitude exceeds a “drop tolerance” • may get better approximate factors than levels of fill • unpredictable storage requirements • choice of tolerance is ad hoc • Partial pivoting (for nonsymmetric A) • “Modified ILU” (MIC): Add dropped fill to diagonal of U or R • A and RTR have same row sums • good in some PDE contexts

Matrix permutations for iterative methods • Symmetric matrix permutations don’t change the convergence of unpreconditioned CG • Symmetric matrix permutations affect the quality of an incomplete factorization – poorly understood, controversial • Often banded (RCM) is best for IC(0) / ILU(0) • Often min degree & nested dissection are bad for no-fill incomplete factorization but good if some fill is allowed • Some experiments with orderings that use matrix values • e.g. “minimum discarded fill” order • sometimes effective but expensive to compute

Nonsymmetric matrix permutations for iterative methods • Dynamic permutation: ILU with row or column pivoting • E.g. ILUTP (Saad), Matlab “luinc” • More robust but more expensive than ILUT • Static permutation: Try to increase diagonal dominance • E.g. bipartite weighted matching (Duff & Koster) • Often effective; no theory for quality of preconditioner • Field is not well understood, to say the least

1 2 3 4 5 1 4 1 5 2 2 3 3 3 1 4 2 4 5 PA 5 Row permutation for heavy diagonal [Duff, Koster] 1 2 3 4 5 • Represent A as a weighted, undirected bipartite graph (one node for each row and one node for each column) • Find matching (set of independent edges) with maximum product of weights • Permute rows to place matching on diagonal • Matching algorithm also gives a row and column scaling to make all diag elts =1 and all off-diag elts <=1 1 2 3 4 5 A

Preconditioned conjugate gradient iteration x0 = 0, r0 = b, d0 = B-1r0, y0 = B-1r0 for k = 1, 2, 3, . . . αk = (yTk-1rk-1) / (dTk-1Adk-1) step length xk = xk-1 + αk dk-1 approx solution rk = rk-1 – αk Adk-1 residual yk = B-1rk preconditioning solve βk = (yTk rk) / (yTk-1rk-1) improvement dk = yk + βk dk-1 search direction • Several vector inner products per iteration (easy to parallelize) • One matrix-vector multiplication per iteration (medium to parallelize) • One solve with preconditioner per iteration (hard to parallelize)

Incomplete Cholesky and ILU: Issues • Choice of parameters • good: smooth transition from iterative to direct methods • bad: very ad hoc, problem-dependent • tradeoff: time per iteration (more fill => more time)vs # of iterations (more fill => fewer iters) • Effectiveness • condition number usually improves (only) by constant factor (except MIC for some problems from PDEs) • still, often good when tuned for a particular class of problems • Parallelism • Triangular solves are not very parallel • Reordering for parallel triangular solve by graph coloring • Dynamic coloring: ILUM (Saad)

A B-1 Sparse approximate inverses • Compute B-1 A explicitly • Minimize || A B-1 – I ||F (in parallel, by columns) • Variants: factored form of B-1, more fill, . . • Good: very parallel, seldom breaks down • Bad: effectiveness varies widely

n1/2 n1/3 Complexity of linear solvers Time to solve model problem (Poisson’s equation) on regular mesh

P0 P1 P2 P3 x P0 P1 P2 P3 y Matrix-vector product: Parallel implementation • Lay out matrix and vectors by rows • Hard part is matrix-vector producty = A*x • Algorithm Each processor j: Broadcast x(j) Compute y(j) = A(j,:)*x • May send more of x than needed • Partition / reorder matrix to reduce communication

Sparse Matrix-Vector Multiplication

Graph partitioning in theory • If G is a planar graph with n vertices, there exists a set of at most sqrt(6n) vertices whose removal leaves no connected component with more than 2n/3 vertices. (“Planar graphs have sqrt(n)-separators.”) • “Well-shaped” finite element meshes in 3 dimensions have n2/3 - separators. • Also some other classes of graphs – trees, graphs of bounded genus, chordal graphs, bounded-excluded-minor graphs, … • Mostly these theorems come with efficient algorithms, but they aren’t used much.

Graph partitioning in practice • Graph partitioning heuristics have been an active research area for many years, often motivated by partitioning for parallel computation. See CS 240A. • Some techniques: • Spectral partitioning (uses eigenvectors of Laplacian matrix of graph) • Geometric partitioning (for meshes with specified vertex coordinates) • Iterative-swapping (Kernighan-Lin, Fiduccia-Matheysses) • Breadth-first search (GLN 7.3.3, fast but dated) • Many popular modern codes (e.g. Metis, Chaco) use multilevel iterative swapping • Matlab graph partitioning toolbox: see course web page

Parallel Incomplete Cholesky and ILU: Issues • Computing the preconditioner • Parallel direct methods well developed • But IC/ILU is sparser => harder to speed up • Still, you only have to do it once • Applying the preconditioner • Triangular solves are not very parallel • Reordering by graph coloring (see example) • But the orderings are not great for convergence

CS 290H Administrivia: May 14, 2008

CS 290H Administrivia: May 14, 2008

Presentation Transcript

Conjunctive Queries, Datalog, and Recursion

AI in Digital Entertainment

Are you in the right course?

Introduction to Relational Databases and SQL

Lecture #1: Embedded Computing Systems - An Overview

2010 CBT Workshop

Netter ’ s E ssential Histology 2008

2008-09 Budget Update

CSEP 590tv: Quantum Computing

P561: Network Systems Week 6: Transport #2

DSD Annual Report 2008/09

Security and Cryptography

SE 477 Software and Systems Project Management

SE 477 Software and Systems Project Management

Natural Language Processing

NWA INTACTIX USERS GROUP JANUARY 2008

Eigen Representations: Detecting faces in images

BISL and LPF Breakfast 2008