Parallelizing the conjugate gradient algorithm for multilevel Toeplitz systems

Parallelizing the conjugate gradient algorithm for multilevel Toeplitz systems Jie Chen a and Tom L. H. Li b a Argonne National Laboratory b University of Missouri—St. Louis ICCS 2013

Toeplitz • What is Toeplitz? • Where does it come from? • One dimensional regular grid • Another example • The standard Laplacian. But it is often treated as a sparse matrix.

Multilevel Toeplitz • Multilevel Toeplitzis defined w.r.t. the number of levels • Where does it come from? • d-dimensional regular grid • Think about 2D Laplacian

Why Solve (Multilevel) Toeplitz Systems? • Scattered data interpolation [ Figure from http://pythonhosted.org/ ] • The covariance matrix K is multilevel Toeplitz when the Xi’s are on a regular grid • K-1 also appear in many other problems, such as maximum likelihood estimation

How to Solve? • General direct methods via matrix factorizations: O(n3) • Fast direct methods for 1-level Toepiltz: O(n2) • Levinson-Durbin, 1947, 1960 • Bareiss, 1969 • Superfast direct methods for 1-level Toeplitz: O(n logα n) • Pan, 1993 • Stewart, 2003 • Chandrasekaran et al, 2007 • Methods for specialized systems: banded, block Toeplitz, Toeplitz block, etc • General method for any-level Toeplitz: O(n log n) • Chan and Jin, 2007 • Use an iterative solver (e.g., conjugate gradient) • Matrix-vector multiplication through FFT • Circulant preconditioner We parallelize this method

Conjugate Gradient (CG) Multilevel Toeplitz Multilevel circulant

Toeplitz-Multiply • Circulant embedding (1-level case) • In case of symmetry, both T and C arerepresented by the first columns, t and c • Circulant-multiply: • λ = fft(c) • v’ = ifft( λ .* fft(y’) ) • Simple to generalize to d-level case • t is a d-dimensional tensor • Circulant embedding done along all dimensions • FFT and IFFT become multidimensional

Circulant Preconditioning Multilevel Toeplitz T  Multilevel circulantpreconditioner M data representation t (d-D tensor) data representation m (d-D tensor, same size) A 3-D Example to construct m: • Initialize m( :, :, : ) = t( :, :, : ) • s( j, :, : ) = [ (n1-j) * m( j, :, : ) + j * m( n1-j, :, : ) ] / n1, j = 0:n1-1; then copy s to m • s( :, j, : ) = [ (n2-j) * m( :, j, : ) + j * m( :, n2-j, : ) ] / n1, j = 0:n2-1; then copy s to m • s( :, :, j ) = [ (n3-j) * m( :, :, j ) + j * m( :, :, n3-j ) ] / n1, j = 0:n3-1; then copy s to m • In the 1-level case, this preconditioner yields superlinear convergence for CG • In the higher level case, superlinear convergence is lost; but still good performance in practice

Toeplitz CG

Parallelization 1: Toeplitz-Multiply Naïve approach • y’ = Embed y • z = multidimensional-FFT(y’) • w = λ .* z • v’ = multidimensional-IFFT(w) • v = Truncate v’ Less-communication approach • y’’ = Embed y along unpartitioned dims • y’ = FFT(y’’) along unpartitioned dims • Transpose y’ • z’ = Embed y’ along unpartitioned dims • z = FFT(z’) along unpartitioned dims • w = λ .* z • w’’ = IFFT(w) along unpartitioned dims • w’ = Truncate w’’ along unpartitioned dims • Transpose w’ • v’ = IFFT(w’) along unpartitioned dims • v = Truncate v’ along unpartitioned dims Red lines require MPI_Alltoall

Parallelization 1: Toeplitz-Multiply

Parallelization 1: Toeplitz-Multiply • How to partition a d-dimensional data cube? • Use an array of processes • Use a 2-dimensional grid of processes • Use a d’-dimensional grid of processes ( d’ = 1, 2, …, d ) • The larger d’ is, the more processes one can use • The larger d’ is, the smaller the total size of MPI_Alltoall. ( p = p0. p1… pd’-1 )

Parallelization 1: Toeplitz-Multiply

Parallelization 2: Eliminate Allreduce

Parallelization 2: Eliminate Allreduce Computing v = Ty and σ = (v,y) simultaneously: • Use the alltoall in the ifft of w to sum the inner product between z and w • Thus eliminating allreduce

Parallelization 2: Eliminate Allreduce

Overall Solver Performance Strong scaling Weak scaling (220 grid points per core)

Summary • Multilevel Toeplitz matrices appear in, e.g., statistics • Iterative methods have been the methods for multilevel Toeplitz systems so far • We parallelize CG: • Use a multidimensional grid of processes to partition a multidimensional data • Eliminate communication in data embedding • Eliminate allreduce communication for computing inner products • Largest experiment: Solve 1B-by-1B matrix using 1K processes in 1 minute • Other iterative methods (e.g., GMRES) can be similarly parallelized

Parallelizing the conjugate gradient algorithm for multilevel Toeplitz systems