Monte Carlo Linear Algebra Techniques and Their Parallelization Ashok Srinivasan Computer Science Florida State University www.cs.fsu.edu/~asriniva
Outline • Background • Monte Carlo Matrix-vector multiplication • MC linear solvers • Non-diagonal splitting • Dense implementation • Sparse implementation • Parallelization • Conclusions and future work
Background • MC linear solvers are old! • von Neumann and Ulam (1950) • Historically they were not competitive with deterministic techniques • Advantages of MC • Can give approximate solutions fast • Feasible in applications such as preconditioning, graph partitioning, information retrieval, and pattern recognition • Can yield selected components fast • Are very latency tolerant
Matrix-vector multiplication • Compute C^j h, where C ∈ R^{n×n} • Choose probability and weight matrices P and W such that C_ij = P_ij W_ij, and an initial distribution and weights such that h_i = p_i w_i • Take a random walk k_0 → k_1 → … → k_j based on these probabilities: sample k_0 with probability p_{k_0}, then move from k_{i-1} to k_i with probability P_{k_i k_{i-1}} • Define random variables X_i by X_0 = w_{k_0} and X_i = X_{i-1} W_{k_i k_{i-1}} • Then E(X_j δ_{i k_j}) = (C^j h)_i, so each random walk estimates the k_j-th component of C^j h • Convergence rate is independent of n
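This scheme can be sketched in a few lines of Python. Taking P proportional to |C| with column normalization and the initial distribution proportional to |h| is one convenient choice, not something the slide prescribes; the sketch also assumes h has no zero entries and C has no zero columns.

```python
import numpy as np

def mc_power_apply(C, h, j, n_walks=100_000, seed=0):
    """Estimate C^j h by random walks (sketch of the scheme above)."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    p = np.abs(h) / np.abs(h).sum()          # initial distribution: h_i = p_i w_i
    P = np.abs(C) / np.abs(C).sum(axis=0)    # P[i, k] = probability of moving k -> i
    est = np.zeros(n)
    for _ in range(n_walks):
        k = rng.choice(n, p=p)
        X = h[k] / p[k]                      # X_0 = w_{k_0}
        for _ in range(j):
            k_new = rng.choice(n, p=P[:, k])
            X *= C[k_new, k] / P[k_new, k]   # X_i = X_{i-1} W_{k_i k_{i-1}}
            k = k_new
        est[k] += X                          # deposit at the final state k_j
    return est / n_walks                     # E(X_j δ_{i k_j}) = (C^j h)_i
```

As a sanity check, the result should approach np.linalg.matrix_power(C, j) @ h as n_walks grows.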
Matrix-vector multiplication ... continued • ∑_{j=0}^m C^j h can be estimated in the same way • We will need ∑_{j=0}^m (BC)^j B h • It can be estimated using probabilities on both matrices, B and C, with the walk alternating between the two; updates occur at the odd-indexed states k_1, k_3, …, k_{2m+1} • The random walk is twice as long as in the previous case
MC linear solvers • Solve Ax = b • Split A as A = N − M • Write the fixed-point iteration x_{m+1} = N^{-1}Mx_m + N^{-1}b = Cx_m + h • If we choose x_0 = h, then x_m = ∑_{j=0}^m C^j h • Estimate this sum using the Markov chain technique mentioned earlier
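For reference, the quantity the estimator targets can be computed deterministically in a few lines; this sketch assumes the Jacobi splitting N = diag(A) introduced on the next slide, so C = N^{-1}M with M = N − A and h = N^{-1}b.

```python
import numpy as np

def partial_sum_jacobi(A, b, m):
    """Deterministic x_m = sum_{j=0}^m C^j h for the Jacobi splitting."""
    d = np.diag(A)
    C = -(A - np.diag(d)) / d[:, None]   # C = N^{-1} M, zero on the diagonal
    h = b / d
    x, term = h.copy(), h.copy()         # x_0 = h
    for _ in range(m):
        term = C @ term                  # next C^j h term
        x += term
    return x
```

Replacing the matrix-vector products by the walks of the previous slide gives the MC solver.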
Current techniques • Choose N = a diagonal matrix • This ensures efficient computation of C • C is sparse when A is sparse • Example: N = diagonal of A yields the Jacobi iteration, and the corresponding MC estimate
Properties of MC linear solvers • MC techniques estimate the result of a stationary iteration • Errors from the iterative process • Errors from MC • Reduce the error by • Variance reduction techniques • Residual correction • Choose a better iterative scheme!
Non-diagonal splitting • Observations • It is possible to construct an efficient MC technique for specific splittings, even if explicit construction of C is computationally expensive • It may be possible to represent C sparsely in implicit form, even if C itself is not sparse
Our example • Choose N to be the diagonal and sub-diagonal of A:

        [ d_1               ]            [ *                 ]
        [ s_2  d_2          ]            [ *   *             ]
N  =    [      s_3  .       ]   N^{-1} = [ *   *   *         ]
        [           .   .   ]            [ .   .   .   .     ]
        [          s_n  d_n ]            [ *   *   *   *   * ]

(N is lower bidiagonal; N^{-1} is lower triangular and, in general, dense)

• Explicitly computing C = N^{-1}M is too expensive • Compute x_m = ∑_{j=0}^m (N^{-1}M)^j N^{-1}b instead
Computing N^{-1} • Using O(n) storage and precomputation time, any element of N^{-1} can be computed in constant time • Define T(1) = 1, T(i+1) = T(i) s_{i+1}/d_{i+1} • Then N^{-1}_ij = 0 if i < j; 1/d_i if i = j; and (-1)^{i-j} (1/d_j) T(i)/T(j) otherwise • The entire N^{-1}, if needed, can be computed in O(n^2) time
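A direct transcription of this recurrence (0-based indices, assuming all d_i and the stored s_i are nonzero):

```python
import numpy as np

def make_ninv_entry(d, s):
    """O(n) precomputation; the returned closure evaluates any
    N^{-1}[i, j] in O(1). d holds the diagonal; s holds the
    subdiagonal with s[0] unused (N[i, i-1] = s[i])."""
    n = len(d)
    T = np.ones(n)                        # T[0] = 1
    for i in range(n - 1):
        T[i + 1] = T[i] * s[i + 1] / d[i + 1]
    def entry(i, j):
        if i < j:
            return 0.0
        if i == j:
            return 1.0 / d[i]
        return (-1.0) ** (i - j) / d[j] * T[i] / T[j]
    return entry
```

Filling an n × n array by calling entry(i, j) for all i ≥ j reproduces the O(n^2) construction of the full inverse.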
Dense implementation • Compute N^{-1} and store it in O(n^2) space • Choose probabilities proportional to the magnitudes of the elements • Use the alias method to sample • Precomputation time proportional to the number of elements • Constant time to generate each sample • Estimate ∑_{j=0}^m (N^{-1}M)^j N^{-1}b
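The alias method delivers exactly this trade-off: O(n) table construction per distribution and O(1) per sample. A minimal sketch of Vose's variant (the particular variant is an assumption; the slide only names the alias method):

```python
import numpy as np

def build_alias(p):
    """Build alias tables for a probability vector p (sums to 1)."""
    n = len(p)
    prob = np.zeros(n)
    alias = np.zeros(n, dtype=int)
    scaled = p * n
    small = [i for i in range(n) if scaled[i] < 1.0]
    large = [i for i in range(n) if scaled[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l     # s is topped up by donor l
        scaled[l] -= 1.0 - scaled[s]         # l gives away the excess
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:                  # leftovers are numerically 1
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng):
    """Draw one index in O(1): pick a column, then coin-flip."""
    i = rng.integers(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

One table per column of the (column-normalized) probability matrix supports the walks of the earlier slides.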
Experimental results (dense implementation) • Walk length = 2 [figure]
Sparse implementation • We cannot use O(n^2) space or time! • A sparse implementation for M is simple • Sparse representation of N^{-1}: choose P_ij = 0 if i < j, and 1/(n−j+1) otherwise, i.e., i is sampled from the uniform distribution on {j, …, n} • Choose W_ij = N^{-1}_ij / P_ij • Any W_ij can be determined in constant time • Minor modifications are needed when s_i = 0
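Combined with the O(1) entry function from the earlier sketch, one column sample costs constant time (0-based indices, so the slide's 1/(n−j+1) becomes 1/(n−j)):

```python
import numpy as np

def sample_column_uniform(j, n, ninv_entry, rng):
    """Sample row i >= j uniformly in column j of N^{-1} and return
    (i, W_ij) with W_ij = N^{-1}[i, j] / P_ij, where P_ij = 1/(n-j)."""
    i = rng.integers(j, n)                  # uniform on {j, ..., n-1}
    return i, ninv_entry(i, j) * (n - j)    # dividing by P_ij = 1/(n-j)
```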
Experimental results (sparse implementation) • Walk length = 2 [figure]
Geometric sampling • The probabilities are better approximated by geometric sampling, since the magnitudes of the column entries of N^{-1} decay geometrically • Choose P_ij = 0 if i < j, and approximately p_j(1−p_j)^{i−j} otherwise • If X ~ Uniform(0,1), then ⌈log(1−X)/log(1−p) − 1⌉ ~ Geometric(p) • Choose W_ij = N^{-1}_ij / P_ij
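The inverse-transform formula translates directly; the handling of offsets that fall past the last row (resampling, with the correspondingly truncated P_ij) is an assumption, since the slide does not specify it:

```python
import numpy as np

def sample_column_geometric(j, n, p, rng):
    """Sample row i >= j in column j with geometrically decaying
    probabilities; return (i, P_ij) for the weight W_ij = N^{-1}[i, j] / P_ij."""
    while True:
        x = rng.random()
        o = int(np.ceil(np.log1p(-x) / np.log1p(-p) - 1.0))  # ~ Geometric(p)
        o = max(o, 0)                # guard the measure-zero x = 0 edge case
        if j + o < n:                # resample if the offset leaves the matrix
            break
    P_ij = p * (1.0 - p) ** o / (1.0 - (1.0 - p) ** (n - j))  # truncated prob.
    return j + o, P_ij
```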
Experimental results (geometric sampling) • Walk length = 2 [figure]
Parallelization • MC is "embarrassingly" parallel • Identical algorithms are run independently on each processor, with the random number sequences alone being different • [Figure: Proc 1/RNG 1, Proc 2/RNG 2, Proc 3/RNG 3 produce independent estimates 23.7, 23.2, 23.6, which are averaged to give 23.5]
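The replicated pattern takes a few lines in any environment with independent random streams; a sketch using Python's multiprocessing, with a stand-in scalar estimator (a π/4 hit-or-miss computation) in place of the solver walks and spawned SeedSequences guaranteeing independent streams:

```python
import numpy as np
from multiprocessing import Pool

def one_processor(seed, n_walks=100_000):
    """Identical estimator on every processor; only the RNG stream differs."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n_walks, 2))
    return np.mean(np.sum(pts**2, axis=1) < 1.0)   # MC estimate of pi/4

if __name__ == "__main__":
    seeds = np.random.SeedSequence(2003).spawn(4)  # independent streams
    with Pool(4) as pool:
        parts = pool.map(one_processor, seeds)     # same algorithm per processor
    print(parts, np.mean(parts))                   # replicas and their average
```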
MPI vs OpenMP on Origin 2000 • Cache misses cause poor performance of the OpenMP parallelization
Conclusions and future work • Demonstrated that it is possible to have effective MC implementations with non-diagonal splittings too • Need to extend this to better iterative schemes • Non-replicated parallelization needs to be considered