3.2 Cache Oblivious Algorithms
Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
Outline • Cache complexity • Cache aware algorithms • Cache oblivious algorithms • Matrix multiplication • Matrix transposition • FFT • Conclusion
Assumption • Only two levels in the memory hierarchy: • An ideal cache • Fully associative • Optimal replacement strategy • “Tall cache”: Z = Ω(L²) • A very large memory
An Ideal Cache Model • An ideal cache is described by two parameters (Z, L) • Z: total number of words in the cache • L: number of words in one cache line
Cache Complexity • An algorithm with input size n is measured by: • Work complexity W(n) • Cache complexity Q(n; Z, L): the number of cache misses it incurs, as a function of Z and L
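To make the quantity Q(n; Z, L) concrete, here is a minimal sketch that counts misses for a trace of word addresses on a fully associative cache. The function name `count_misses` and the trace format are illustrative assumptions, and the sketch uses LRU replacement as a stand-in, whereas the ideal-cache model assumes an optimal replacement strategy.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Count cache misses for a sequence of word addresses.

    Z = cache size in words, L = line length in words (as in the slides).
    Assumption of this sketch: LRU replacement instead of the optimal
    replacement strategy of the ideal-cache model.
    """
    capacity = Z // L          # number of cache lines that fit in the cache
    cache = OrderedDict()      # line id -> None, ordered from oldest to newest
    misses = 0
    for addr in addresses:
        line = addr // L       # the cache line this word belongs to
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the line
            cache[line] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

# Sequential scan of 1024 words: one miss per line of 8 words.
scan_misses = count_misses(range(1024), Z=256, L=8)
# Stride-8 access to 1024 words: every access touches a new line.
strided_misses = count_misses(range(0, 1024 * 8, 8), Z=256, L=8)
```

The sequential scan incurs about n/L misses, while the strided trace misses on every access, which is the gap that the algorithms in the following slides try to close.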
Cache Aware Algorithms • Contain parameters that are tuned to minimize the cache complexity for a particular cache size (Z) and line length (L). • The parameters must be adjusted when running on a different platform.
Example: • A blocked matrix multiplication algorithm (Block-MULT) • s is a tuning parameter chosen to make the algorithm run fast • [Figure: n × n matrix A partitioned into s × s blocks such as A11]
Example (2) • Cache complexity • The three s × s submatrices should fit into the cache, so they occupy Θ(s + s²/L) cache lines • Optimal performance is obtained when s = Θ(√Z) • Z/L cache misses are needed to bring the 3 submatrices into the cache • n²/L cache misses are needed to read n² elements • With (n/s)³ block multiplications, the total is Q(n) = Θ(1 + n²/L + n³/(L√Z))
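The blocked algorithm can be sketched as follows. This is an illustrative version rather than the exact Block-MULT pseudocode from the paper: the name `block_mult` is an assumption, matrices are plain Python lists of lists (so the sketch shows the loop structure, not real memory layout), and `min(...)` is used so that s does not have to divide n.

```python
def block_mult(A, B, n, s):
    """Multiply two n x n matrices (lists of lists) tile by tile.

    s is the tuning parameter from the slides: each s x s tile of A, B and C
    is meant to fit in cache together, so s would be chosen near sqrt(Z).
    """
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply tile A[i0.., k0..] by tile B[k0.., j0..]
                # and accumulate into tile C[i0.., j0..].
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C

product = block_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]], n=2, s=1)
```

The result does not depend on s; only the order of memory accesses does, which is why s has to be re-tuned for each platform.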
Cache Oblivious Algorithms • Contain no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L). • No tuning needed; platform independent. • The algorithms introduced in the following slides are proved to have optimal cache complexity.
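Matrix Multiplication • Partition matrices A and B in half along the largest dimension. A: n × m, B: m × p • Case n ≥ max(m, p): split A by rows, A·B = (A1; A2)·B = (A1·B; A2·B) • Case m ≥ max(n, p): split A by columns and B by rows, A·B = (A1 A2)·(B1; B2) = A1·B1 + A2·B2 • Case p ≥ max(n, m): split B by columns, A·B = A·(B1 B2) = (A·B1 A·B2) • Proceed recursively until reaching the base case of one element.
Matrix Multiplication (2) • Assume A is n × 4n and B is 4n × n • A*B = A1*B1 + A2*B2 • A1*B1 = A11*B11 + A12*B12 • A2*B2 = A21*B21 + A22*B22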
Matrix Multiplication (3) • Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.
Matrix Multiplication (4) • Cache complexity • Q(n, m, p) = Θ(n + m + p + (nm + mp + np)/L + nmp/(L√Z)) • This matches the cache complexity of the cache-aware Block-MULT algorithm • For square matrices, the optimal cache complexity is achieved.
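The recursive scheme from the matrix multiplication slides can be sketched as below. The names `rec_mult` and `multiply` and the index-range interface are assumptions of this sketch; it works on Python lists of lists, so it illustrates the order of recursion rather than actual cache behaviour.

```python
def rec_mult(A, B, C, i0, n, k0, m, j0, p):
    """C[i0:i0+n][j0:j0+p] += A[i0:i0+n][k0:k0+m] * B[k0:k0+m][j0:j0+p].

    A is n x m and B is m x p, as in the slides. No cache parameter appears.
    """
    if n == 1 and m == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]   # base case: one element
    elif n >= max(m, p):
        h = n // 2                           # split the rows of A
        rec_mult(A, B, C, i0, h, k0, m, j0, p)
        rec_mult(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):
        h = m // 2                           # split columns of A and rows of B
        rec_mult(A, B, C, i0, n, k0, h, j0, p)
        rec_mult(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:
        h = p // 2                           # split the columns of B
        rec_mult(A, B, C, i0, n, k0, m, j0, h)
        rec_mult(A, B, C, i0, n, k0, m, j0 + h, p - h)

def multiply(A, B):
    """Return A * B for non-empty rectangular lists of lists."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C

result = multiply([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
```

A practical implementation would likely stop the recursion at a larger base case to reduce call overhead; the one-element base case here follows the slides.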
Matrix Transposition • A is m × n and B = A^T is n × m • Naive algorithm: for i ← 1 to m: for j ← 1 to n: B(j, i) = A(i, j) • If n is very large, the column-wise access of B will cause a cache miss every time! (No spatial locality in B)
Matrix Transposition (2) • Partition array A along the longer dimension and recursively execute the transpose function. • After partitioning in both dimensions, A = [A11 A12; A21 A22] and A^T = [A11^T A21^T; A12^T A22^T]
Matrix Transposition (3) • Cache complexity • It has the optimal cache complexity • Q(m, n) = Θ(1+mn/L)
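A minimal sketch of the recursive transpose described in the slides, assuming an out-of-place transpose of a non-empty m × n list of lists. The names `rec_transpose` and `transpose` are illustrative, not taken from the paper.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Set B[j][i] = A[i][j] for i in [i0, i0+m) and j in [j0, j0+n)."""
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]                # base case: one element
    elif m >= n:
        h = m // 2                           # rows are the longer dimension
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:
        h = n // 2                           # columns are the longer dimension
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)

def transpose(A):
    """Return the n x m transpose of an m x n matrix."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B

At = transpose([[1, 2, 3], [4, 5, 6]])
```

Because the longer dimension is always halved, the subproblems stay roughly square, so a subproblem of A and the matching region of B eventually fit in cache together regardless of Z and L.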
Fast Fourier Transform • Uses the Cooley-Tukey algorithm • The Cooley-Tukey algorithm recursively re-expresses a DFT of composite size n = n1·n2 as: • Perform n2 DFTs of size n1. • Multiply by complex roots of unity called twiddle factors. • Perform n1 DFTs of size n2.
Assume X is a row-major n1 × n2 matrix • Steps: • Transpose X in place • Compute n2 DFTs of size n1 • Multiply by twiddle factors • Transpose X in place • Compute n1 DFTs of size n2 • Transpose X in place
Fast Fourier Transform • Example with n1 = 4, n2 = 2: • Transpose to select n2 DFTs of size n1 • Call FFT recursively with n1 = 2, n2 = 2 • Reach the base case and return • Multiply by twiddle factors • Transpose to select n1 DFTs of size n2 • Transpose and return
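The six steps above can be sketched as follows for an input whose length is a power of two. This is an illustrative version under several assumptions: the names `fft` and `transpose_flat` are hypothetical, the transposes are done out of place on flat Python lists rather than in place, sizes 1 and 2 are handled directly as base cases, and the sign convention is e^(-2πi·jk/n).

```python
import cmath

def transpose_flat(x, rows, cols):
    """Transpose a row-major rows x cols matrix stored as a flat list."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]

def fft(x):
    """DFT of a list whose length is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    log_n = n.bit_length() - 1
    n1 = 1 << ((log_n + 1) // 2)     # n1 >= n2, both close to sqrt(n)
    n2 = n // n1
    # Step 1: view x as n1 x n2 and transpose to n2 x n1.
    t = transpose_flat(x, n1, n2)
    # Step 2: n2 DFTs of size n1, one per row.
    rows = [fft(t[r * n1:(r + 1) * n1]) for r in range(n2)]
    # Step 3: multiply entry (j2, i1) by the twiddle factor w_n^(i1 * j2).
    t = [rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
         for j2 in range(n2) for i1 in range(n1)]
    # Step 4: transpose back to n1 x n2.
    t = transpose_flat(t, n2, n1)
    # Step 5: n1 DFTs of size n2, one per row.
    rows = [fft(t[r * n2:(r + 1) * n2]) for r in range(n1)]
    t = [v for row in rows for v in row]
    # Step 6: final transpose puts output k = i1 + i2 * n1 at position k.
    return transpose_flat(t, n1, n2)

spectrum = fft([1, 2, 3, 4, 5, 6, 7, 8])
```

For n = 8 the sketch picks n1 = 4 and n2 = 2. Choosing n1 and n2 close to √n is what keeps the subproblems shrinking quickly enough to fit in cache after a few levels of recursion.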
Fast Fourier Transform • Cache complexity • Optimal for a Cooley-Tukey algorithm when n is an exact power of 2 • Q(n) = O(1 + (n/L)(1 + log_Z n))
Other Cache Oblivious Algorithms • Funnelsort • Distribution sort • LU decomposition without pivots
Questions • How large is the range of practicality of cache-oblivious algorithms? • What are the relative strengths of cache-oblivious and cache-aware algorithms?
Practicality of Cache-oblivious Algorithms • [Figure: average time to transpose an N × N matrix, divided by N²]
Practicality of Cache-oblivious Algorithms (2) • [Figure: average time taken to multiply two N × N matrices, divided by N³]
Question 2 • Do cache-oblivious algorithms perform as well as cache-aware algorithms? • FFTW library • No answer yet.
References • Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA. • Cache-Oblivious Algorithms by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999. • Optimizing Matrix Multiplication with a Classifier Learning System by Xiaoming Li and María Jesus Garzarán. LCPC 2005.