3.2 Cache Oblivious Algorithms


Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

Outline • Cache complexity • Cache aware algorithms • Cache oblivious algorithms • Matrix multiplication • Matrix transposition • FFT • Conclusion

Assumptions • Only two levels in the memory hierarchy: • An ideal cache • Fully associative • Optimal replacement strategy • “Tall cache”: Z = Ω(L²) • A very large memory

An Ideal Cache Model • An ideal cache is described by a pair (Z, L) • Z: total number of words in the cache • L: number of words in one cache line

Cache Complexity • An algorithm with input size n is measured by: • Work complexity W(n) • Cache complexity Q(n; Z, L): the number of cache misses it incurs
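To make the quantity Q(n; Z, L) concrete, here is a minimal sketch of a miss counter for a fully associative cache. It is an illustration under stated assumptions, not the model from the paper: it uses LRU replacement as a stand-in for the optimal replacement strategy, and the names `LineCache` and `count_misses` are hypothetical.

```python
from collections import OrderedDict


class LineCache:
    """Fully associative cache of Z words with L-word lines.

    Assumption: LRU replacement stands in for the ideal cache's
    optimal replacement strategy.
    """

    def __init__(self, Z, L):
        self.L = L
        self.capacity = Z // L      # number of cache lines that fit
        self.lines = OrderedDict()  # resident line indices, oldest first
        self.misses = 0

    def access(self, address):
        line = address // self.L
        if line in self.lines:
            self.lines.move_to_end(line)        # hit: mark most recently used
        else:
            self.misses += 1                    # miss: fetch the whole line
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[line] = None


def count_misses(addresses, Z, L):
    """Return the number of misses for a sequence of word addresses."""
    cache = LineCache(Z, L)
    for a in addresses:
        cache.access(a)
    return cache.misses


# Word addresses touched when scanning a row-major n x n array.
n = 64
row_major = [i * n + j for i in range(n) for j in range(n)]
col_major = [i * n + j for j in range(n) for i in range(n)]
print(count_misses(row_major, Z=256, L=8))
print(count_misses(col_major, Z=256, L=8))
```

With Z = 256 and L = 8, the row-major scan misses once per line (n²/L = 512 misses), while the column-major scan misses on every one of its 4096 accesses, because 64 distinct lines are touched before any line is reused and only 32 fit.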

Cache Aware Algorithms • Contain parameters to minimize the cache complexity for a particular cache size (Z) and line length (L). • Need to adjust parameters when running on different platforms.

Example: • A blocked matrix multiplication algorithm • s is a tuning parameter to make the algorithm run fast • [Figure: an n × n matrix A partitioned into s × s blocks such as A11]

Example (2) • Cache complexity • The three s × s submatrices should fit into the cache together, so they occupy Θ(s + s²/L) cache lines • Optimal performance is obtained when s = Θ(√Z) • Z/L cache misses are needed to bring the 3 submatrices into the cache • n²/L cache misses are needed to read the n² elements • Overall it is Q(n) = Θ(1 + n²/L + n³/(L√Z))
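The blocked algorithm of this example can be sketched as follows. This is a minimal illustration, assuming square matrices stored as lists of lists; the function name `blocked_matmul` is hypothetical, and the tile size `s` is the tuning parameter that has to be chosen for each machine.

```python
def blocked_matmul(A, B, s):
    """Multiply two n x n matrices (lists of lists) using s x s tiles.

    s is the cache-aware tuning parameter: it should be chosen so that
    three s x s tiles fit in the cache at the same time.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply tile A[i0.., k0..] by tile B[k0.., j0..]
                # and accumulate into tile C[i0.., j0..].
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C


print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], s=1))
```

The result does not depend on `s`; only the memory access pattern does, which is why `s` must be re-tuned when the cache size changes.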

Outline • Cache complexity • Cache aware algorithms • Cache oblivious algorithms • Matrix multiplication • Matrix transposition and FFT • Conclusion

Cache Oblivious Algorithms • Have no parameters that depend on the hardware, such as the cache size (Z) or the cache-line length (L). • No tuning needed, platform independent. • The algorithms presented next are proved to have optimal cache complexity.

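Matrix Multiplication • A: n × m, B: m × p • Partition A or B in half along the largest of the three dimensions: • If n ≥ max(m, p): split A horizontally (by rows) • If m ≥ max(n, p): split A vertically and B horizontally (along the shared dimension) • If p ≥ max(n, m): split B vertically (by columns) • Proceed recursively until the base case of one element is reached.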

Matrix Multiplication (2) • Assume the sizes of A and B are n × 4n and 4n × n • Recursion tree: A*B = A1*B1 + A2*B2 • A1*B1 = A11*B11 + A12*B12 • A2*B2 = A21*B21 + A22*B22
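The halving rule for matrix multiplication can be sketched in a few lines. This is an illustrative version, assuming A is n × m and B is m × p, both stored as lists of lists; the names `rec_matmul` and `matmul` are hypothetical, and the recursion works on index ranges so that no submatrix is copied. Note that no cache parameter appears anywhere.

```python
def rec_matmul(A, B, C, i0, n, k0, m, j0, p):
    """Accumulate A[i0:i0+n, k0:k0+m] * B[k0:k0+m, j0:j0+p] into C."""
    if n == 1 and m == 1 and p == 1:
        # Base case: a single scalar product.
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif n >= max(m, p):
        # Rows of A are the largest dimension: split A horizontally.
        h = n // 2
        rec_matmul(A, B, C, i0, h, k0, m, j0, p)
        rec_matmul(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):
        # Shared dimension is largest: split A vertically, B horizontally.
        h = m // 2
        rec_matmul(A, B, C, i0, n, k0, h, j0, p)
        rec_matmul(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:
        # Columns of B are the largest dimension: split B vertically.
        h = p // 2
        rec_matmul(A, B, C, i0, n, k0, m, j0, h)
        rec_matmul(A, B, C, i0, n, k0, m, j0 + h, p - h)


def matmul(A, B):
    """Return A * B for a non-empty n x m matrix A and m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_matmul(A, B, C, 0, n, 0, m, 0, p)
    return C


print(matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

A practical implementation would stop the recursion at a larger base case to reduce call overhead; the one-element base case is kept here to match the description above.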

Matrix Multiplication (3) • Intuitively, once a sub problem fits into the cache, its smaller sub problems can be solved in cache with no further misses.

Matrix Multiplication (4) • Cache complexity • Matches the cache complexity of the cache-aware Block-MULT algorithm • For square matrices, the optimal cache complexity is achieved.

Matrix Transposition • If n is very large, the column-wise accesses to B cause a cache miss every time! (No spatial locality in B) • Straightforward algorithm:
    for i ← 1 to m
        for j ← 1 to n
            B(j, i) = A(i, j)
• A is m × n; B = A^T is n × m

Matrix Transposition (2) • Partition array A along the longer dimension and recursively execute the transpose function. • [Figure: A partitioned into blocks A11, A12, A21, A22; A^T assembled from the transposed blocks A11^T, A12^T, A21^T, A22^T]
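The recursive transpose can be sketched as below. This is a minimal illustration, assuming A is an m × n list of lists and the result B is a separate n × m array; the names `rec_transpose` and `transpose` are hypothetical.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of A[i0:i0+m, j0:j0+n] into B[j0:j0+n, i0:i0+m]."""
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]
    elif m >= n:
        # More rows than columns: split the row range in half.
        h = m // 2
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:
        # More columns than rows: split the column range in half.
        h = n // 2
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)


def transpose(A):
    """Return the n x m transpose of a non-empty m x n matrix A."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B


print(transpose([[1, 2, 3], [4, 5, 6]]))
```

Because the longer dimension is always the one that is halved, the subproblems stay roughly square, so both the block of A being read and the block of B being written eventually fit in cache together.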

Matrix Transposition (3) • Cache complexity • It has the optimal cache complexity • Q(m, n) = Θ(1+mn/L)

Fast Fourier Transform • Use Cooley-Tukey algorithm • Cooley-Tukey algorithms recursively re-express a DFT of a composite size n = n1n2 as: • Perform n2 DFTs of size n1. • Multiply by complex roots of unity called twiddle factors. • Perform n1 DFTs of size n2.

Assume X is a row-major n1 × n2 matrix • Steps: • Transpose X in place • Compute n2 DFTs of size n1 • Multiply by twiddle factors • Transpose X in place • Compute n1 DFTs of size n2 • Transpose X in place
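The six steps above can be sketched as follows. This is an illustration under several assumptions, not the paper's implementation: n is a power of two, n1 is chosen as the larger of the two near-equal factors, the transposes build new lists instead of working in place, and the names `six_step_fft`, `transpose_flat`, and `naive_dft` are hypothetical.

```python
import cmath


def transpose_flat(flat, rows, cols):
    """Transpose a row-major rows x cols list into a row-major cols x rows list."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]


def naive_dft(x):
    """Direct O(n^2) DFT, used as the base case and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]


def six_step_fft(x):
    """DFT of x (length a power of two) via the transpose-based Cooley-Tukey steps."""
    n = len(x)
    assert n >= 1 and (n & (n - 1)) == 0, "n must be a power of two"
    if n <= 2:
        return naive_dft(x)
    log_n = n.bit_length() - 1
    n1 = 1 << ((log_n + 1) // 2)   # n = n1 * n2 with n1 >= n2
    n2 = n // n1
    # Step 1: transpose the n1 x n2 matrix so each DFT input is contiguous.
    t = transpose_flat(list(x), n1, n2)          # now n2 x n1
    # Step 2: n2 DFTs of size n1, one per row.
    for j2 in range(n2):
        t[j2 * n1:(j2 + 1) * n1] = six_step_fft(t[j2 * n1:(j2 + 1) * n1])
    # Step 3: multiply entry (j2, i1) by the twiddle factor w_n^(i1 * j2).
    for j2 in range(n2):
        for i1 in range(n1):
            t[j2 * n1 + i1] *= cmath.exp(-2j * cmath.pi * i1 * j2 / n)
    # Step 4: transpose back to n1 x n2.
    t = transpose_flat(t, n2, n1)
    # Step 5: n1 DFTs of size n2, one per row.
    for i1 in range(n1):
        t[i1 * n2:(i1 + 1) * n2] = six_step_fft(t[i1 * n2:(i1 + 1) * n2])
    # Step 6: final transpose puts output i1 + i2 * n1 at position i2 * n1 + i1.
    return transpose_flat(t, n1, n2)


x = [complex(k, 0) for k in range(8)]
error = max(abs(a - b) for a, b in zip(six_step_fft(x), naive_dft(x)))
print(error < 1e-9)
```

For n = 8 this choice gives n1 = 4 and n2 = 2. A cache-oblivious version would use the recursive transpose for steps 1, 4, and 6 rather than the simple index shuffle shown here.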

Fast Fourier Transform • Example with n1 = 4, n2 = 2: • Transpose to select n2 DFTs of size n1 • Call FFT recursively with n1 = 2, n2 = 2 • Reach the base case and return • Multiply by the twiddle factors • Transpose to select n1 DFTs of size n2 • Transpose and return

Fast Fourier Transform • Cache complexity • Optimal for a Cooley-Tukey algorithm when n is an exact power of 2 • Q(n) = O(1 + (n/L)(1 + log_Z n))

Other Cache Oblivious Algorithms • Funnelsort • Distribution sort • LU decomposition without pivots

Questions • How large is the range of practicality of cache-oblivious algorithms? • What are the relative strengths of cache-oblivious and cache-aware algorithms?

Practicality of Cache-oblivious Algorithms • [Figure: average time to transpose an N × N matrix, divided by N²]

Practicality of Cache-oblivious Algorithms (2) • [Figure: average time taken to multiply two N × N matrices, divided by N³]

Question 2 • Do cache-oblivious algorithms perform as well as cache-aware algorithms? • FFTW library • No answer yet.

References
• Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
• Cache-Oblivious Algorithms, by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999.
• Optimizing Matrix Multiplication with a Classifier Learning System, by Xiaoming Li and María Jesús Garzarán. LCPC 2005.
