Explore cache-oblivious algorithms and their application in static searches, matrix multiplication, and more. Learn about the history, optimization techniques, and future work in this field.
Cache Oblivious Algorithms Theory & Practice Static Research Proficiency Examination Piyush Kumar Department of Computer Science Advisor: Joseph S.B. Mitchell
CO Algorithms: Brief History • Frigo, Leiserson, Prokop, Ramachandran (FOCS 99): Cache Oblivious Algorithms • Harald Prokop’s Thesis • Bender, Demaine, Farach-Colton (FOCS 00): Cache Oblivious B-Trees • … • Arge, Bender, Demaine et al. (STOC 02): CO Priority Queue
Talk Outline… • Motivation: Matrix Multiplication/Transposition; Static Searches in Balanced Binary Trees • The Model • CO-Sorting • Some Analysis • CO-Sorting Experiments • Do’s and Don’ts of the model • Future work
Workstations • SUN UltraSparc 2: UltraSparc, 16 kB L1, 512 kB L2 • SGI Visual Workstation 540: Quad-Pentium III, 32 kB L1, 1024 kB L2 • Dell Precision: Dual-Pentium III, 32 kB L1, 512 kB L2 • IBM ThinkPad 600: Pentium II, 32 kB L1, 256 kB L2 • Compaq Presario: AMD K6-III, 64 kB L1, 256 kB L2, 1024 kB L3 • How can we write portable code that runs efficiently on different multilevel caching architectures?
Matrix Multiplication (MM) • $C = A \times B$, where $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$
Cache-Aware MM [HK81] (n × n matrices tiled into s × s blocks)
BLOCK-MULT(A, B, C, n)
1 for i ← 1 to n/s
2   do for j ← 1 to n/s
3     do for k ← 1 to n/s
4       do ORD-MULT(A_ik, B_kj, C_ij, s)
Cache-Aware MM: Oracle?! • Tune s so that $A_{ik}$, $B_{kj}$, and $C_{ij}$ just fit into cache: $s = \Theta(\sqrt{Z})$ • If $n > s$, then $Q(n) = \Theta\left((n/s)^3 \cdot (s^2/L)\right) = \Theta\left(n^3 / (L\sqrt{Z})\right)$ • Optimal [HK81]
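The blocked loop nest can be sketched in Python as follows. This is only an illustration under stated assumptions: the name `block_mult`, the list-of-lists matrix representation, and the handling of tile sizes that do not divide n are choices made here, not taken from the slides. The tile size `s` is the parameter that would have to be tuned per machine.

```python
def block_mult(A, B, n, s):
    """Return C = A * B for n x n lists of lists, processed in s x s tiles."""
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):            # tile row of C
        for j0 in range(0, n, s):        # tile column of C
            for k0 in range(0, n, s):    # tile index along the shared dimension
                # Ordinary multiplication on one triple of tiles (ORD-MULT).
                for i in range(i0, min(i0 + s, n)):
                    row_a, row_c = A[i], C[i]
                    for k in range(k0, min(k0 + s, n)):
                        a = row_a[k]
                        row_b = B[k]
                        for j in range(j0, min(j0 + s, n)):
                            row_c[j] += a * row_b[j]
    return C
```

The result does not depend on `s`; only the memory access pattern does, which is why a poorly chosen `s` costs time but not correctness.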
Two-Level / Three-Level Cache: one voodoo parameter per caching level! (tile sizes s, t, u)
BLOCK-MULT(A, B, C, n)
1 for i1 ← 1 to n/s
2   do for j1 ← 1 to n/s
3     do for k1 ← 1 to n/s
4       do for i2 ← 1 to s/t
5         do for j2 ← 1 to s/t
6           do for k2 ← 1 to s/t
7             do for i3 ← 1 to t/u
8               do for j3 ← 1 to t/u
9                 do for k3 ← 1 to t/u
10                  do ORD-MULT(A_ik, B_kj, C_ij, u)
(With one cache level, line 4 is ORD-MULT(A_ik, B_kj, C_ij, s); with two levels, line 7 is ORD-MULT(A_ik, B_kj, C_ij, t).)
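Recursive Matrix Multiplication • Divide and conquer on n × n matrices: $\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix} + \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix}$ • 8 multiplications of (n/2) × (n/2) matrices • 1 addition of n × n matrices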
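A minimal Python sketch of this divide-and-conquer scheme follows. It assumes n is a power of two and recurses down to 1 × 1 blocks; the names `rec_mult` and `multiply` are hypothetical. The single n × n addition is folded into the `+=` accumulation rather than performed as a separate pass, which is a simplification made here.

```python
def rec_mult(A, B, C, ai, aj, bi, bj, ci, cj, n):
    """Accumulate the product of two n x n blocks into a block of C.

    (ai, aj), (bi, bj), (ci, cj) are the top-left corners of the blocks
    inside A, B and C. No tile size appears anywhere in the code.
    """
    if n == 1:
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    # Eight recursive multiplications of (n/2) x (n/2) blocks:
    # C[di][dj] += A[di][dk] * B[dk][dj] for every choice of quadrant.
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                rec_mult(A, B, C,
                         ai + di, aj + dk,
                         bi + dk, bj + dj,
                         ci + di, cj + dj, h)


def multiply(A, B):
    """Return A * B for square matrices whose side is a power of two."""
    n = len(A)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    C = [[0] * n for _ in range(n)]
    rec_mult(A, B, C, 0, 0, 0, 0, 0, 0, n)
    return C
```

A practical implementation would stop the recursion at a small base-case block and use a plain loop there; the cutoff is omitted to keep the structure visible.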
Experiments: MM • Linux, Athlon 1 GHz, 1 GB, g++ -O3
Experiments: MM • Linux, Itanium, 2 GB, g++ -O3
Experiments: MT (Matrix Transposition) • Notebook, Windows 2k, 512 MB, PIII 1 GHz, g++ -O3
Experiments: MT • Notebook, Windows 2k, 512 MB, PIII 1 GHz, g++ -O3
Experiments: MT • Linux, Athlon 1 GHz, 1 GB, g++ -O3
Experiments: MT • Linux, Athlon 1 GHz, 1 GB, g++ -O3, Size = N x (P = 100), tall matrices
Experiments: MT • Linux, Athlon 1 GHz, 1 GB, g++ -O3, Size = N x (P = 1000)
What went wrong? • Blocking! • And the loop was in-place!
Experiments: MT (loop not in-place) • Linux, Athlon 1 GHz, 1 GB, g++ -O3, Size = N x N
Experiments: MT (loop not in-place) • Notebook, Windows 2k, 512 MB, PIII 1 GHz, g++ -O3
Did we miss something? • Alg 1: Naïve algorithm • Alg 2: Simple blocking using fixed B • Alg 3: Half copy • Alg 4: Full copy • Alg 5: CO • Alg 6: Morton ordering [Chatterjee & Sen, HPCA 00]
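For orientation, a cache-oblivious transposition (the "CO" entry in the list) is commonly written as a recursion that halves the larger dimension. The sketch below is an out-of-place Python version under that assumption; the function names are hypothetical and it is not the code that was benchmarked in the experiments.

```python
def co_transpose(A, B, r0, r1, c0, c1):
    """Write the transpose of A[r0:r1][c0:c1] into B, so B[j][i] = A[i][j]."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 0 or cols <= 0:
        return
    if rows == 1 and cols == 1:
        B[c0][r0] = A[r0][c0]
    elif rows >= cols:
        # Split the longer side (rows) in half.
        mid = r0 + rows // 2
        co_transpose(A, B, r0, mid, c0, c1)
        co_transpose(A, B, mid, r1, c0, c1)
    else:
        # Split the longer side (columns) in half.
        mid = c0 + cols // 2
        co_transpose(A, B, r0, r1, c0, mid)
        co_transpose(A, B, r0, r1, mid, c1)


def transpose(A):
    """Return the transpose of a non-empty rectangular list of lists."""
    rows, cols = len(A), len(A[0])
    B = [[None] * rows for _ in range(cols)]
    co_transpose(A, B, 0, rows, 0, cols)
    return B
```

Splitting the larger dimension keeps sub-blocks roughly square, so tall matrices are handled without a separate case.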
Static Searches • Only for balanced binary trees • Assume there are no insertions and deletions • Only searches • Can we speed it up? Better than O(log n)?
What is a layout? • Mapping of the nodes of a tree to memory • Different kinds of layouts: in-order, post-order, pre-order, van Emde Boas • Main idea: store recursive subtrees in contiguous memory
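To make the van Emde Boas layout concrete, here is a small Python sketch for a complete binary tree whose nodes are named by their BFS (heap) indices 1, 2, 3, … The rounding choice (top part gets the floor of half the height) and all function names are assumptions of this sketch; other presentations split slightly differently.

```python
def veb_order(root, height):
    """BFS indices of the complete subtree at `root` (with `height` levels),
    listed in van Emde Boas order: top half first, then each bottom subtree."""
    if height == 1:
        return [root]
    top_h = height // 2
    bot_h = height - top_h
    order = veb_order(root, top_h)
    first = root << top_h                 # leftmost descendant top_h levels down
    for r in range(first, first + (1 << top_h)):
        order.extend(veb_order(r, bot_h))
    return order


def inorder(i, n):
    """BFS indices of an n-node complete tree in sorted (in-order) order."""
    if i > n:
        return []
    return inorder(2 * i, n) + [i] + inorder(2 * i + 1, n)


def build_veb_tree(sorted_keys, height):
    """Place sorted keys into an array following the vEB order."""
    n = (1 << height) - 1
    assert len(sorted_keys) == n
    pos = {bfs: slot for slot, bfs in enumerate(veb_order(1, height))}
    layout = [None] * n
    for key, bfs in zip(sorted_keys, inorder(1, n)):
        layout[pos[bfs]] = key
    return layout, pos


def search(layout, pos, key):
    """Ordinary binary-tree descent; only the memory order of nodes differs."""
    i, n = 1, len(layout)
    while i <= n:
        k = layout[pos[i]]
        if key == k:
            return True
        i = 2 * i if key < k else 2 * i + 1
    return False
```

For a tree of height 4, `veb_order(1, 4)` places the three-node top subtree first and then each three-node bottom subtree contiguously. The `pos` dictionary is only a convenience here; an implementation tuned for speed would store child offsets in the nodes or compute positions arithmetically.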
Theoretical Guarantees? • Cache complexity: $Q(n) = O(\log_L n)$ • Work complexity: $W(n) = O(\log n)$ • From Prokop’s thesis
In Practice?? • Windows notebook, 512 MB, PIII 1 GHz, 256-byte nodes
In Practice II • Windows notebook, 512 MB, PIII 1 GHz, 32-byte nodes
In Practice III • Linux, Itanium, 2 GB, g++ -O3, 48-byte nodes
In Practice! • David S. Wise: Matrix operations by Morton ordering (practical cache-oblivious matrix operation results) • Bender, Duan, Wu: Cache-oblivious dictionaries • Rahman, Cole, Raman: CO B-Trees
Talk outline… • Motivation (Searching BBT) • The Model • CO-Sorting • …
(Z, L) Ideal Cache Model [Figure: processor P, a cache of Z/L cache lines of length L each, main memory; Q counts cache misses] • Features: • Two-level hierarchy. • Cache of size Z. • Cache-line length L. • Fully associative. • Optimal, omniscient replacement. • Measures: • Work W. • Cache misses Q.
Assumptions? • Two Levels of Memory • Tall Cache Assumption • Optimal Cache Replacement “No Asymptotic loss” • Fully-associative LRU can be used instead of optimal replacement with no asymptotic loss of performance [ST85]. • Fully-associative LRU caches can be maintained in ordinary memory with constant slowdown in expected performance.
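To experiment with the model, a fully associative LRU cache (the replacement policy the slide says can stand in for optimal replacement) is easy to simulate. The sketch below counts misses for a stream of word addresses; `lru_misses` and the two scan helpers are hypothetical names, and Z and L are measured in words as an assumption of this example.

```python
from collections import OrderedDict


def lru_misses(addresses, Z, L):
    """Count misses of a fully associative LRU cache holding Z words
    in lines of L words each (so Z // L lines)."""
    capacity = Z // L
    lines = OrderedDict()          # cached line numbers, least recent first
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in lines:
            lines.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            lines[line] = True
            if len(lines) > capacity:
                lines.popitem(last=False)  # evict least recently used line
    return misses


def row_major_scan(n):
    """Addresses touched when reading an n x n row-major matrix row by row."""
    return (i * n + j for i in range(n) for j in range(n))


def column_major_scan(n):
    """Addresses touched when reading the same matrix column by column."""
    return (i * n + j for j in range(n) for i in range(n))
```

With n = 64, Z = 256 and L = 8, the row-wise scan misses once per line, while the column-wise scan cycles through 64 lines with room for only 32, so every access misses. This is the kind of gap that blocking or recursive decomposition is meant to close.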
Cache Obliviousness • Cache-oblivious algorithms naturally tune for • varying cache sizes. • multiple levels of cache. When a subproblem fits into a given level of cache, no further cache misses are incurred beyond those required to bring the subproblem itself into the cache. • An optimal cache-oblivious algorithm can be made to run optimally in the HMM [AACN87] and SUMH [VN93] models
CO-Sorting! • Only two methods known • Funnel Sort (modified Merge Sort) • Distribution Sort (modified Sample Sort; we implement a randomized version) • Column Sort
Funnel Sort [Figure: input array → sorted output] • Partition the input into $n^{1/3}$ pieces of size $n^{2/3}$ each. • Sort each piece recursively. • Merge the sorted pieces using an $n^{1/3}$-merger.
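The recursive structure alone can be sketched in a few lines of Python. Note the substitution: the cache-oblivious k-merger is replaced here by the standard library's `heapq.merge`, so this sketch shows only the partitioning and recursion pattern and does not have funnel sort's cache behaviour. The function name and the base-case cutoff of 8 are assumptions.

```python
import heapq


def funnel_sort_sketch(a):
    """Sort a list by splitting it into pieces of size about n^(2/3),
    sorting each recursively, and merging the sorted pieces."""
    n = len(a)
    if n <= 8:                               # small inputs: sort directly
        return sorted(a)
    piece = max(1, round(n ** (2 / 3)))      # piece size ~ n^(2/3)
    pieces = [funnel_sort_sketch(a[i:i + piece]) for i in range(0, n, piece)]
    # Stand-in for the k-merger: a heap-based multiway merge.
    return list(heapq.merge(*pieces))
```

For n = 1000 the pieces have 100 elements each, giving ten sorted runs to merge, which matches the $n^{1/3}$ by $n^{2/3}$ split.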
Funnel Sort: k-mergers • Takes k sorted sequences as input • Outputs $k^3$ elements! • It’s a clever scheduling of mergers! • Keeps work complexity at O(n log n)
Funnel Sort: k-Merger • R is invoked $k^{3/2}$ times (make sure the buffers have enough elements) • One invocation of R outputs $k^{3/2}$ elements • Buffers are maintained as circular queues
Funnel Sort: Optimality • Work complexity: $W(n) = O(n \log n)$ • Cache complexity: $Q(n) = O\left(1 + (n/L)(1 + \log_Z n)\right)$ • Aggarwal and Vitter show that there is a matching lower bound on the number of cache misses.
Distribution Sort • Partition A into $\sqrt{n}$ sub-arrays, each of size $\sqrt{n}$; sort them recursively • Distribute into buckets • Sort the buckets recursively • Copy the buckets to the output
The Distribution Step • Has to distribute subarrays into buckets • Not In-Place • Similar to recursive Sample-Sort without doing Binary Search on pivots
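As a point of comparison, a plain randomized sample sort looks like the Python sketch below. It is deliberately not the cache-oblivious distribution step: it skips the initial recursive sort of the sub-arrays, it is not in-place either, and it does use binary search on the pivots (via `bisect`), which the slide says the actual method avoids. The function name, the cutoff of 16, and the choice of about √n random pivots are assumptions.

```python
import bisect
import random


def sample_sort_sketch(a, rng=random):
    """Sort a list by distributing it into buckets around random pivots."""
    n = len(a)
    if n <= 16:
        return sorted(a)
    q = max(1, int(n ** 0.5))                    # about sqrt(n) pivots
    pivots = sorted(set(rng.sample(a, q)))
    # less[i] holds elements strictly between pivots[i-1] and pivots[i];
    # equal[i] holds elements equal to pivots[i], which need no more sorting.
    less = [[] for _ in range(len(pivots) + 1)]
    equal = [[] for _ in pivots]
    for x in a:
        i = bisect.bisect_left(pivots, x)
        if i < len(pivots) and pivots[i] == x:
            equal[i].append(x)
        else:
            less[i].append(x)
    out = []
    for i, bucket in enumerate(less):
        out.extend(sample_sort_sketch(bucket, rng))
        if i < len(equal):
            out.extend(equal[i])
    return out
```

Keeping pivot-equal elements in their own lists guarantees that every recursive call receives strictly fewer elements, so inputs with many duplicates still terminate.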
The Recursive Bucketing Used [Figure: SubArray1 and SubArray2 distributed into Buffer 1 and Buffer 2]