Cache Conscious Algorithms for Relational Query Processing
by Ambuj Shatdal, Chander Kant, Jeffrey F. Naughton
Presentation of Group 9: Wong Suet-Fai Newman, Chau Man-Hau Dee
Introduction
Why cache performance is so important:
• The performance gap between the processor and memory.
• Memory access speeds: annual improvement of only 25%.
• Processor clock speeds: increase by about 60% every year.
Introduction (cont.)
• Wrong perception: once data is in memory, it is accessed as fast as it can be.
• In cache: 2-4 processor cycles
• In main memory: 15-25 cycles
• If we can keep data in cache: 8%-200% faster!!
Introduction (cont.)
• Ways to improve cache performance: a larger cache, or a better algorithm.
• Goal: to show the benefits of redesigning traditional query processing algorithms so that they make better use of the cache.
• This paper focuses on join and aggregation algorithms.
Major Parameters of Cache
• Capacity (C): how big the cache is.
• Block Size (B): how many bytes are fetched at a time.
• Associativity (A): the number of unique places in the cache where a particular block may reside.
  i. Fully associative: A = C/B, i.e. data from any address can be stored in any cache location.
  ii. Direct mapped: A = 1, i.e. each block can reside in exactly one cache location.
  iii. A-way set associative: 1 < A < C/B (a compromise between fully associative and direct mapped).
Most caches are direct mapped or use very small set associativity, and an LRU replacement policy.
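The relationship between the three parameters can be made concrete with a small sketch. The sizes below (32 KB capacity, 64-byte blocks) are hypothetical examples, not taken from the paper:

```python
def cache_geometry(capacity, block_size, assoc):
    """Return (total blocks, number of sets) for a cache with the
    given capacity C, block size B and associativity A."""
    blocks = capacity // block_size   # C / B
    sets = blocks // assoc            # blocks are grouped into sets of A
    return blocks, sets

def set_index(addr, block_size, sets):
    """The cache set a memory address maps to: the block number
    modulo the number of sets."""
    return (addr // block_size) % sets

# 32 KB cache, 64-byte blocks:
#   direct mapped     (A = 1):     512 blocks, 512 sets
#   fully associative (A = C/B):   512 blocks, 1 set
```

Note that with a direct-mapped cache, two addresses exactly 32 KB apart map to the same set, which is precisely the situation that produces conflict misses.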
3 Types of cache miss:
• Compulsory misses: the very first reference to a cache block, i.e. the block was not accessed before.
• Capacity misses: the cache cannot contain all the blocks needed during execution of the program.
• Conflict misses (also called collision or interference misses): a reference that hits in a fully associative cache but misses in an A-way set associative cache, i.e. placement restrictions (not fully associative) cause useful blocks to be displaced, e.g. when different memory locations map to the same cache index.
Optimization Techniques
• Background:
• Algorithm optimization: ensure as few cache misses as possible, without much CPU overhead.
• Not concerned with the exact cache configuration (block size and associativity).
• Use a cache profiler (cprof) to localize the optimization space.
Technique 1. Blocking
• Restructures the algorithm to reuse chunks of data that fit in the cache, e.g. with block size BKSZ:

  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      process(a[i], b[j]);

becomes

  for (bkNo = 0; bkNo < N / BKSZ; bkNo++)
    for (i = 0; i < M; i++)
      for (j = bkNo * BKSZ; j < (bkNo + 1) * BKSZ; j++)
        process(a[i], b[j]);
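A runnable sketch of the same transformation (function names are illustrative; unlike the slide's pseudocode, it also handles an N that is not a multiple of BKSZ):

```python
def process_all(a, b, process):
    # straightforward nest: every element of b is re-fetched from
    # memory on each pass over a when b does not fit in the cache
    for x in a:
        for y in b:
            process(x, y)

def process_blocked(a, b, process, bksz):
    # blocked nest: each cache-sized chunk of b is reused across all
    # of a while it is still cache-resident
    for start in range(0, len(b), bksz):
        chunk = b[start:start + bksz]
        for x in a:
            for y in chunk:
                process(x, y)
```

Both versions apply `process` to exactly the same (x, y) pairs; only the order of the memory accesses changes.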
Technique 2. Partitioning
• Distribute the data into partitions, e.g. in sorting:

  partition relation into blocks of size < BKSZ
  for each partition r
    quicksort(r)
  merge partitions

Trade-off: creating the partitions adds overhead, but usually the benefit gained overshadows it.
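A minimal Python sketch of partitioned sorting, assuming numeric keys and using range partitioning so that the final "merge" is just concatenation (the partition count and function names are illustrative, not from the paper):

```python
def partitioned_sort(keys, n_parts):
    """Range-partition the keys so each partition fits in the cache,
    sort each partition independently, then concatenate."""
    lo, hi = min(keys), max(keys)
    width = (hi - lo) // n_parts + 1
    parts = [[] for _ in range(n_parts)]
    for k in keys:
        parts[(k - lo) // width].append(k)
    out = []
    for p in parts:
        p.sort()        # stand-in for an in-cache quicksort
        out.extend(p)   # range partitioning makes merging a concatenation
    return out
```

Because each partition's working set is small, the per-partition sorts incur far fewer capacity misses than one sort over the whole relation.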
Difference between blocking and partitioning • Blocking: restructure the algorithm, no change in layout of data • Partitioning: layout of data is reorganized to maximize use of cache
Technique 3. Extracting Relevant Data
• Reduce the data required, e.g. in sorting: instead of sorting whole records, extract only the sort key and a pointer to the record, so more relevant data fits in the cache.

Technique 4. Loop Fusion

  for (i = 0; i < N; i++) {
    extractKey(a[i]);
    extractPointer(a[i]);
  }
  for (i = 0; i < N; i++)
    buildHashTable(a[i]);

becomes

  for (i = 0; i < N; i++) {
    extractKey(a[i]);
    extractPointer(a[i]);
    buildHashTable(a[i]);
  }
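The two techniques combine naturally: a sketch of building a hash table over (key, pointer) pairs, first in two passes and then with the loops fused (names are illustrative):

```python
def build_separate(records, key_of):
    # two passes: extract (key, pointer) pairs, then build the table;
    # each record is touched twice, possibly after it left the cache
    pairs = [(key_of(rec), i) for i, rec in enumerate(records)]
    table = {}
    for key, ptr in pairs:
        table.setdefault(key, []).append(ptr)
    return table

def build_fused(records, key_of):
    # fused loop: extract and insert in one pass, while the record
    # and its freshly extracted pair are still cache-resident
    table = {}
    for i, rec in enumerate(records):
        table.setdefault(key_of(rec), []).append(i)
    return table
```

Both produce the same table; fusion only removes the second pass over the data.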
Technique 5. Data Clustering
• Group related attributes together, e.g. at the physical database design level, fields that are accessed contemporaneously are stored together.
This paper concentrates on reducing capacity misses. It focuses on improving the temporal and spatial locality of memory accesses rather than on an optimal memory layout of the relations.
Performance Evaluation
Experiment 1. Hash Joins
Two relations, R and S: build a hash table on R, then join S against it.

1. Basic Hash Join

  BuildHashTable(H[R]);
  for each s in S
    Probe(s, H[R]);

2. With Key Extraction

  for each r in R {
    ExtractKeyPointers(r);
    BuildHashTable(H[R]);
  }
  for each s in S
    Probe(s, H[R]);
3. Partitioned

  ExtractKeyPointers_And_Partition(R);
  ExtractKeyPointers_And_Partition(S);
  for each partition i {
    BuildHashTable(H[R[i]]);
    for each s in S[i]
      Probe(s, H[R[i]]);
  }
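The partitioned variant can be sketched end to end in Python. Relations are modeled as lists of (key, payload) tuples, and the partition count is a stand-in for "small enough that each build table fits in the cache" (all names are illustrative):

```python
def partitioned_hash_join(R, S, n_parts):
    """Hash-partition both inputs on the join key, then build and
    probe a small per-partition hash table."""
    Rp = [[] for _ in range(n_parts)]
    Sp = [[] for _ in range(n_parts)]
    for key, payload in R:                   # partition R
        Rp[hash(key) % n_parts].append((key, payload))
    for key, payload in S:                   # partition S
        Sp[hash(key) % n_parts].append((key, payload))
    out = []
    for i in range(n_parts):
        table = {}
        for key, payload in Rp[i]:           # build phase, cache-sized
            table.setdefault(key, []).append(payload)
        for key, payload in Sp[i]:           # probe phase
            for r_payload in table.get(key, []):
                out.append((key, r_payload, payload))
    return out
```

Matching keys always hash to the same partition, so joining partition pairs independently loses no results.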
Results:
• BaseHash(R,S): baseline, no key extraction.
• Extraction(R,S): overhead of attribute and pointer extraction, but a reduction in cache misses in the build and probe phases. Overall performance: 7.2% faster than BaseHash, with about 10% fewer cache misses.
• Partitioned(R,S): improvement from dividing the relations into partitions so that the build and probe phases incur fewer cache misses. Overall performance: 6.6% faster than BaseHash, with about 25% fewer cache misses.
Findings in the Hash Joins experiment:
• BaseHash has the fewest compulsory misses.
• The results hold in general, not only on specific machines.
• The compiler plays a strong role: it greatly affects efficiency.
Experiment 2. The Sort Merge Join

• BaseSort(R,S)
  ExtractKeyPointers(R)
  ExtractKeyPointers(S)
  Sort(R)
  Sort(S)
  Merge(R,S)

• ImmediateSort(R,S)
  ExtractKeyPointers(R)
  Sort(R)
  ExtractKeyPointers(S)
  Sort(S)
  Merge(R,S)
• PartitionedSort(R,S)
  ExtractKeyPointer_and_Partition(R)
  for each partition i
    Sort(R[i])
  ExtractKeyPointer_and_Partition(S)
  for each partition i
    Sort(S[i])
  for each partition i
    Merge(R[i], S[i])

• ImprovedSort(R,S)
  ExtractKeyPointer_and_Partition(R)
  ExtractKeyPointer_and_Partition(S)
  for each partition i {
    Sort(R[i])
    Sort(S[i])
    Merge(R[i], S[i])
  }
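A minimal sketch of the ImprovedSort idea, assuming numeric keys with no duplicates within a relation, and range partitioning so that partition i of R can only join with partition i of S (bounds, partition count and names are illustrative):

```python
def merge_join(R, S):
    """Merge step of sort-merge join over two key-sorted lists."""
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            i += 1
        elif R[i] > S[j]:
            j += 1
        else:
            out.append(R[i])
            i += 1
            j += 1
    return out

def improved_sort_join(R, S, n_parts, lo, hi):
    """Range-partition both inputs, then sort and merge each
    partition pair while its data is still cache-resident."""
    width = (hi - lo) // n_parts + 1
    Rp = [[] for _ in range(n_parts)]
    Sp = [[] for _ in range(n_parts)]
    for k in R:
        Rp[(k - lo) // width].append(k)
    for k in S:
        Sp[(k - lo) // width].append(k)
    out = []
    for i in range(n_parts):
        Rp[i].sort()
        Sp[i].sort()
        out.extend(merge_join(Rp[i], Sp[i]))
    return out
```

Each partition pair is sorted and merged back to back, which is what keeps the working set in the cache across all three steps.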
Findings in the Sort Merge experiment:
Much better improvement in Sort Merge than in Hash Join, because the sort operation is much more memory-intensive and computationally expensive.
Experiment 3. Nested Loops
Traditionally, people think nothing can be done to improve nested loops once the data is already in memory.

BaseNestedLoop(R,S)
  ExtractKeyPointers(R)
  ExtractKeyPointers(S)
  for each tuple r in R
    for each tuple s in S
      if join(r,s) then produce result

BlockedNestedLoop(R,S)
  ExtractKeyPointers(R)
  ExtractKeyPointers(S)
  for each block b of S
    for each tuple r in R
      for each tuple s in b
        if join(r,s) then produce result
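A runnable sketch of the blocked nested loop join, using equality on the keys as a stand-in join predicate (block size and names are illustrative):

```python
def blocked_nested_loop_join(R, S, bksz):
    """Scan S in cache-sized blocks; each block of S stays
    cache-resident while all of R is streamed past it."""
    out = []
    for start in range(0, len(S), bksz):
        block = S[start:start + bksz]
        for r in R:
            for s in block:
                if r == s:          # stand-in join predicate
                    out.append((r, s))
    return out
```

Compared with the base version, every r-versus-s comparison still happens exactly once; only the order changes so that each block of S is fetched from memory once instead of once per tuple of R.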
On the SUN 10/51 the performance improvement is not significant, because it has a 1 MB secondary cache, which helps a lot even in the BaseNestedLoop case.
Experiment 4. Aggregation
Hash-Based Aggregation

1. BaseHash(R)
  for each tuple t in R
    Hash(t)
    Insert/update the hash table entry for the group

2. Extraction(R)
  for each tuple t in R
    ExtractKeyPointer(t)
    Hash(t)
    Insert/update the hash table entry for the group
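A minimal sketch of the hash-based aggregation loop, computing a SUM per group (the accessor functions are illustrative):

```python
def hash_aggregate(tuples, key_of, value_of):
    """One pass over the input: hash each tuple's grouping key and
    update the running aggregate (here a SUM) for its group."""
    groups = {}
    for t in tuples:
        k = key_of(t)
        groups[k] = groups.get(k, 0) + value_of(t)
    return groups
```

Note that the input relation is scanned exactly once, which is why, as the next slide explains, there is little for cache-conscious restructuring to exploit here.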
No improvement !!! • Reason: The hash table is accessed only once. All are compulsory misses and therefore key pointer extraction doesn’t help! • Lesson: Cache optimizations can be subtle and specific to a particular algorithm.
Choices of Result Generation in Join Algorithms
• On the Fly: the result tuple is produced as soon as a match is found in the join.
• Lazy: when a match is found, two pointers to the corresponding tuples are stored, generating an in-memory join index. The result is generated later, depending on need.
Why is the "Lazy" algorithm not much slower than "On the Fly"?
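The lazy scheme can be sketched as follows, with a simple nested-loop match on the first field standing in for whatever join found the matches (names and relation format are illustrative):

```python
def lazy_join(R, S):
    """Instead of materializing result tuples on the fly, record pairs
    of positions: an in-memory join index."""
    return [(i, j)
            for i, r in enumerate(R)
            for j, s in enumerate(S)
            if r[0] == s[0]]

def materialize(index, R, S):
    """Build the actual result tuples later, on demand."""
    return [(R[i], S[j]) for i, j in index]
```

The index entries are small and written sequentially, which keeps the join's working set compact; the cost of copying full tuples is deferred until (and unless) the result is actually consumed.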
Conclusions
• Main memory should not be the end of optimization for database algorithms.
• Designing algorithms with the cache in mind can significantly improve their performance.
• Most of the time we have to use a cache profiler to find the poorly performing parts of the code.
Q & A ~ The End ~