CS 345: Topics in Data Warehousing • Tuesday, October 19, 2004
Review of Thursday’s Class • Bridge tables • Hierarchies • Multi-Valued Dimensions • Extraction-Transformation-Load • Data staging area vs. data warehouse • Assigning surrogate keys • Detecting changed rows • Detecting duplicate dimension rows
Outline of Today’s Class • Database System Architecture • Memory management • Secondary storage (disk) • Query planning process • Joins • Nested Loop Join • Merge Join • Hash Join • Grouping • Sort vs. Hash
Memory Management • Modern operating systems use virtual memory • Address space larger than physical main memory • Layer of indirection between virtual addresses and physical address • Physical memory serves as a cache for most recently accessed memory locations • Infrequently accessed memory locations written to disk when cache fills up • Memory transferred in large blocks called pages • Disk access: latency is large relative to throughput • Benefit from locality of reference • Transfer overheads amortized over many requests
Memory Management • Database management systems (DBMSs) use a similar model to virtual memory • Data is stored persistently on disk • Main memory serves as cache for frequently accessed data • This memory cache is called the buffer pool • Data organized into blocks called pages • DBMSs override built-in virtual memory • Database operations are very data intensive • Careful management of disk access essential for good performance • DBMS has specialized knowledge of its data access patterns • Specialized memory management includes: • Prefetching pages likely to be needed in the future • Essential / frequently accessed pages pinned in memory • Multiple buffer pools with different parameters for different needs • Parameters: Page size, page replacement policy • Needs: Relations, indexes, temporary results, sort space
Disk Access Time • Seek cost + rotational latency + transfer cost • Seek cost • Move the disk head to the right place • Physical machinery • Rotational latency • Wait for the disk to spin • Average wait = ½ rotation • Transfer cost • Read the data while the disk spins • (Figure: disk drive diagram labeling seek cost and rotational latency)
Random vs. Sequential I/O • Seek cost and rotational latency are overhead • Independent of the transfer size • ~ 20 ms on commodity disks • One large transfer is much cheaper than many small transfers • Random vs. Sequential I/O • I/O = Input/Output • Random I/O: Read/write an arbitrary page • Sequential I/O: Read/write the next page • Sequential I/O is 10-20x faster than random I/O • Avoiding random I/O will be a theme • Cost of sequential I/O is less • Prefetching easy when access patterns predictable
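As a rough illustration of why one large transfer beats many small ones, here is a toy cost model in Python; the 10 ms per-access overhead and 100 MB/s transfer rate are assumed round numbers for illustration, not figures from the lecture:

```python
# Back-of-the-envelope disk I/O cost model (assumed, illustrative numbers).
SEEK_PLUS_ROTATION_S = 0.010   # seek + rotational latency per access (assumed)
TRANSFER_MB_PER_S = 100.0      # sequential transfer rate (assumed)

def io_time_seconds(num_accesses, mb_per_access):
    """Total time = per-access overhead + transfer time."""
    transfer = num_accesses * mb_per_access / TRANSFER_MB_PER_S
    return num_accesses * SEEK_PLUS_ROTATION_S + transfer

# Reading 100 MB as 12,800 random 8 KB pages vs. one sequential scan:
random_time = io_time_seconds(12_800, 8 / 1024)   # overhead dominates
sequential_time = io_time_seconds(1, 100)          # transfer dominates
```

Under these assumed numbers the random-access plan spends almost all of its time on overhead, which is exactly why avoiding random I/O is a recurring theme.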
Query Planning Process • Many different query plans are possible • Logically equivalent ways of answering the same query • Job of the query optimizer: choose the most efficient plan for a given query • Most efficient = minimize resource consumption • Resources = CPU, disk I/O • Consuming fewer resources → faster query responses • Design of query optimizer • Rule-based • Apply heuristic transformations (e.g. push selections down) • Cost-based • Estimate cost (resource usage) of each possible query plan • Choose the plan with lowest estimated cost • Cost estimation uses statistics collected by the optimizer • In practice, optimizers use a combination of the two • Search space of possible plans is too large to exhaustively explore • Heuristics guide which types of plans are considered for cost-based optimization
Query planning pipeline: SQL query → parse → parse tree → query rewriting → best logical query plan → physical plan generation → best physical query plan → execute → result (query rewriting and physical plan generation consult optimizer statistics)
Join Processing • OLAP queries involve joins • Fact table joins to dimension tables • Primary-key / foreign-key joins • Choice of join algorithm has large impact on query cost • Naïve join processing algorithm • To join relations A and B: • Generate “Cartesian product” of A and B • Each A tuple is combined with each B tuple • Apply selection conditions to each tuple in the Cartesian product • Problem: Intermediate result can be huge! • Relation A has 10 million rows • Relation B has 100,000 rows • A x B has 1 trillion rows (many terabytes of data) • Need a more efficient, logically equivalent procedure
Nested Loop Join (NLJ) • NLJ (conceptually): for each r ∈ R1 do • for each s ∈ R2 do • if r.C = s.C then output the r,s pair
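The conceptual loop above can be sketched as runnable Python, with relations represented as lists of dicts; the function name and column parameters are illustrative:

```python
def nested_loop_join(r1, r2, key1, key2):
    """Naive nested-loop join: compare every R1 tuple with every R2 tuple.

    r1, r2: lists of dicts (one dict per tuple).
    key1, key2: names of the join columns (the r.C = s.C condition).
    """
    out = []
    for r in r1:                      # outer loop over R1
        for s in r2:                  # inner loop over R2
            if r[key1] == s[key2]:    # join condition
                out.append((r, s))
    return out
```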
Nested Loop Join • In theory, multi-way nested loop joins are possible • Joining n relations → n nested “for” loops • In practice, only binary (2-way) join operators are implemented • Joining n relations → cascade of n-1 binary joins • Nested-loop join terminology: • Outer input: scanned in outer loop of join • Inner input: scanned in inner loop of join • (Figure: three-way join of A, B, C expressed as a cascade of two binary joins)
Analysis of Nested Loop Join • Outer has m tuples, inner has n tuples • CPU complexity = O(m*n) • For each of m outer tuples, compare with each of n inner tuples • Quadratic running time • What about I/O complexity? • Depends on amount of memory available • We’ll assume B tuples per page • Consider four scenarios • Use 2 pages of memory • Use (m/B + n/B) pages of memory • Use (n/B + 1) pages of memory • Use (k + 1) pages of memory • For k < m/B
Nested Loop Join (2 pages) • One page for outer, one page for inner • NLJ algorithm: • Load 1st outer block into memory • Load all n/B blocks of inner, one after another • Load 2nd outer block into memory • Load all n/B blocks of inner, one after another • … • Load (m/B)th outer block into memory • Load all n/B blocks of inner, one after another • Total I/O cost = m/B * (n/B + 1)
Nested Loop Join (m/B+n/B pages) • Load both inputs entirely into memory • All computation takes place in memory • No further paging necessary • Total I/O cost = m/B + n/B • Requires a lot of memory • Practical if both inputs are small
Nested Loop Join (n/B + 1 pages) • Allocate 1 page for outer • Load entire inner into memory (n/B pages) • NLJ algorithm: • Load entire inner into memory • Load 1st outer block into memory • Inner is already in memory → no additional I/O needed • Load 2nd outer block into memory • Inner is already in memory → no additional I/O needed • … • Load (m/B)th outer block into memory • Inner is already in memory → no additional I/O needed • Total I/O cost = m/B + n/B • Same I/O cost as previous slide • Practical if inner input is small
Nested Loop Join (k+1 pages) • Allocate 1 page for inner • Load k pages of outer into memory at a time • NLJ algorithm: • Load 1st group of k outer blocks into memory • Load all n/B blocks of inner, one after another • Load 2nd group of k outer blocks into memory • Load all n/B blocks of inner, one after another • … • Load (m/kB)th group of k outer blocks into memory • Load all n/B blocks of inner, one after another • Total I/O cost = m/kB * (n/B + k) • Reduction from 2-page case by a factor of k
Summary of NLJ Analysis • CPU usage is O(m*n) • Quadratic in size of inputs • I/O usage depends on memory allocated • If one input is small: • Make the small input the inner • I/O complexity is O(m/B + n/B) • Linear in size of inputs • The minimum possible • Otherwise: • Make the small input the outer • Slightly better than the other way around • I/O complexity is O(m/kB * (n/B + k)) • When k+1 pages of memory are available • Approximately mn/(kB²)
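A small helper, under the slides' assumptions (m and n tuples, B tuples per page, k+1 pages of memory), makes the blocked-NLJ cost formula concrete; the function name is illustrative, and ceilings are used so partial blocks count as whole ones:

```python
import math

def nlj_io_cost(m, n, B, k):
    """I/O cost of blocked NLJ with k+1 memory pages.

    Per group of k outer blocks: read the k outer blocks once, then
    scan all n/B inner blocks -- the slide formula m/kB * (n/B + k).
    The last (possibly partial) group is charged a full k for simplicity.
    """
    outer_blocks = math.ceil(m / B)
    inner_blocks = math.ceil(n / B)
    groups = math.ceil(outer_blocks / k)
    return groups * (inner_blocks + k)
```

With m = n = 1000, B = 10: k = 1 (the 2-page case) costs 10,100 I/Os, while k = 10 costs 1,100, showing the factor-of-k reduction.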
Improving on Nested Loops Join • For each outer tuple, we scan the entire inner relation to find matching tuples • Maybe we can do better by using data structures • Two main ideas: • Sorting • Hashing
Sorting • Using sorting to improve NLJ • First we sort the inner relation • Then, for each outer tuple, find matching inner tuple(s) using binary search over sorted inner • CPU usage improves • O(mn) → O(m log n) + sort cost • Sub-quadratic • This strategy is indexed nested-loop join • To be covered in more detail on Thursday • Implementation details • Use B-Tree instead of binary tree as search structure • Larger branching factor → better I/O efficiency
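A minimal sketch of the sort-plus-binary-search idea, using Python's bisect over a sorted list in place of the B-tree a real DBMS would use; names are illustrative:

```python
import bisect

def sorted_inner_join(outer, inner, key):
    """Sort the inner relation on the join key, then binary-search it
    for each outer tuple: O(m log n) comparisons plus the sort cost."""
    inner_sorted = sorted(inner, key=lambda t: t[key])
    keys = [t[key] for t in inner_sorted]     # parallel list of join keys
    out = []
    for r in outer:
        i = bisect.bisect_left(keys, r[key])  # first possible match
        while i < len(keys) and keys[i] == r[key]:
            out.append((r, inner_sorted[i]))  # emit all matching inner tuples
            i += 1
    return out
```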
Merge Join • What if we sort both inputs instead of one? • No need to perform binary search for each outer tuple • Since outer tuples are also in sorted order, access to inner relation is structured • Binary search accesses pages in arbitrary order → random I/O • Merge join uses sequential I/O • CPU usage improves • O(mn) → O(m + n) + sort costs • (Figure: indexed NLJ issues a random I/O into the inner for each outer tuple; merge join scans both sorted inputs sequentially)
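A merge join over two inputs already sorted on the join key might look like this sketch; it handles duplicate join keys on either side, and names are illustrative:

```python
def merge_join(r_sorted, s_sorted, key):
    """Merge join: advance two cursors in lockstep over sorted inputs."""
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        rk, sk = r_sorted[i][key], s_sorted[j][key]
        if rk < sk:
            i += 1                    # R is behind: advance R
        elif rk > sk:
            j += 1                    # S is behind: advance S
        else:
            # Emit all S tuples matching this R key without losing our
            # place in S, so the next duplicate R key can rescan them.
            jj = j
            while jj < len(s_sorted) and s_sorted[jj][key] == rk:
                out.append((r_sorted[i], s_sorted[jj]))
                jj += 1
            i += 1
    return out
```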
External-Memory Sorting • How to efficiently sort large data sets? • Consider I/O complexity • Sort a relation with n tuples, B tuples per block • Case 1: Use n/B pages of memory • Case 2: Use k+1 pages of memory • Case 1: Use n/B pages of memory • Read the relation into memory • n/B sequential disk reads • Sort in main memory using quicksort • No disk access required • Write the sorted output back to disk • n/B sequential disk writes • Total: 2n/B disk I/Os (all sequential) • Requires lots of memory → only useful for small inputs
External-Memory Sorting • Case 2: Use k pages of memory • k ≥ sqrt(n/B) • Two phases (run generation + merge) • Run generation • Repeat for each group of k blocks: • Read group of k blocks into memory • Sort using quicksort • Write sorted “run” back out to disk • Merge • Read 1st page from each run into memory • Merge to produce globally sorted output • Write output pages to disk as they are filled • As a page is exhausted, replace by next page in run • Each block read + written once per phase • Total disk I/Os: 4n/B • (Figure: sorted runs on disk merged through memory into the final output)
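The two phases (run generation + k-way merge) can be sketched in a few lines; for brevity the sorted runs are kept in memory rather than written to disk, which a real external sort must do:

```python
import heapq

def external_sort(values, k_pages, B):
    """Two-phase external merge sort sketch.

    Run generation: cut the input into runs of k_pages * B values and
    sort each run in memory (quicksort via Python's sorted()).
    Merge: k-way merge of all runs via a heap (heapq.merge).
    A real implementation writes runs to disk between the phases.
    """
    run_size = k_pages * B
    runs = [sorted(values[i:i + run_size])
            for i in range(0, len(values), run_size)]
    return list(heapq.merge(*runs))
```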
Hashing • Store inner input in a hash table • Tuples grouped together into buckets using a hash function • Apply hash function to join attributes (called the key) to determine bucket • Idea: all records with the same key must be in the same bucket • For each outer tuple • Hash key to determine which bucket it goes in • Scan that bucket for matching inner tuples • CPU usage improves • O(mn) → O(m + n) • Assuming a good hash function with few collisions • Collision = tuples with different keys hash to same bucket • Only useful for equality joins • Not for join conditions like WHERE A.foo > B.bar • In practice, the vast majority of joins are equality joins
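The hashing idea can be sketched as runnable Python (relations as lists of dicts; names are illustrative):

```python
from collections import defaultdict

def hash_join(probe, build, key):
    """In-memory hash join: build a hash table on the (smaller) build
    input, then probe it with each tuple of the other input.
    Only works for equality join conditions."""
    buckets = defaultdict(list)
    for s in build:                         # build phase
        buckets[s[key]].append(s)
    out = []
    for r in probe:                         # probe phase
        for s in buckets.get(r[key], []):   # scan matching bucket
            out.append((r, s))
    return out
```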
Hash Join Analysis • Hash join terminology • Inner input is called build input • Outer input is called probe input • Build a hash table based on one input, probe into that hash table using the other • Probe has m tuples, build has n tuples • Using (n/B + 1) pages of memory: • Allocate 1 page for probe input • Read build input into memory + create hash table • Uses n/B pages • Read 1st page of probe input • Hash each tuple and search bucket for matching build tuples • Requires no additional I/O • Repeat for 2nd and subsequent pages of probe input • I/O complexity = O(n/B + m/B) • Suitable when one input is small
Partitioned Hash Join • Two-phase algorithm (Partition + Join) • Uses k blocks of memory • Phase 1: Partition • Performed separately for build and probe inputs • Read 1st page into memory • Assign each tuple to a partition 1..k using hash function • Each partition is allocated 1 block of memory • Tuples assigned to same partition are placed together • When the block fills up, write it to disk • Each partition stored in a contiguous region on disk • Repeat for 2nd and subsequent pages • Phase 2: Join phase • Join Partition 1 of probe input with Partition 1 of build input using ordinary hash join • Use a secondary hash function • Repeat with Partition 2 of each input, etc. • A form of recursion • Partitioning creates smaller problem instances
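A sketch of the partition + join phases, with partitions held in memory for brevity; a real system writes each partition to disk, and uses independently chosen hash functions for the partition and join steps, whereas Python's built-in hash stands in for both here:

```python
def partitioned_hash_join(probe, build, key, k=4):
    """Partitioned hash join sketch: hash-partition both inputs into k
    partitions, then hash-join matching partitions pairwise."""
    def partition(rel):
        parts = [[] for _ in range(k)]
        for t in rel:
            parts[hash(t[key]) % k].append(t)   # partition phase
        return parts

    probe_parts, build_parts = partition(probe), partition(build)
    out = []
    for p, b in zip(probe_parts, build_parts):  # join phase, pairwise
        table = {}
        for s in b:                             # per-partition build
            table.setdefault(s[key], []).append(s)
        for r in p:                             # per-partition probe
            for s in table.get(r[key], []):
                out.append((r, s))
    return out
```

Since tuples with equal keys land in the same partition of both inputs, joining partitions pairwise loses no matches; each sub-join is a smaller instance of the original problem.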
Partition Phase • (Figure: unsorted input streams from disk through memory into the partitions — similar to the merge phase of merge join, only backwards)
Summary of Join Methods • When 1 relation fits entirely in memory, all join methods have linear I/O complexity • When both relations are large: • NLJ has quadratic I/O complexity • Merge join and hash join have linear I/O complexity • CPU complexity • Quadratic for NLJ • O(n log n) for merge join (dominated by the sorts) • Linear for hash join • Rules of Thumb: • Use nested loop join for very small inputs • Use merge join for large inputs that are already sorted • Use hash join for large, unsorted inputs
Query Plans with Multiple Joins • Two kinds of execution: • Pipelined • No need to store intermediate results on disk • Available memory must be shared between joins • Materialized • No memory contention between joins • Spend I/Os writing intermediate results to disk • Blocking operators (sort, partition, build hash table) prevent pipelined execution • Interesting orders • Sorted input → sorted output • NLJ: output is sorted by same key as outer input • Merge join: output is sorted by same key as inputs • Partitioned input → partitioned output • Hash join: output is partitioned by same key as inputs • Take advantage of interesting orders • Eliminate sorting / partitioning phases of later joins
Grouping • First partition tuples on grouping attributes • Tuples in same group placed together • Tuples in different groups separated • Then scan tuples in each partition, compute aggregate expressions • Two techniques for partitioning • Sorting • Sort by the grouping attributes • All tuples with same grouping attributes will appear together in sorted list • Hashing • Hash by the grouping attributes • All tuples with same grouping attributes will hash to same bucket • Sort / re-hash within each bucket to resolve collisions
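Both partitioning techniques can be sketched for a SUM aggregate (rows as dicts; function names are illustrative):

```python
from itertools import groupby

def group_sum_sort(rows, group_key, agg_key):
    """Sort-based grouping: sort on the grouping attribute, then scan
    adjacent runs of equal keys and aggregate each run."""
    rows = sorted(rows, key=lambda r: r[group_key])
    return {g: sum(r[agg_key] for r in grp)
            for g, grp in groupby(rows, key=lambda r: r[group_key])}

def group_sum_hash(rows, group_key, agg_key):
    """Hash-based grouping: bucket rows by grouping attribute in one pass,
    accumulating the aggregate as rows arrive."""
    acc = {}
    for r in rows:
        acc[r[group_key]] = acc.get(r[group_key], 0) + r[agg_key]
    return acc
```

Either way the result is the same; the choice mirrors the merge-join vs. hash-join trade-off on the previous slides.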