Query Optimization 3: Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15
Administrivia • Homework 2 Due Tonight • Remember you have 4 slip days for the course • Homeworks 3 & 4 available later this week • 3 is written assignment, deals with optimization, due before midterm 2 • 4 is programming assignment, implementing query processing, due after Spring Break • Midterm 2 is 3/22, 2 weeks from Thursday
Review: Query Processing • Queries start out as SQL • Database translates SQL to one or more Relational Algebra plans • Plan is a tree of operators, with an access path for each • Access path is how each operator gets tuples • If working directly on a table, can use a scan or an index • Some operators, like sort-merge join or group-by, need tuples sorted • Often, operators are pipelined, consuming tuples output by earlier operators in the tree • Database estimates the cost of various plans, chooses the least expensive
Today • Review costs for: • Sorting • Selection • Projection • Joins • Re-examine Hashing for: • Projection • Joins
General External Merge Sort • More than 3 buffer pages. How can we utilize them? • To sort a file with N pages using B buffer pages: • Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each. • Pass 1, 2, …, etc.: merge B-1 runs at a time. • [Figure: B main-memory buffers; B-1 input buffers feed one output buffer, streaming runs from disk back to disk]
Cost of External Merge Sort • Minimum amount of memory: 3 pages • Initial runs of 3 pages • Then 2-way merge of sorted runs (2 pages for input, one for output) • # of passes: 1 + ⌈log2⌈N/3⌉⌉ • With more memory, fewer passes • With B pages, # of passes: 1 + ⌈log(B-1)⌈N/B⌉⌉ • I/O Cost = 2N * (# of passes)
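The pass-count and I/O formulas are easy to sanity-check. A minimal sketch in Python, assuming runs shrink (B-1)-fold per merge pass; the function names are illustrative, not lecture code:

```python
from math import ceil

def num_passes(n_pages: int, b: int) -> int:
    """Pass 0 builds ceil(N/B) sorted runs; each later pass is a (B-1)-way merge."""
    runs, passes = ceil(n_pages / b), 1
    while runs > 1:
        runs = ceil(runs / (b - 1))
        passes += 1
    return passes

def sort_io_cost(n_pages: int, b: int) -> int:
    """Every pass reads and writes each page once: 2N I/Os per pass."""
    return 2 * n_pages * num_passes(n_pages, b)
```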
External Sort Example • E.g., with 5 buffer pages, to sort a 108-page file: • Pass 0: • 22 sorted runs of 5 pages each (last run is only 3 pages) • Now, do four-way (B-1) merges • Pass 1: • 6 sorted runs of 20 pages each (last run is only 8 pages) • Pass 2: • 2 sorted runs, 80 pages and 28 pages • Pass 3: • Sorted file of 108 pages
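A quick trace (a hypothetical helper under the same assumptions as the sketch above) reproduces the run counts on this slide:

```python
from math import ceil

def trace_sort(n_pages: int, b: int) -> None:
    """Print the number and maximum size of sorted runs after each pass."""
    runs, size, p = ceil(n_pages / b), b, 0
    print(f"Pass 0: {runs} runs of up to {size} pages")
    while runs > 1:
        runs, size, p = ceil(runs / (b - 1)), size * (b - 1), p + 1
        print(f"Pass {p}: {runs} runs of up to {min(size, n_pages)} pages")

trace_sort(108, 5)  # 22 runs -> 6 -> 2 -> 1 (the sorted 108-page file)
```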
Using B+ Trees for Sorting • Scenario: • Table to be sorted has B+ tree index on sorting column(s). • Idea: • Retrieve records in order by traversing leaf pages. • Is this a good idea? Cases to consider: • B+ tree is clustered → Good idea! • B+ tree is not clustered → Could be a very bad idea! • I/O Cost • Clustered tree: ~1.5N • Unclustered tree: 1 I/O per tuple, worst case!
Selections: “age < 20”, “fname = Bob”, etc. • No index • Do sequential scan over all tuples • Cost: N I/Os • Sorted data • Do binary search • Cost: log2(N) I/Os • Clustered B-Tree • Cost: 2 or 3 I/Os to find the first record + 1 I/O per page of qualifying tuples • Unclustered B-Tree • Cost: 2 or 3 I/Os to find the first RID + ~1 I/O per qualifying tuple • Clustered Hash Index • Cost: ~1.2 I/Os to find the bucket, all matching tuples inside • Unclustered Hash Index • Cost: ~1.2 I/Os to find the bucket + ~1 I/O per matching tuple
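These access-path costs are simple enough to encode directly. A minimal sketch using the approximate constants above; all function and parameter names are illustrative:

```python
from math import ceil, log2

def seq_scan(n_pages):               return n_pages
def binary_search(n_pages):          return ceil(log2(n_pages))    # sorted file
def clustered_btree(match_pages):    return 3 + match_pages        # 2-3 I/Os to reach the leaf
def unclustered_btree(match_tuples): return 3 + match_tuples       # ~1 I/O per qualifying tuple
def clustered_hash():                return 1.2                    # bucket holds all matches
def unclustered_hash(match_tuples):  return 1.2 + match_tuples
```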
Selection Exercise (assume the relation occupies 100 pages) • Sal > 100 • Sequential Scan: 100 I/Os • Btree on sal? Unclustered, 503 I/Os • BTree on <age, sal>? Can’t use it: sal is not a prefix of the key • Age = 25 • Sequential Scan: 100 I/Os • Hash on age: ~1.2 I/Os to get to the bucket • 20 matching tuples, 1 I/O for each • BTree <age, sal>: ~3 I/Os to find the leaf + # matching pages • 20 matching tuples, clustered, ~2 I/Os • Age > 20 • Sequential Scan: 100 I/Os • Hash on age? Not usable for a range query. • BTree <age, sal>: ~3 I/Os to find the leaf + # matching pages • (Age > 20) matches 90% of pages, or ~90 * 1.5 = 135 I/Os
Selection Exercise (cont.) • Eid = 1000 • Sequential Scan: ~50 I/Os on average (can stop at the single match) • Hash on eid: ~1.2 I/Os to find the bucket + 1 I/O to get the record • Sal > 100 and age < 30 • Sequential Scan: 100 I/Os • Btree <age, sal>: ~3 I/Os to find the leaf; 30% of pages match, so 30 * 1.5 = 45 I/Os
Projection • Expensive when eliminating duplicates • Can do this via: • Sorting: cost no more than an external sort • Cheaper if you project out unneeded columns in the initial pass, since more projected tuples fit in each page • Hashing: build a hash table; duplicates will end up in the same bucket
An Alternative to Sorting: Remove duplicates with Hashing! • Idea: • Many of the things we use sort for don’t exploit the order of the sorted data • e.g.: removing duplicates in DISTINCT • e.g.: finding matches in JOIN • Often good enough to match all tuples with equal values • Hashing does this! • And may be cheaper than sorting! (Hmmm…!) • But how to do it for data sets bigger than memory??
General Idea • Two phases: • Partition: use a hash function h to split tuples into partitions on disk. • Key property: all matches live in the same partition. • ReHash: for each partition on disk, build a main-memory hash table using a hash function h2
Two Phases • Partition: [Figure: original relation on disk streamed through hash function h, using B main-memory buffers, into B-1 partitions written back to disk] • Rehash: [Figure: each partition Ri (≤ B pages) read from disk and built into a main-memory hash table using hash function h2, producing the result]
Duplicate Elimination using Hashing • Read one bucket at a time • For each group of identical tuples, output one • [Figure: same partition/rehash dataflow as the previous slide]
Hashing in Detail • Two phases, two hash functions • First pass: partition into (B-1) buckets • E.g., B = 5 pages, h(x) = the two low-order bits • [Figure: input file of values 6, 2, 1, 9, 4, 1, 5, 6, 1 routed through memory into one output file per bucket]
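An in-memory sketch of the two-phase idea applied to DISTINCT; Python lists stand in for disk partitions, and the partition count and hash functions are illustrative placeholders:

```python
def external_distinct(tuples, num_partitions=4):
    # Phase 1 (partition): route each tuple to a partition with hash fn h.
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        partitions[hash(("h", t)) % num_partitions].append(t)
    # Phase 2 (rehash): per partition, build an in-memory table with a
    # second hash function and output one tuple per group of duplicates.
    out = []
    for part in partitions:
        seen = set()  # stands in for the h2 hash table
        for t in part:
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out

print(external_distinct([6, 2, 1, 9, 4, 1, 5, 6, 1]))  # duplicates removed; order depends on h
```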
Memory Requirement for External Hashing • If we can hash in two passes, cost is 4N • How big a table can we hash in two passes? • B-1 “partitions” result from Phase 0 • Each should be no more than B pages in size • Answer: B(B-1). Said differently: we can hash a table of N pages in about √N space • Note: assumes hash function distributes records evenly! • Have a bigger table? Recursive partitioning!
Memory Requirement for External Sorting • How big a table can we sort in two passes? • Each “sorted run” after Phase 0 is of size B • Can merge up to B-1 sorted runs in Phase 1 • Answer: B(B-1). Said differently: we can sort a table of N pages in about √N space • Have a bigger table? Additional merge passes!
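Both √N claims follow from requiring B(B-1) ≥ N. A small check, with illustrative names:

```python
from math import isqrt

def min_buffers_two_pass(n_pages: int) -> int:
    """Smallest B with B*(B-1) >= N, i.e., roughly sqrt(N)."""
    b = isqrt(n_pages) + 1
    while b * (b - 1) < n_pages:
        b += 1
    return b

print(min_buffers_two_pass(108))  # 11: a 108-page file sorts (or hashes) in 2 passes with 11 buffers
```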
So which is better?? • Based on our simple analysis: • Same memory requirement for 2 passes • Same I/O cost • Digging deeper… • Sorting pros: • Great if input already sorted (or almost sorted) • Great if output needs to be sorted anyway • Not sensitive to “data skew” or “bad” hash functions • Hashing pros: • Highly parallelizable (will discuss later in the semester) • Can exploit extra memory to reduce # I/Os (stay tuned…)
Nested Loops Joins • R, with M pages, joins S, with N pages • Nested Loops • Simple nested loops • Insanely inefficient: M + (pR * M) * N, where pR is tuples per page of R • Paged nested loops – only 3 pages of memory • M + M*N • Blocked nested loops – B pages of memory • M + ⌈M/(B-2)⌉ * N • If M fits in memory (B-2 pages), cost is only M + N • Index nested loops • M + (pR * M) * (cost of one index lookup) • Only good if M is very small
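The four cost formulas, transcribed directly; p_r (tuples per page of R) and lookup_cost are illustrative parameter names:

```python
from math import ceil

def simple_nlj(M, N, p_r):          return M + p_r * M * N            # scan S once per R tuple
def paged_nlj(M, N):                return M + M * N                  # scan S once per R page
def blocked_nlj(M, N, B):           return M + ceil(M / (B - 2)) * N  # scan S once per (B-2)-page block
def index_nlj(M, p_r, lookup_cost): return M + p_r * M * lookup_cost  # probe index once per R tuple

print(blocked_nlj(M=1000, N=500, B=102))  # 1000 + 10 * 500 = 6000 I/Os
```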
Sort-Merge Join • Simple case: • sort both tables on join column • merge • Cost: external sort cost + merge cost: 2M * (1 + ⌈log(B-1)⌈M/B⌉⌉) + 2N * (1 + ⌈log(B-1)⌈N/B⌉⌉) + M + N • Optimized case: • If we have enough memory, do the final merge and the join in the same pass. This avoids the final write pass of the sort and the read pass of the merge. • Can we merge on the 2nd pass? Only if # runs from the 1st pass < B • # runs for R is ⌈M/B⌉; # runs for S is ⌈N/B⌉ • Total # runs ≈ (M+N)/B • Can merge on the 2nd pass if (M+N)/B < B, i.e., M+N < B² • Cost: 3(M+N)
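A sketch combining the two cases; it assumes both inputs need a full external sort and applies the optimized 3(M+N) cost whenever M+N < B²:

```python
from math import ceil

def smj_cost(M, N, B):
    """2-pass optimized sort-merge join if it fits, else sort both fully and merge."""
    if M + N < B * B:
        return 3 * (M + N)
    def sort_cost(pages):
        runs, passes = ceil(pages / B), 1
        while runs > 1:
            runs, passes = ceil(runs / (B - 1)), passes + 1
        return 2 * pages * passes
    return sort_cost(M) + sort_cost(N) + M + N
```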
Hash Join • Partition phase: [Figure: both R and S streamed through hash function h into B-1 partitions each on disk] • Matching phase: [Figure: for each partition i, build an in-memory hash table on Ri (B-2 pages) with hash function h2; read Si one page at a time, probe with h2, and write out join results]
Cost of Hash Join • Partitioning phase: read + write both relations: 2(|R|+|S|) I/Os • Matching phase: read both relations: |R|+|S| I/Os (output writes not counted) • Total cost of 2-pass hash join = 3(|R|+|S|) Q: what is the cost of a 2-pass merge-sort join? Q: how much memory is needed for a 2-pass sort join? Q: how much memory is needed for a 2-pass hash join?
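A rough pass at the three questions, using only facts from the surrounding slides (the √ bounds ignore small ±1 terms; names are illustrative):

```python
from math import isqrt

def hash_join_2pass_cost(R, S):  return 3 * (R + S)
def smj_2pass_cost(R, S):        return 3 * (R + S)  # same, per the optimized SMJ slide

def smj_2pass_buffers(R, S):     # needs M + N < B^2
    return isqrt(R + S) + 1

def hash_join_buffers(R, S):     # partitions of the *smaller* input must fit: roughly B^2 > min(|R|, |S|)
    return isqrt(min(R, S)) + 1
```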
An important optimization to hashing • Have B memory buffers • Want to hash a relation of size N • [Chart: # of passes vs. N — 1 pass (cost N) when N ≤ B; 2 passes (cost 3N) when B < N ≤ B²] • If B < N < B², will have unused memory…
Hybrid Hashing • Idea: keep one of the hash buckets in memory! • [Figure: input streamed through hash function h; the first partition stays in a k-page in-memory hash table (hash fn h3), while the remaining tuples are partitioned into B-k partitions on disk] • Q: how do we choose the value of k?
Cost reduction due to hybrid hashing • Now: [Chart: I/O cost vs. N — still cost N (one pass) when N ≤ B; for B < N ≤ B², cost climbs gradually from N toward 3N rather than jumping straight to 3N, since pages in the memory-resident partition are never written out and re-read]
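A hedged sketch of where the saving comes from: pages that land in the memory-resident partition skip one write and one re-read. The resident fraction is an illustrative parameter, not something the slides quantify:

```python
def hybrid_hash_cost(n_pages: int, resident_fraction: float) -> float:
    """3N for plain two-pass hashing, minus 2 I/Os per memory-resident page."""
    if resident_fraction >= 1.0:     # whole relation fits in memory: one read
        return n_pages
    return 3 * n_pages - 2 * resident_fraction * n_pages

print(hybrid_hash_cost(1000, 0.0))   # 3000.0: no resident partition
print(hybrid_hash_cost(1000, 0.25))  # 2500.0: a quarter of the pages stay in memory
```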
Summary: Hashing vs. Sorting • Sorting pros: • Good if input already sorted, or need output sorted • Not sensitive to data skew or bad hash functions • Hashing pros: • Often cheaper due to hybrid hashing • For join: # passes depends on size of smaller relation • Highly parallelizable
Summary • Several alternative evaluation algorithms for each operator. • Query evaluated by converting to a tree of operators and evaluating the operators in the tree. • Must understand query optimization in order to fully understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). • Two parts to optimizing a query: • Consider a set of alternative plans. • Must prune search space; typically, left-deep plans only. • Must estimate cost of each plan that is considered. • Must estimate size of result and cost for each plan node. • Key issues: Statistics, indexes, operator implementations.