150 likes | 378 Views
Access Path Selection in an RDBMS. ICS 624 Spring 2011 Zachary Heilbron. Motivation. SQL: What to get, not how to get it Up to the system to figure out how Of course, speed is ideal. Query Evaluation. What are some ways to evaluate a query?
E N D
Access Path Selection in an RDBMS • ICS 624 Spring 2011 • Zachary Heilbron
Motivation • SQL: What to get, not how to get it • Up to the system to figure out how • Of course, speed is ideal
Query Evaluation • What are some ways to evaluate a query? • Selinger’s approach: Consider different access paths that look at a subset of relevant tuples, decide which paths make sense, assign costs to each path, and execute the least cost plan
Quick Refresher • Page: holds data records and is a logical unit of transfer between disk and memory • Page sizes in System R are4KB • Segment: a logical unit that represents a collection of pages • Guaranteed to contain at least an entire relation (relations cannot span segments) Is this page size reasonable for modern DBMSs? Segments analogous to heap/file?
Access Paths • Segment Scan: Scan each data page in the segment and return tuples belonging to the given relation • Index Scan: Scan each index page and return the associated tuples What are the advantages of providing sargable predicates to these scans? Is an index scan always faster than a segment scan?
Indexing • Entries for index pages: • <key, data> (data entry, k*) • <key, rid> (pointer) [System R] • <key, list of rids> (list of pointers) What are the advantages of each? How are indexes implemented?
Clustering • Clustered Index: data entries that the index references are ordered (or close to) on the indexing key How are clustered indexes maintained? Which index entry type is best for clustering?
Cost Model • Based on disk I/O and CPU utilization C = PageFetches + W*(RSI CALLS) Where does a DBMS spend most of its time? DBMSs disk I/Os come from page fetches, but when is CPU utilization a factor?
Selectivity • Selectivity Factor: corresponds to the fraction of tuples to be returned • Cataloged statistics (relation and index cardinalities) used in assigning selectivity factors to boolean factors • Based on predefined formulas Are these cataloged statistics a good way to estimate selectivity?
Join Methods • Nested Loop Join: For each tuple of R, scan all tuples of S returning those that match • Merging Scans or (sort)-merge join: Maintain pointers to tuples in the sorted lists, advance pointers looking for matches • If neither relations are indexed on the joining column, then sort them into temporary lists What other join methods exist?
Join Ordering • All orderings produce the same result, but produce different costs • n! ways to order joins - ABCD, ABDC, ... • (n-1)! plans per ordering - (((AB)C)D), ... • Naïve approach infeasible How can we narrow the search space down?
Join Ordering • Selinger’s Optimizations: • Prune the search space by considering only left-deep plans (((AB)C)D) • Push-down selections • Consider successively larger subsets, saving cross products for last (i.e., make use of selective predicates!) • Dynamic programming Are left-deep plans always better?
Selinger’s Solution • Find the best way to access a single relation • Find the best way to join another relation to the single relation • Find the best way to join another relation to the pair of relations, and so on... • nC1 + nC2 + ... + nCn = 2^n plans considered for each interesting order (or equivalence class) Exponential better than factorial!
Interesting Orders • Multiple solutions are maintained: • “Unordered” solution • Ordered solutions for each interesting order (or equivalence class) • At the end, compare the cost of each ordered solution and the unordered solution plus the cost of sorting
Solutions • What does a solution (QEP) look like? • the order to perform the joins • the method used to perform each join • the access path for each relation • any additional sorting (if necessary)