150 likes | 169 Views
Explore access path selection, query evaluation, indexing, clustering, cost models, selectivity, join methods, ordering, and optimizations in relational database management systems. Learn about Selinger's approach and efficient join strategies.
E N D
Access Path Selection in an RDBMS • ICS 624 Spring 2011 • Zachary Heilbron
Motivation • SQL: What to get, not how to get it • Up to the system to figure out how • Of course, speed is ideal
Query Evaluation • What are some ways to evaluate a query? • Selinger’s approach: Consider different access paths that look at a subset of relevant tuples, decide which paths make sense, assign costs to each path, and execute the least cost plan
Quick Refresher • Page: holds data records and is a logical unit of transfer between disk and memory • Page sizes in System R are4KB • Segment: a logical unit that represents a collection of pages • Guaranteed to contain at least an entire relation (relations cannot span segments) Is this page size reasonable for modern DBMSs? Segments analogous to heap/file?
Access Paths • Segment Scan: Scan each data page in the segment and return tuples belonging to the given relation • Index Scan: Scan each index page and return the associated tuples What are the advantages of providing sargable predicates to these scans? Is an index scan always faster than a segment scan?
Indexing • Entries for index pages: • <key, data> (data entry, k*) • <key, rid> (pointer) [System R] • <key, list of rids> (list of pointers) What are the advantages of each? How are indexes implemented?
Clustering • Clustered Index: data entries that the index references are ordered (or close to) on the indexing key How are clustered indexes maintained? Which index entry type is best for clustering?
Cost Model • Based on disk I/O and CPU utilization C = PageFetches + W*(RSI CALLS) Where does a DBMS spend most of its time? DBMSs disk I/Os come from page fetches, but when is CPU utilization a factor?
Selectivity • Selectivity Factor: corresponds to the fraction of tuples to be returned • Cataloged statistics (relation and index cardinalities) used in assigning selectivity factors to boolean factors • Based on predefined formulas Are these cataloged statistics a good way to estimate selectivity?
Join Methods • Nested Loop Join: For each tuple of R, scan all tuples of S returning those that match • Merging Scans or (sort)-merge join: Maintain pointers to tuples in the sorted lists, advance pointers looking for matches • If neither relations are indexed on the joining column, then sort them into temporary lists What other join methods exist?
Join Ordering • All orderings produce the same result, but produce different costs • n! ways to order joins - ABCD, ABDC, ... • (n-1)! plans per ordering - (((AB)C)D), ... • Naïve approach infeasible How can we narrow the search space down?
Join Ordering • Selinger’s Optimizations: • Prune the search space by considering only left-deep plans (((AB)C)D) • Push-down selections • Consider successively larger subsets, saving cross products for last (i.e., make use of selective predicates!) • Dynamic programming Are left-deep plans always better?
Selinger’s Solution • Find the best way to access a single relation • Find the best way to join another relation to the single relation • Find the best way to join another relation to the pair of relations, and so on... • nC1 + nC2 + ... + nCn = 2^n plans considered for each interesting order (or equivalence class) Exponential better than factorial!
Interesting Orders • Multiple solutions are maintained: • “Unordered” solution • Ordered solutions for each interesting order (or equivalence class) • At the end, compare the cost of each ordered solution and the unordered solution plus the cost of sorting
Solutions • What does a solution (QEP) look like? • the order to perform the joins • the method used to perform each join • the access path for each relation • any additional sorting (if necessary)