1 / 51

Temple University – CIS Dept. CIS616– Principles of Data Management

Temple University – CIS Dept. CIS616– Principles of Data Management. V. Megalooikonomou Query Processing / Optimization (based on notes by C. Faloutsos at CMU). Data-files. catalog. Overview of a DBMS. Naïve user. casual user. DBA. DML parser. DDL parser. DML precomp. trans. mgr.

Download Presentation

Temple University – CIS Dept. CIS616– Principles of Data Management

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Temple University – CIS Dept.CIS616– Principles of Data Management V. Megalooikonomou Query Processing / Optimization (based on notes by C. Faloutsos at CMU)

  2. Data-files catalog Overview of a DBMS Naïve user casual user DBA DML parser DDL parser DML precomp. trans. mgr buffer mgr

  3. Overview - detailed • Motivation - Why q-opt? • Equivalence of expressions • Cost estimation • Cost of indices • Join strategies

  4. Why Q-opt? • SQL: ~declarative • good q-opt -> big difference • e.g., seq. Scan vs B-tree index, on P=1,000 pages

  5. Q-opt steps bring query in internal form (e.g., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alternative plans estimate cost; pick best

  6. Q-opt - example p s TAKES STUDENT Canonical form p select name from STUDENT, TAKES where c-id=‘CIS616’ and STUDENT.ssn=TAKES.ssn s STUDENT TAKES

  7. Q-opt - example p s TAKES STUDENT Hash join; merge join; nested loops; Index; seq. scan

  8. Overview - detailed • Why q-opt? • Equivalence of expressions • Cost estimation • Cost of indices • Join strategies

  9. Equivalence of expressions • … or syntactic q-opt • In short: perform selections and projections early • More details: see transformation rules in text

  10. Equivalence of expressions • Q: How to prove a transformation rule? • A: use TRC, to show that LHS = RHS, e.g.:

  11. Equivalence of expressions

  12. Equivalence of expressions

  13. Equivalence of expressions • Selections • perform them early • break a complex predicate, and push • simplify a complex predicate • (‘X=Y and Y=3’) -> ‘X=3 and Y=3’

  14. Equivalence of expressions • Projections • perform them early (but carefully…) • Smaller tuples • Fewer tuples (if duplicates are eliminated) • project out all attributes except the ones requested or required (e.g., joining attributes)

  15. Equivalence of expressions • Joins • Commutative , associative • Q: n-way join - how many diff. orderings? … Exhaustive enumeration too slow…

  16. Q-opt steps bring query in internal form (e.g., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans estimate cost; pick best

  17. Cost estimation • E.g., find ssn’s of students with an ‘A’ in CIS616 (using seq. scanning) • How long will a query take? • CPU (but: small cost; decreasing; tough to estimate) • Disk (mainly, # block transfers) • How many tuples will qualify? • (what statistics do we need to keep?)

  18. Sr #1 #2 #3 … #nr Cost estimation • Statistics: for each relation ‘r’ we keep • nr : # tuples; • Sr : size of tuple in bytes • V(A,r): number of distinct values of attr. ‘A’ • (recently, histograms, too)

  19. Sr fr #1 #2 … #br Derivable statistics • fr: blocking factor = max# records/block (=?? ) • br: # blocks (=?? ) • SC(A,r) = selection cardinality = avg# of records with A=given (=?? )

  20. Derivable statistics • fr: blocking factor = max# records/block (= B/Sr ; B: block size in bytes) • br: # blocks (= nr / fr )

  21. Derivable statistics • SC(A,r) = selection cardinality = avg# of records with A=given (= nr / V(A,r) ) (assumes uniformity...) – eg: 30,000 students, 10 colleges – how many students in CST?

  22. Additional quantities we need: • For index ‘i’: • fi: average fanout - degree (~50-100) • HTi: # levels of index ‘i’ (~2-3) • ~ log(#entries)/log(fi) • LBi: # blocks at leaf level HTi

  23. Statistics • Where do we store them? • How often do we update them?

  24. Q-opt steps bring query in internal form (e.g., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans selections; sorting; projections joins estimate cost; pick best

  25. Sr fr #1 #2 … #br Cost estimation + plan generation • Selections – e.g., select * from TAKES where grade = ‘A’ • Plans?

  26. Sr fr #1 #2 … #br Cost estimation + plan generation • Plans? • seq. scan • binary search • (if sorted & consecutive) • index search • if an index exists

  27. Sr fr #1 #2 … #br Cost estimation + plan generation seq. scan – cost? • br (worst case) • br/2 (average, if we search for primary key)

  28. Sr fr #1 #2 … #br Cost estimation + plan generation binary search – cost? if sorted and consecutive: • ~log(br) + • SC(A,r)/fr (=#blocks spanned by qualified tuples) • -1

  29. Sr fr #1 #2 … #br Cost estimation + plan generation estimation of selection cardinalities SC(A,r): non-trivial

  30. Sr fr #1 #2 … #br Cost estimation + plan generation method#3: index – cost? • levels of index + • blocks w/ qual. tuples ... case#1: primary key case#2: sec. key – clustering index case#3: sec. key – non-clust. index

  31. Sr fr #1 #2 … #br HTi Cost estimation + plan generation method#3: index – cost? • levels of index + • blocks w/ qual. tuples .. case#1: primary key – cost: HTi + 1

  32. HTi Cost estimation + plan generation Sr method#3: index - cost? • levels of index + • blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index OR prim. index on non-key …retrieve multiple records HTi + SC(A,r)/fr … #br

  33. Sr fr #1 #2 … #br Cost estimation + plan generation method#3: index – cost? • levels of index + • blocks w/ qual. tuples ... case#3: sec. key – non-clust. index HTi + SC(A,r) (actually, pessimistic...)

  34. Cost estimation – arithmetic examples find accounts with branch-name = ‘Perryridge’ account(branch-name, balance, ...)

  35. Arithm. examples – cont’d • n-account = 10,000 tuples • f-account = 20 tuples/block • V(balance, account) = 500 distinct values • V(branch-name, account) = 50 distinct values • for branch-index: fanout fi = 20

  36. Arithm. examples • Q1: cost of seq. scan? • A1: 500 disk accesses • Q2: assume a clustering index on branch-name – cost?

  37. HTi Cost estimation + plan generation Sr method#3: index – cost? • levels of index + • blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index HTi + SC(A,r)/fr … #br

  38. Arithm. examples • A2: HTi + SC(branch-name, account)/f-account • HTi: 50 values, with index fanout 20 -> HT=2 levels (log(50)/log(20) = 1+) • SC(..)= # qualified records = • nr/V(A,r) = 10,000/50 = 200 tuples • SC/f: spanning 200/20 blocks = 10 blocks

  39. Arithm. examples • A2 final answer: 2+10= 12 block accesses • (vs. 500 block accesses of seq. scan) • footnote: in all fairness • seq. disk accesses: ~2msec or less • random disk accesses: ~10msec

  40. Overview - detailed • Motivation - Why q-opt? • Equivalence of expressions • Cost estimation • Cost of indices • Join strategies

  41. 2-way joins algorithm(s) for r JOIN s? nr, ns tuples each r(A, ...) s(A, ......) nr ns

  42. 2-way joins Algorithm #0: (naive) nested loop (SLOW!) for each tuple tr of r for each tuple ts of s print, if they match r(A, ...) s(A, ......) nr ns

  43. 2-way joins Algorithm #0: why is it bad? how many disk accesses (‘br’ and ‘bs’ are the number of blocks for ‘r’ and ‘s’)? r(A, ...) s(A, ......) nr ns nr*bs + br

  44. 2-way joins Algorithm #1: Blocked nested-loop join read in a block of r read in a block of s print matching tuples cost: br + br * bs r(A, ...) s(A, ......) nr, br ns records, bs blocks

  45. 2-way joins Arithmetic example: nr = 10,000 tuples, br = 1,000 blocks ns = 1,000 tuples, bs = 200 blocks alg#0: 2,001,000 d.a. alg#1: 201,000 d.a. r(A, ...) s(A, ......) 10,000 1,000 1,000 records, 200 blocks

  46. 2-way joins Observation1: Algo#1: asymmetric: cost: br + br * bs- reverse roles: cost= bs + bs*br Best choice? smallest relation in outer loop r(A, ...) s(A, ......) nr, br ns records, bs blocks

  47. 2-way joins Other algorithm(s) for r JOIN s? nr, ns tuples each r(A, ...) s(A, ......) nr ns

  48. 2-way joins - other algo’s sort-merge sort ‘r’; sort ‘s’; merge sorted versions (good, if one or both are already sorted) r(A, ...) s(A, ......) nr ns

  49. hash join: hash ‘r’ into (0, 1, ..., ‘max’) buckets hash ‘s’ into buckets (same hash function) join each pair of matching buckets 2-way joins - other algo’s s(A, ......) r(A, ...) 0 1 max

  50. More heuristics by Oracle, Sybase and Starburst (-> DB2) : in book In general: q-opt is very important for large databases. (‘explain select <sql-statement>’ gives plan) Structure of query optimizers:

More Related