CS4432: Database Systems II

CS4432: Database Systems II Cost and Size Estimation

Overview of Query Execution The size at these two points affects which join algorithm to choose Affects which physical plan to select Affects the cost

Common Statistics over Relation R • B(R): # of blocks to hold all R tuples • T(R): # tuples in R • S(R): # of bytes in each of R’s tuple • V(R, A): # distinct values in attribute R.A We care about computing these statistics for each intermediate relation

Requirements for Estimation Rules • Give accurate estimates • Are easy (fast) to compute • Are logically consistent: estimated size should not depend on how the relation is computed Here we describe some simple heuristics.

Estimating Size of SelectionU = sp (R) • Equality Condition: R.A = c, where c is a constant • Reasonable estimate T(U) = T(R) / V(R,A) • That is: Original number of tuples divided by number of different values of A • Range Condition: c1 < R.A < c2: • If R.A domain is known D  T(U) =T(R) x (c2- c1)/D • Otherwise  T(U) = T(R)/3 • Non-Equality Condition: R.A ≠ c • A good estimate  T(U) = T(R )

Example • Consider relation R(a,b,c) with 10,000 tuples and 50 different values for attribute a. • Consider selecting all tuples from R with (a = 10 and b < 20). • Estimate of number of resulting tuples: • 10,000*(1/50)*(1/3) = 67. If condition is the AND of several predicates  estimate in series.

Estimating Size of Selection (Cont’d) If condition has the form C1 OR C2, use: • Sumof estimate for C1 and estimate for C2, Or • Assuming C1 and C2 are independent, T(R)*(1  (1f1)*(1f2)), where f1 is fraction of R satisfying C1 and f2 is fraction of R satisfying C2 Select from R with (a = 10 or b < 20) • R(a,b) 10,000 tuples and 50 different values for a. • Estimate • Estimate for a = 10 is 10,000/50 = 200 • Estimate for b < 20 is 10,000/3 = 3333 • Estimate for combined condition is • 200 + 3333 = 3533, OR • 10,000*(1  (1  1/50)*(1  1/3)) = 3466 • Different, but not really

Estimating Size of Natural Join U = R S • Assume join is on a single attribute Y. • Some Possibilities: • R and S have disjoint sets of Y values, so size of join is 0 • Y is the key of S and a foreign key of R, so size of join is T(R) • All the tuples of R and S have the same Y value, so size of join is T(R)*T(S) • We need some assumptions… • Expected number of tuples in result is: • T(U) = T(R)*T(S) / max(V(R,Y),V(S,Y))

For Joins U = R1(A,B) R2(A,C) T(U) = T(R1) x T(R2) / max(V(R1,A), V(R2,A)) What are different V(U,*) values? V(U,A) = min { V(R1, A), V(R2, A) } V(U,B) = V(R1, B) V(U,C) = V(R2, C) Property: “preservation of value sets”

Example: Z = R1(A,B) R2(B,C) R3(C,D) T(R1) = 1000 V(R1,A)=50 V(R1,B)=100 T(R2) = 2000 V(R2,B)=200 V(R2,C)=300 T(R3) = 3000 V(R3,C)=90 V(R3,D)=500 R1 R2 R3

Partial Result: U = R1 R2 T(U) = 10002000 200 V(U,A) = 50 V(U,B) = 100 V(U,C) = 300

Z = U R3 T(Z) = 100020003000 200300 V(Z,A) = 50 V(Z,B) = 100 V(Z,C) = 90 V(Z,D) = 500

40 30 20 10 20 30 10 40 2 3 3 1 2 1 3 8 4 2 0 1 2 4 9 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 More on Estimation • Uniform distribution is not accurate since real data is not uniformly distributed. • Histogram: • A data structure maintained by a DBMS to approximate a data distribution. • Divide range of column values into subranges (buckets). Assume distribution within histogram bucket is uniform. number of tuples in R with A value in given range

Summary of Estimation Rules • Projection: exactly computable • Product: exactly computable • Selection: reasonable heuristics • Join: reasonable heuristics • The other operators are harder to estimate…

CS4432: Database Systems II

CS4432: Database Systems II

Presentation Transcript

DATABASE CONCEPTS

Database Operations on GPU

Compiler Concepts for Database Systems

Database Systems: Design, Implementation, and Management Ninth Edition

Mobile Database Systems

Database Security and Authorization

Object oriented Database

DB Programming

Chapter 6 DATABASES, DATA WAREHOUSES AND OLAP

Database Access using SQL

Database Management Systems

INFORMATION SYSTEMS IN BIOLOGY: FOCUS ON DATABASE

Progress Database Admin

Microsoft Access

INFO2120 – INFO2820 – COMP5138 Database Systems

Database Theory

Oracle Database Systems

Chapter 2- From Systems to Descriptive Models

Oracle Database 12c Release 1 (12.1.0.2 )

ICOM 5016 – Introduction to Database Systems