Deriving Predicate Statistics (SDP) in Datalog
Principles and Practice of Declarative Programming, 12th International ACM SIGPLAN Symposium
July 26, 2010, Hagenberg, Austria
Senlin Liang and Michael Kifer, Stony Brook University
Summary of Our Approach
• Motivation
  • Take advantage of cost-based optimizations in deductive database systems
  • Compute cost information (predicate statistics)
  • Store and retrieve cost information efficiently
  • Apply optimization techniques
• Advantages of our approach
  • Keeps argument dependencies
  • Handles recursion
  • Handles negation
“Deriving Predicate Statistics in Datalog” by Senlin Liang and Michael Kifer
Outline
• Introduction
  • Traditional approach: histograms + argument independence assumption
  • Error grows exponentially
• SDP
  • Dependency matrix stores predicate statistics
  • Abstract interpretation of Datalog rules, which are evaluated over dependency matrices
• Experimental studies
• Future work
Histograms
• Data distribution: T = ((v1, f1), ..., (vn, fn)), where fi is the frequency of value vi
  • E.g., ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
• Histograms
  • Partition the data distribution into groups
  • Summarize each group as a bucket: (floor, ceiling, size, count)
  • Compute the values and frequencies in each bucket efficiently
• MaxDiff histograms with β buckets
  • Partition T at the β-1 largest frequency differences between adjacent values
Example: MaxDiff Histograms (3 buckets)
T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
Frequency differences between adjacent values: 1, 1, 2, 2, 1, 0
1. Partition T using the 2 largest frequency differences (between values 4 and 5, and between values 5 and 6)
2. Summarize each group as (floor, ceiling, size, count): (2,4,3,4) (5,5,1,3) (6,8,3,5)
3. Value-frequency approximation: vals(bucket) = [floor, ceiling]; f(val) = count/size, e.g., f(7) = 5/3
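The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `maxdiff_histogram` is made up for this sketch, "size" is taken to be the number of distinct values in a bucket (which matches the buckets on the slide), and ties between equal differences are broken in favor of the earlier position, which is an assumption.

```python
def maxdiff_histogram(dist, num_buckets):
    """Build MaxDiff buckets (floor, ceiling, size, count).

    dist: list of (value, frequency) pairs sorted by value.
    """
    # Absolute frequency difference between each pair of adjacent values.
    diffs = [abs(dist[i + 1][1] - dist[i][1]) for i in range(len(dist) - 1)]
    # Positions of the (num_buckets - 1) largest differences.
    # Python's sort is stable, so ties go to the earlier position (assumption).
    largest = sorted(range(len(diffs)), key=lambda i: -diffs[i])[:num_buckets - 1]
    cut_after = sorted(largest)

    buckets, start = [], 0
    for cut in cut_after + [len(dist) - 1]:
        group = dist[start:cut + 1]
        floor, ceiling = group[0][0], group[-1][0]
        size = len(group)                    # number of distinct values
        count = sum(freq for _, freq in group)
        buckets.append((floor, ceiling, size, count))
        start = cut + 1
    return buckets


T = [(2, 1), (3, 2), (4, 1), (5, 3), (6, 1), (7, 2), (8, 2)]
buckets = maxdiff_histogram(T, 3)
# Approximate frequency of a value in the last bucket: count / size.
f_7 = buckets[-1][3] / buckets[-1][2]
```

With the slide's distribution, `buckets` comes out as the three buckets listed in step 2, and `f_7` is 5/3.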
Argument Independence Assumption
• Common in database size estimates
• Data distributions of different arguments are assumed to be independent of each other
  • For example, in predicate p(X,Y), the data distributions of X and Y are independent
• Joint data distribution can be easily computed from individual distributions
  • E.g., p(X=a, Y=b) = p(X=a) × p(Y=b)
• Unfortunately, the independence assumption is almost always wrong in real datasets
Example: Histogram + Independence = Poor Estimate
• answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
• Facts: e(2,2), … as in Example 1 of the paper.
• Histogram buckets of e
  • X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
  • Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
• Size estimate
  • Answer size estimate for each bucket: size(answer) = |[floor, ceiling] ∩ [5,7]| / |[floor, ceiling]| × count
  • Summing over the X buckets: size(answer) = 0 + 3 + 3.33 = 6.33
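The per-bucket formula can be checked with a few lines of Python. This is a sketch under the assumption that the argument ranges over integers, so that |[floor, ceiling]| = ceiling - floor + 1; the function name is illustrative only.

```python
def range_selection_estimate(buckets, lo, hi):
    """Estimate how many facts satisfy lo <= X <= hi from X's histogram."""
    total = 0.0
    for floor, ceiling, _size, count in buckets:
        # Number of integer points of the bucket that fall inside [lo, hi].
        overlap = max(0, min(ceiling, hi) - max(floor, lo) + 1)
        width = ceiling - floor + 1
        total += overlap / width * count
    return total


x_buckets = [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
estimate = range_selection_estimate(x_buckets, 5, 7)
```

The three buckets contribute 0, 3, and 2/3 × 5, so `estimate` is 19/3, which the slide rounds to 6.33.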
Example: Histogram + Independence = Poor Estimate
• Histogram buckets of e
  • X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
  • Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
• Histogram buckets of answer
  • X: (5,5,1,3) (6,7,2,3.33)
  • Y: (1,1,1,0.53) (2,4,3,1.58) (5,8,4,4.22)
  • answer.count = e.count × size(answer)/size(e)
• Real results for answer.Y
  • (1,1,1,0) (2,4,3,0) (5,8,4,6)
• Independence causes information loss
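The Y buckets of answer follow from uniformly rescaling the Y buckets of e, which is exactly where the independence assumption enters. A short Python check, assuming size(e) is the sum of the bucket counts (12) and size(answer) is the unrounded estimate 19/3 ≈ 6.33:

```python
y_buckets_e = [(1, 1, 1, 1), (2, 4, 3, 3), (5, 8, 4, 8)]
size_e = sum(count for _, _, _, count in y_buckets_e)   # 12 facts in e
size_answer = 19 / 3                                    # estimate from the X histogram

# Independence: every Y bucket shrinks by the same factor,
# regardless of which X values its facts actually have.
scale = size_answer / size_e
y_buckets_answer = [
    (floor, ceiling, size, round(count * scale, 2))
    for floor, ceiling, size, count in y_buckets_e
]

# Actual Y distribution of answer, as reported on the slide.
real_counts = [0, 0, 6]
errors = [est[3] - real for est, real in zip(y_buckets_answer, real_counts)]
```

The rescaled counts match the slide's estimate, while `errors` shows that the first two buckets are overestimated and the last one is underestimated.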
Our Approach: Dependency Matrices
• Only considers dependency matrices (DM) for binary predicates
• Partitions facts into local groups
• Sums up the groups into DM values
• Sums up each row/column into (floor, ceiling, size)
Example: DM
• Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
• Partition the fact matrix using MaxDiff
• Sum up partitions into matrix values
• Sum up each row/column into (floor, ceiling, size)
  • First argument: (2,4,3) (5,5,1) (6,8,3)
  • Second argument: (1,1,1) (2,4,3) (5,8,4)
[Figure: fact matrix and the resulting dependency matrix]
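A dependency matrix can be sketched in Python as a table of fact counts per (X bucket, Y bucket) cell. The fact list below is hypothetical: only e(2,2) appears on the slides, and the remaining facts were chosen for illustration so that they agree with the histograms shown earlier. The bucket boundaries are taken as given instead of being recomputed with MaxDiff, and the function names are illustrative only.

```python
def bucket_index(buckets, value):
    """Return the position of the bucket whose [floor, ceiling] contains value."""
    for k, (floor, ceiling, _size) in enumerate(buckets):
        if floor <= value <= ceiling:
            return k
    raise ValueError(f"value {value} is outside all buckets")


def dependency_matrix(facts, x_buckets, y_buckets):
    """Count the facts that fall into each (X bucket, Y bucket) cell."""
    dm = [[0] * len(y_buckets) for _ in x_buckets]
    for x, y in facts:
        dm[bucket_index(x_buckets, x)][bucket_index(y_buckets, y)] += 1
    return dm


# Hypothetical facts for e(X, Y); NOT the paper's Example 1.
facts = [(2, 2), (3, 3), (3, 5), (4, 6), (5, 5), (5, 6), (5, 7),
         (6, 5), (7, 6), (7, 8), (8, 1), (8, 2)]
x_buckets = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]   # (floor, ceiling, size)
y_buckets = [(1, 1, 1), (2, 4, 3), (5, 8, 4)]

dm = dependency_matrix(facts, x_buckets, y_buckets)
# dm[i][j] = number of facts with X in x_buckets[i] and Y in y_buckets[j]
```

Unlike two separate histograms, the matrix records which Y ranges occur together with which X ranges, so the dependency between the arguments is kept.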
SDP for Selection by Example
• answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
• From the fact matrix, we know that size(answer) = Σ F(i,j) for 5 ≤ i ≤ 7 = 6
SDP for Selection by Example
• answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
• Extract the portions covered by the selection
  • X buckets before selection: (2,4,3) (5,5,1) (6,8,3); after selection: (5,5,1) (6,7,2)
  • Y buckets before and after selection: (1,1,1) (2,4,3) (5,8,4)
• Recompute matrix values
• Sum them up as size(answer) = 3 + .67 + .67 + 2 = 6.34
• For each row, recompute (floor, ceiling, size)
[Figure: dependency matrix of e before and after the selection]
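The selection step can be sketched in Python as scaling each X-bucket row by the fraction of the bucket that the selection covers. The matrix values below are an assumption: the slide's figure did not survive, so they were chosen to be consistent with the sum 3 + .67 + .67 + 2 and with the histograms of e. The sketch also assumes integer domains and uniform spread within a bucket, and the function name is made up.

```python
def select_x_range(dm, x_buckets, lo, hi):
    """Restrict a dependency matrix to lo <= X <= hi.

    Each row is scaled by the covered fraction of its X bucket
    (uniform-spread assumption inside a bucket).
    """
    new_buckets, new_dm = [], []
    for (floor, ceiling, size), row in zip(x_buckets, dm):
        new_floor, new_ceiling = max(floor, lo), min(ceiling, hi)
        if new_floor > new_ceiling:
            continue                      # bucket lies outside the selection
        fraction = (new_ceiling - new_floor + 1) / (ceiling - floor + 1)
        new_buckets.append((new_floor, new_ceiling, size * fraction))
        new_dm.append([value * fraction for value in row])
    return new_buckets, new_dm


x_buckets = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]   # (floor, ceiling, size)
dm = [[0, 2, 2],                                # assumed values, rows = X buckets,
      [0, 0, 3],                                # columns = Y buckets (1,1) (2,4) (5,8)
      [1, 1, 3]]

new_buckets, new_dm = select_x_range(dm, x_buckets, 5, 7)
size_answer = sum(sum(row) for row in new_dm)
```

Here `size_answer` is 19/3 ≈ 6.33; the slide's 6.34 comes from adding the already rounded terms. Most of the estimated mass stays in the Y bucket (5,8), in line with the real distribution of answer.Y.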
Example: Sort-Merge-Join
• answer(X,Z) :- a(X,Y), b(Y,Z).
• middle(X,Y,Z) is introduced for ease of explanation
• Facts of a: ..., a(4,3), a(4,4), ...
• Facts of b: ..., b(3,1), b(3,5), b(4,5), ...
• middle: ..., (4,3,1), (4,3,5), (4,4,5), ...
• answer: ..., (4,1), (4,5), (4,5), ... Duplicates!
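For readers who want to trace the example, here is a small Python sketch of a sort-merge join on the facts listed above. Only the facts shown on the slide are used (the elided ones are left out), and the function name is illustrative.

```python
def sort_merge_join(a, b):
    """Join a(X, Y) with b(Y, Z) on Y and return middle(X, Y, Z) tuples."""
    a_sorted = sorted(a, key=lambda t: t[1])   # sort a on its second argument
    b_sorted = sorted(b, key=lambda t: t[0])   # sort b on its first argument
    middle = []
    i = j = 0
    while i < len(a_sorted) and j < len(b_sorted):
        ya, yb = a_sorted[i][1], b_sorted[j][0]
        if ya < yb:
            i += 1
        elif ya > yb:
            j += 1
        else:
            # Find the run of tuples sharing this Y value on each side.
            i_end = i
            while i_end < len(a_sorted) and a_sorted[i_end][1] == ya:
                i_end += 1
            j_end = j
            while j_end < len(b_sorted) and b_sorted[j_end][0] == ya:
                j_end += 1
            # Pair every tuple of one run with every tuple of the other.
            for x, y in a_sorted[i:i_end]:
                for _, z in b_sorted[j:j_end]:
                    middle.append((x, y, z))
            i, j = i_end, j_end
    return middle


a = [(4, 3), (4, 4)]
b = [(3, 1), (3, 5), (4, 5)]
middle = sort_merge_join(a, b)
answer = [(x, z) for x, _, z in middle]   # projection keeps duplicates
distinct_answer = set(answer)
```

The projection yields (4,5) twice, once via Y=3 and once via Y=4, so `answer` has three entries while `distinct_answer` has two.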
SDP for Join by Example
• answer(X,Z) :- a(X,Y), b(Y,Z).
• Simulate Sort-Merge-Join: align the Y buckets of a and b
  A.X      A.Y      A.Val
  (2,4,3)  (1,1,1)  2
  (2,4,3)  (2,4,2)  4
  B.Y      B.Z      B.Val
  (1,1,1)  (6,8,2)  1
  (2,4,2)  (6,8,2)  3
[Figure: dependency matrices of a and b; bucket labels shown: (1,1,1) (6,8,2) (2,4,2) (9,9,1) (2,4,3) (1,1,1) (5,5,1) (2,4,2)]
SDP for Join by Example
• answer(X,Z) :- a(X,Y), b(Y,Z).
  A.X      A.Y      A.Val
  (2,4,3)  (1,1,1)  2
  (2,4,3)  (2,4,2)  4
  B.Y      B.Z      B.Val
  (1,1,1)  (6,8,2)  1
  (2,4,2)  (6,8,2)  3
• Result size of middle(X,Y,Z) can be estimated as
  • min(A.Y.size, B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size)
• Examples:
  • size(middle((2,4,3),(1,1,1),(6,8,2))) ≈ min(1,1) × (2/1) × (1/1) = 2
  • size(middle((2,4,3),(2,4,2),(6,8,2))) ≈ min(2,2) × (4/2) × (3/2) = 6
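The estimate is a one-line formula, shown here as a Python function applied to the two aligned bucket pairs. The function and parameter names are illustrative; the inputs are the sizes and values from the aligned rows above.

```python
def middle_size_estimate(a_y_size, a_val, b_y_size, b_val):
    """Estimated number of middle(X, Y, Z) tuples for one aligned pair of cells.

    a_y_size, b_y_size: number of distinct Y values in the aligned Y buckets
    a_val, b_val:       dependency-matrix values of the two cells
    """
    return min(a_y_size, b_y_size) * (a_val / a_y_size) * (b_val / b_y_size)


# A cell ((2,4,3), (1,1,1)) with value 2, B cell ((1,1,1), (6,8,2)) with value 1
est_first = middle_size_estimate(a_y_size=1, a_val=2, b_y_size=1, b_val=1)
# A cell ((2,4,3), (2,4,2)) with value 4, B cell ((2,4,2), (6,8,2)) with value 3
est_second = middle_size_estimate(a_y_size=2, a_val=4, b_y_size=2, b_val=3)
```

One way to read the formula: the number of matching Y values is at most min(A.Y.size, B.Y.size), and each matching value pairs A.Val/A.Y.size tuples of a with B.Val/B.Y.size tuples of b on average.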
SDP for Join by Example
• answer(X,Z) :- a(X,Y), b(Y,Z).
• Examples:
  • middle((2,4,3),(1,1,1),(6,8,2)) → answer((2,4,3),(6,8,2))
  • middle((2,4,3),(2,4,2),(6,8,2)) → answer((2,4,3),(6,8,2))
  • Both project onto the same answer cell: duplicates!
• Three duplicate handling approaches
  • Sum: no duplicate removal
  • Max: most aggressive removal
  • Expected sum: remove the “expected” number of duplicates
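The first two approaches can be illustrated in Python on the two middle estimates that land in the same answer cell. This sketch rests on one reading of the slide: "Sum" adds the contributions and "Max" keeps only the largest one. The values 2 and 6 come from evaluating the size formula for middle on the two aligned bucket pairs. The expected-sum variant is not spelled out on the slides, so it is deliberately left out here.

```python
from collections import defaultdict

# (answer cell, estimated number of middle tuples projected onto it)
middle_estimates = [
    (((2, 4, 3), (6, 8, 2)), 2.0),   # via Y bucket (1,1,1)
    (((2, 4, 3), (6, 8, 2)), 6.0),   # via Y bucket (2,4,2)
]

contributions = defaultdict(list)
for cell, estimate in middle_estimates:
    contributions[cell].append(estimate)

# Sum: treat every projected tuple as distinct (no duplicate removal).
sum_estimate = {cell: sum(vals) for cell, vals in contributions.items()}
# Max: assume smaller contributions are fully covered by the largest one.
max_estimate = {cell: max(vals) for cell, vals in contributions.items()}
```

Under this reading, the two approaches bracket the answer-cell estimate between 6 and 8, with the expected-sum approach aiming at a value in between.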
SDP for Recursive Predicates
• Recursive predicates are computed incrementally until they reach approximate fixed points
• Size reaches an α-approximate fixed point if Δ(size)/size ≤ α, where
  • Δ(…) is the difference between two consecutive iterations of the fixed point computation
  • 0 ≤ α ≤ 1
Example: Recursive Predicates
• Transitive closure
  path(X,Y) :- edge(X,Y).             (base)
  path(X,Y) :- edge(X,Z), path(Z,Y).  (rec)
• Computation of the estimate:
  1. Compute size(path) and DM(path) using rule base
  2. Compute size(path) and DM(path) using rule rec, as in the case of a join
  3. If size(path) reaches an approximate fixed point, stop; otherwise, go to step 2
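The control flow of steps 1 to 3 can be sketched in Python. To keep the sketch short and runnable it iterates over actual tuple sets instead of dependency matrices, so it shows the stopping rule and not the SDP estimation itself. The edge data is hypothetical, and dividing Δ(size) by the size after the iteration is an assumption, since the slides do not say which of the two sizes is meant.

```python
def approx_path_size(edges, alpha):
    """Iterate the transitive-closure rules until an alpha-approximate fixed point."""
    path = set(edges)                        # step 1: rule (base)
    size = len(path)
    while True:
        # step 2: rule (rec), path(X,Y) :- edge(X,Z), path(Z,Y).
        derived = {(x, y) for (x, z) in edges for (z2, y) in path if z == z2}
        new_path = path | derived
        delta = len(new_path) - size
        path, size = new_path, len(new_path)
        # step 3: stop when the relative growth is at most alpha
        if size == 0 or delta / size <= alpha:
            return size


# Hypothetical chain graph 1 -> 2 -> 3 -> 4 -> 5
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
exact_size = approx_path_size(edges, alpha=0.0)    # runs to the true fixed point
early_size = approx_path_size(edges, alpha=0.25)   # stops once growth <= 25%
```

On this chain the sizes grow 4, 7, 9, 10, so α = 0 returns the exact closure size 10, while α = 0.25 stops one iteration earlier at 9. A larger α trades accuracy for fewer iterations.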
Experimental Studies
• Test programs:
  • Transitive closure
  • General same generation
• Datasets: generated with the Thomas process and the Matérn cluster process
• Results
  • SDP estimates converge to real sizes for recursive predicates
  • Expected sum is good for duplicate removal
• Details in the paper
Experimental Studies
• SDP estimates converge to real sizes for recursive predicates
[Figure: Transitive Closure]
Experimental Studies
• Expected sum is good for duplicate removal
[Figure: Transitive Closure]
Conclusion
• Dependency matrix for binary predicates
• Overcomes problems with the argument independence assumption
• SDP for selection, join, and recursion
• Experimental validation
Future Work
• More complex recursions
• Negation
• Extending SDP to n-ary predicates
• Applying cost-based optimization in deductive systems, such as XSB