Senlin Liang and Michael Kifer Stony Brook University

Deriving Predicate Statistics (SDP) in Datalog
Principles and Practice of Declarative Programming, 12th International ACM SIGPLAN Symposium
July 26, 2010, Hagenberg, Austria
Senlin Liang and Michael Kifer, Stony Brook University

Summary of Our Approach
• Motivation
  • Take advantage of cost-based optimizations in deductive database systems
  • Compute cost information (predicate statistics)
  • Store and retrieve cost information efficiently
  • Apply optimization techniques
• Advantages of our approach
  • Keeps argument dependencies
  • Handles recursion
  • Handles negation
"Deriving Predicate Statistics in Datalog" by Senlin Liang and Michael Kifer

Outline
• Introduction
  • Traditional approach: histograms + argument independence assumption
  • Error grows exponentially
• SDP
  • Dependency matrix stores predicate statistics
  • Abstract interpretation of Datalog rules, which are evaluated over dependency matrices
• Experimental studies
• Future work

Histograms
• Data distribution: T = ((v1, f1), …, (vn, fn))
  • E.g. ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
• Histograms
  • Partition the data distribution into groups
  • Summarize each group as a bucket: (floor, ceiling, size, count)
  • Compute the values and frequencies in each bucket efficiently
• MaxDiff histograms with β buckets
  • Partition T using the β-1 largest frequency differences

Example: MaxDiff Histograms (3 buckets)
1. Partition T using the 2 largest frequency differences
   T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
   Differences between adjacent frequencies: 1 1 2 2 1 0
2. Summarize each group as (floor, ceiling, size, count)
   (2,4,3,4) (5,5,1,3) (6,8,3,5)
3. Value-frequency approximation
   vals(bucket) = [floor, ceiling]; f(val) = count/size, e.g. f(7) = 5/3
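The three steps above can be sketched in Python. This is only an illustration: the function name `maxdiff_histogram` is made up here, and breaking ties between equal differences in favor of the earlier position is an assumption that the slide does not state.

```python
def maxdiff_histogram(dist, num_buckets):
    """Build MaxDiff buckets (floor, ceiling, size, count).

    dist is a list of (value, frequency) pairs sorted by value.
    """
    # Absolute frequency difference between each pair of adjacent values.
    diffs = [abs(dist[i + 1][1] - dist[i][1]) for i in range(len(dist) - 1)]
    # Positions of the (num_buckets - 1) largest differences.
    # Assumption: ties are broken in favor of the earlier position.
    order = sorted(range(len(diffs)), key=lambda i: (-diffs[i], i))
    cuts = sorted(order[:num_buckets - 1])

    buckets, start = [], 0
    for cut in cuts + [len(dist) - 1]:
        group = dist[start:cut + 1]
        floor, ceiling = group[0][0], group[-1][0]
        count = sum(freq for _, freq in group)
        buckets.append((floor, ceiling, len(group), count))
        start = cut + 1
    return buckets


T = [(2, 1), (3, 2), (4, 1), (5, 3), (6, 1), (7, 2), (8, 2)]
buckets = maxdiff_histogram(T, 3)
```

With the distribution T from the slide, `buckets` holds the three buckets listed in step 2.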

Argument Independence Assumption
• Common in database size estimates
• Data distributions of different arguments are independent of each other
  • For example, in predicate p(X,Y), the data distributions of X and Y are independent
• Joint data distribution can be easily computed from individual distributions
  • E.g., p(X=a, Y=b) = p(X=a) × p(Y=b)
• Unfortunately, the independence assumption is almost always wrong in real datasets

Example: Histogram + Independence = Poor Estimate
• answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
• Facts: e(2,2), … as in Example 1 of the paper.
• Histogram buckets of e
  • X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
  • Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
• Size estimate
  • Answer size estimate for each bucket: size(answer) = |[floor, ceiling] ∩ [5,7]| / |[floor, ceiling]| × count
  • Summed over the X buckets: size(answer) = 6.33
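The per-bucket estimate can be checked with a short Python sketch. It assumes that |[floor, ceiling]| counts the integers in the interval; the helper names are illustrative and do not come from the paper.

```python
def overlap(floor, ceiling, lo, hi):
    """Number of integers shared by [floor, ceiling] and [lo, hi]."""
    return max(0, min(ceiling, hi) - max(floor, lo) + 1)


def estimate_selection_size(buckets, lo, hi):
    """Sum, over all buckets, the fraction of the bucket covered by [lo, hi] times its count."""
    total = 0.0
    for floor, ceiling, _size, count in buckets:
        width = ceiling - floor + 1  # assumption: integer-valued domain
        total += overlap(floor, ceiling, lo, hi) / width * count
    return total


# X buckets of e, as (floor, ceiling, size, count), taken from the slide.
x_buckets = [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
size_answer = estimate_selection_size(x_buckets, 5, 7)
```

The first bucket contributes 0, the second 3, and the third 2/3 × 5, which gives the 6.33 on the slide.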

Example: Histogram + Independence = Poor Estimate
• Histogram buckets of e
  • X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
  • Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
• Histogram buckets of answer
  • X: (5,5,1,3) (6,7,2,3.33)
  • Y: (1,1,1,0.53) (2,4,3,1.58) (5,8,4,4.22)
  • answer.count = e.count × size(answer)/size(e)
• Real results for answer.Y
  • (1,1,1,0) (2,4,3,0) (5,8,4,6)
• Independence causes information loss
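A small Python sketch of the scaling rule answer.count = e.count × size(answer)/size(e) reproduces the estimated Y buckets. It takes size(e) to be the sum of the bucket counts (12 here), which is an assumption consistent with the numbers on the slide; the function name is illustrative.

```python
def scale_buckets(buckets, size_answer, size_e):
    """Scale every bucket count by size(answer)/size(e), as independence implies."""
    ratio = size_answer / size_e
    return [(floor, ceiling, size, count * ratio)
            for floor, ceiling, size, count in buckets]


# Y buckets of e, as (floor, ceiling, size, count), taken from the slide.
y_buckets_e = [(1, 1, 1, 1), (2, 4, 3, 3), (5, 8, 4, 8)]
size_e = sum(count for _, _, _, count in y_buckets_e)  # 12 facts in total
size_answer = 3 + 5 * 2 / 3                            # selection estimate, about 6.33

y_buckets_answer = scale_buckets(y_buckets_e, size_answer, size_e)
estimated_counts = [round(count, 2) for _, _, _, count in y_buckets_answer]
```

The estimate spreads answers over all three Y buckets, while the real results place all 6 answers in the bucket (5,8,4).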

Our Approach: Dependency Matrices
• Only considers dependency matrices (DM) for binary predicates
• Partitions facts into local groups
• Sums up the groups into DM values
• Sums up each row/column into (floor, ceiling, size)

Example: DM
• Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
• Partition the fact matrix using MaxDiff
• Sum up partitions into matrix values

Example: DM
• Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
• Partition the fact matrix using MaxDiff
• Sum up partitions into matrix values
• Sum up each row/column into (floor, ceiling, size)
  • X: (2,4,3) (5,5,1) (6,8,3)
  • Y: (1,1,1) (2,4,3) (5,8,4)
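A minimal Python sketch of summing facts into DM values follows. The bucket boundaries are taken as given (the MaxDiff partitioning step is not repeated here), and the fact list is hypothetical because the slide elides the facts of Example 1; only e(2,2) is named there.

```python
def bucket_index(value, buckets):
    """Return the position of the (floor, ceiling) bucket that contains value."""
    for idx, (floor, ceiling) in enumerate(buckets):
        if floor <= value <= ceiling:
            return idx
    raise ValueError(f"{value} is not covered by any bucket")


def dependency_matrix(facts, x_buckets, y_buckets):
    """Count how many facts p(x, y) fall into each (X bucket, Y bucket) cell."""
    dm = [[0] * len(y_buckets) for _ in x_buckets]
    for x, y in facts:
        dm[bucket_index(x, x_buckets)][bucket_index(y, y_buckets)] += 1
    return dm


# Bucket boundaries from the slide; sizes are omitted because they are not needed here.
x_buckets = [(2, 4), (5, 5), (6, 8)]
y_buckets = [(1, 1), (2, 4), (5, 8)]
# Hypothetical facts, for illustration only (not the paper's Example 1).
facts = [(2, 2), (3, 5), (5, 6), (5, 7), (8, 1)]

dm = dependency_matrix(facts, x_buckets, y_buckets)
```

Each cell keeps a joint count for one X bucket and one Y bucket, which is how the DM retains the dependency between the two arguments.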

SDP for Selection by Example
• answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
• From the fact matrix, we know that size(answer) = Σ F(i,j) for 5 ≤ i ≤ 7 = 6

SDP for Selection by Example
• answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
• DM of e: X buckets (2,4,3) (5,5,1) (6,8,3); Y buckets (1,1,1) (2,4,3) (5,8,4)
• DM of answer: X buckets (5,5,1) (6,7,2); Y buckets (1,1,1) (2,4,3) (5,8,4)
• Extract the portions covered by the selection
• Recompute matrix values
• Sum them up as size(answer) = 3 + .67 + .67 + 2 = 6.34
• For each row, recompute (floor, ceiling, size)
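The selection steps can be sketched in Python as below. Two things are assumptions: values are taken to be spread uniformly inside a bucket, and the placement of the DM values across Y buckets is a reconstruction (the slide only gives the terms 3, .67, .67 and 2, and the row for X in [2,4] is hypothetical; it lies outside the selection and does not affect the result).

```python
def select_on_x(x_buckets, dm, lo, hi):
    """Restrict a dependency matrix to rows overlapping lo <= X <= hi.

    Assumption: values are uniformly spread inside each bucket, so a partly
    covered row is scaled by the covered fraction of its value range.
    """
    new_buckets, new_dm = [], []
    for (floor, ceiling, size), row in zip(x_buckets, dm):
        new_floor, new_ceiling = max(floor, lo), min(ceiling, hi)
        if new_floor > new_ceiling:
            continue  # row lies entirely outside the selection
        fraction = (new_ceiling - new_floor + 1) / (ceiling - floor + 1)
        new_buckets.append((new_floor, new_ceiling, round(size * fraction)))
        new_dm.append([value * fraction for value in row])
    return new_buckets, new_dm


x_buckets = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]
# Rows follow x_buckets; columns are the Y buckets (1,1,1) (2,4,3) (5,8,4).
dm_e = [
    [0, 2, 2],  # hypothetical split of this row's 4 facts
    [0, 0, 3],  # row total 3; column placement is an assumption
    [1, 1, 3],  # implied by the slide's terms .67 + .67 + 2
]

answer_buckets, dm_answer = select_on_x(x_buckets, dm_e, 5, 7)
size_answer = sum(sum(row) for row in dm_answer)
```

The unrounded total is about 6.33; the slide's 6.34 comes from adding the already rounded terms.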

Example: Sort-Merge-Join
• answer(X,Z) :- a(X,Y), b(Y,Z)
• middle(X,Y,Z) is for the ease of explanation
  • a: …, a(4,3), a(4,4), …
  • b: …, b(3,1), b(3,5), b(4,5), …
  • middle: …, (4,3,1), (4,3,5), (4,4,5), …
  • answer: …, (4,1), (4,5), (4,5), … (duplicates!)

SDP for Join by Example
• answer(X,Z) :- a(X,Y), b(Y,Z).
• Simulate Sort-Merge-Join: align the Y buckets of a and b
[Figure: dependency matrices of a and b, with bucket labels (1,1,1) (6,8,2) (2,4,2) (9,9,1) (2,4,3) (1,1,1) (5,5,1) (2,4,2)]
A.X, A.Y, A.Val
(2,4,3), (1,1,1), 2
(2,4,3), (2,4,2), 4
B.Y, B.Z, B.Val
(1,1,1), (6,8,2), 1
(2,4,2), (6,8,2), 3

SDP for Join by Example
• answer(X,Z) :- a(X,Y), b(Y,Z).
A.X, A.Y, A.Val
(2,4,3), (1,1,1), 2
(2,4,3), (2,4,2), 4
B.Y, B.Z, B.Val
(1,1,1), (6,8,2), 1
(2,4,2), (6,8,2), 3
• Result size of middle(X,Y,Z) can be estimated as
  • min(A.Y.size, B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size)
• Examples:
  • size(middle((2,4,3),(1,1,1),(6,8,2))) ~ min(1,1) × (2/1) × (1/1)
  • size(middle((2,4,3),(2,4,2),(6,8,2))) ~ min(2,2) × (4/2) × (3/2)
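The estimate for middle(X,Y,Z) is simple enough to evaluate directly. The Python sketch below applies the formula to the two aligned bucket pairs from the table; the function and parameter names are illustrative.

```python
def middle_size(a_y_size, a_val, b_y_size, b_val):
    """Estimate the join size for one pair of aligned Y buckets.

    Implements min(A.Y.size, B.Y.size) * (A.Val / A.Y.size) * (B.Val / B.Y.size).
    """
    return min(a_y_size, b_y_size) * (a_val / a_y_size) * (b_val / b_y_size)


# Y bucket (1,1,1): A.Val = 2, B.Val = 1
first = middle_size(a_y_size=1, a_val=2, b_y_size=1, b_val=1)
# Y bucket (2,4,2): A.Val = 4, B.Val = 3
second = middle_size(a_y_size=2, a_val=4, b_y_size=2, b_val=3)
```

The two examples on the slide evaluate to 2 and 6 respectively.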

SDP for Join by Example
• answer(X,Z) :- a(X,Y), b(Y,Z).
• Examples (duplicates!):
  • middle((2,4,3),(1,1,1),(6,8,2)) → answer((2,4,3),(6,8,2))
  • middle((2,4,3),(2,4,2),(6,8,2)) → answer((2,4,3),(6,8,2))
• Three duplicate handling approaches
  • Sum: no duplicate removal
  • Max: most aggressive removal
  • Expected sum: remove the "expected" number of duplicates

SDP for Recursive Predicates
• Recursive predicates are computed incrementally until they reach approximate fixed points
• Size reaches an α-approximate fixed point if Δ(size)/size ≤ α, where
  • Δ(…) is the difference between two consecutive iterations in the fixed point computation
  • 0 ≤ α ≤ 1

Example: Recursive Predicates
• Transitive closure
  path(X,Y) :- edge(X,Y). (base)
  path(X,Y) :- edge(X,Z), path(Z,Y). (rec)
• Computation of the estimate:
  1. Compute size(path) and DM(path) using rule base
  2. Compute size(path) and DM(path) using rule rec, as in the case of a join
  3. If size(path) reaches an approximate fixed point, stop; otherwise, go to step 2
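The stopping rule can be sketched as a generic loop in Python. Here `apply_rec` stands in for step 2 (re-estimating size(path) through the join) and is a hypothetical callback; dividing Δ(size) by the newer of the two sizes is also an assumption, since the slides do not say which iteration's size is the denominator.

```python
def approx_fixed_point(base_size, apply_rec, alpha, max_iters=1000):
    """Iterate a size estimate until it reaches an alpha-approximate fixed point.

    apply_rec maps the current size estimate to the next one.
    Returns (final size estimate, number of iterations performed).
    """
    size = base_size
    for iteration in range(1, max_iters + 1):
        new_size = apply_rec(size)
        delta = abs(new_size - size)
        # Assumption: delta is compared against the newer size.
        if new_size == 0 or delta / new_size <= alpha:
            return new_size, iteration
        size = new_size
    raise RuntimeError("no approximate fixed point within max_iters")


def toy_rec(size):
    """Hypothetical step: each round closes half of the gap to a true size of 100."""
    return size + 0.5 * (100.0 - size)


estimate, iterations = approx_fixed_point(20.0, toy_rec, alpha=0.05)
```

A smaller α demands more iterations before stopping; with this toy step and α = 0.05 the loop stops once the estimate changes by no more than 5% in one round.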

Experimental Studies
• Test programs:
  • Transitive closure
  • General same generation
• Datasets: generated with Thomas Process and Matern Cluster Process
• Results
  • SDP estimates converge to real sizes for recursive predicates
  • Expected sum is good for duplicate removal
• Details in the paper

Experimental Studies
• SDP estimates converge to real sizes for recursive predicates
[Figure: Transitive Closure]

Experimental Studies
• Expected sum is good for duplicate removal
[Figure: Transitive Closure]

Conclusion
• Dependency matrix for binary predicates
• Overcomes problems with the argument independence assumption
• SDP for selection, join, and recursion
• Experimental validations

Future Work
• More complex recursions
• Negation
• Extending SDP to n-ary predicates
• Applying cost-based optimization in deductive systems, such as XSB
