250 likes | 385 Views
Constraint Processing Techniques for Improving Join Computation: A Proof of Concept. Anagh Lal & Berthe Y. Choueiry Constraint Systems Laboratory Department of Computer Science & Engineering University of Nebraska-Lincoln. An illustrative example. Join query SELECT R1.A,R1.B,R1.C
E N D
Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal & Berthe Y. Choueiry Constraint Systems Laboratory Department of Computer Science & Engineering University of Nebraska-Lincoln
An illustrative example • Join query SELECT R1.A,R1.B,R1.C FROM R1,R2 WHERE R1.A=R2.A AND R1.B=R2.B AND R1.C=R2.C • 10 tuples in 3 nested tuples
Advantages • Direct • Savings of number of tuple comparisons • Savings in I/O for next operator • Space reduction of materialized join queries • Future applications • Use for query size estimation • Assist in high-level analysis of data & in data mining
Our contributions • A new representation of a join query as a Constraint Satisfaction Problem (CSP) • A new sorting-based bundling algorithm • Suitable for CSPs with fewer and larger constraints (i.e., join) • Improves memory usage • A new sort-merge join algorithm for producing (dynamically) bundled tuples • Yields compact representation, saves memory space • Identification of possible applications • Data analysis • Materialized views • Assisting query-size estimation Suggested, not yet demonstrated
V1 V2 {c, d, e, f} {d} V4 V3 {a, b, d} {a, b, c} Constraint Satisfaction Problem • Given P= (V, D, C) • V = {Vi}, a set of variables • D = {DVi}, the set of their respective domains • C is a set of constraints restricting the acceptable combination of values for variables. • Solution is a consistent assignment of values to variables • Query: find 1 solution, all solutions, etc.
V1 V2 {c, d, e, f} {d} S S S V1 V1 V1 d d d V4 V3 V2 V2 V2 {a, b, d} {a, b, c} c e f d c e, f d c d, e, f Solving CSPs • Typically, DFS & backtracking • Improvement • Static bundling [Freuder 91] • Dynamic bundling [our group] • Based on dynamically identifying symmetries • Guaranteed never less efficient than non-bundling, static bundling Dynamic bundling Without bundling Static bundling
Modeling Join as a CSP • Attributes of relations CSP variables • Attribute values variable domains • Relations relational constraints • Join conditions join-condition constraints • SELECT R1.A,R1.B,R1.C • FROM R1,R2 • WHERE R1.A=R2.A • AND R1.B=R2.B • AND R1.C=R2.C
Sorting-based bundling R1.A • Heuristic for variable ordering Place variables linked by join conditions as close to each other as possible R2.A R1 R1.B R2.B R2 R1.C R2.C • Sort relations using above ordering • Next: Compute bundles of variable ahead in variable ordering (R1.A)
Bundling an attribute • Partition of a constraint Tuples of the relation having the same value of R1.A • Compare projected tuples of first partition with those of another partition • Compare with every other partition to get complete bundle Partition Unequal partitions Symmetric partitions Bundle {1, 5}
Join using dynamic bundling Select next- variable Compute next valid bundle Start Move to previous variable Found bundle? No Output one tuple Undo previous assignment No Yes 1st in Ordering? Yes Assign bundle Last variable? Yes No Stop
Finding the valid bundle Common {1, 5} • Compute a bundle for the attribute • Check bundle validity with future constraints • If no common value found GOTO 1 Assign variable with the surviving values in the bundle {1, 5, x} {1, 5, y, z}
Analysis of overheads • For Bundling • Additional data structures: 2 arrays, 1 pointer • Only 1 array may become cumbersome • Array size is largest • when all the values of a variable are in one bundle • But, this case also leads to best savings! • Improved implementation • Use of Bitmaps?
ProgressiveMerge Join • PMJ: A sort-merge algorithm by [Dittrich et al. 03] • Provides early results • Assists in query size-estimation • Two main phases • Sorting: starts producing results in this phase • Merging phase: merges sorted runs • We use the framework of the PMJ for our external join. • Implemented & evaluated with the XXL library • We use the same library for our implementation
Preliminary experiments • Data sets • Random: 2 relations R1, R2 with same schema as example • Each relation: 10’000 tuples • Memory size: 4’000 tuples • Page size 200 tuples • Real-world problem: 3 relations, 4 attributes • Compaction rate achieved • Random problem: 1.48 • Savings compensate for even worst case (of the current experimental implementation) • Real-world problem: 2.26 (69 tuples in 32 nested tuples)
Related work • Join algorithms • Well established algorithms • Do not focus on exploiting symmetry • Database compression • Output results are not compressed • Compression at value level, not tuple level
Related work (contd) • [Mamoulis & Papadias 1998] • Join using FC for spatial DB • Restricted to binary constraints • No compaction of solution space • [Bayardo et al. 1996] • Reduce the number of the intermediate tuples of a sequence of joins • [Rich et al. 1993] • Do not compact join attribute values • Does not detect redundancy present in the grouped sub-relations
Future work • Refine implementation • Use of lighter data structures • Test usefulness in the context of Constraint DBs • Values are continuous intervals, e.g. spatial database • Conduct thorough evaluations of overall performance & overhead (memory & CPU) on different data distributions • Investigate benefit of using bundling • query size estimation • materialized views
Bundling relations: Data structures • Considering the portion of the relation in memory • Current-Inst: To store the current instantiations of past variables Vpof R1. • Current-Constraint: selection of R’: • Past variable values equal Current-Inst • Current variable Vc > all previous instantiations of Vc
Bundling relations:Computing bundles (Algorithm 1) • NEXT-PARTITION(p) returns the first unchecked partition in Current-Constraint following the partition p. • Sorted constraints Checking equality of tuples is efficient
Bundling relations: Data structures • Processed-Values: Cumulatively stores non-representative values of bundles • Computing bundles of Vc Values of Vc in it are ignored • Partition pis marked as checked when: • Value(p) is in an instantiation bundle • p is selected for comparing with other partitions to check for bundles
Join computation: In memory • Two subsets of relations (some pages) in memory: • Algorithm to find result of joining the two. • Join computed as a search • Finding all solutions • After finding one solution, search resumes from same depth • Algorithm shown can be entered at any “depth” in the search • Uses Algorithm 1 to find bundles for assigning to variables
Expanded on next slide Join computation: In memory • Join as a search (Algo. 2) • BACKTRACK • Variable[depth] in Current-Inst reset • Processed-Values for the variable emptied • Value in Current-Solution reset • Current-Constraint re-computed • Undoes the effects of the previous instantiation.
Join computation: In memory • COMMON(bi, bundles) subset of bi consistent using join-condition constraints • For equality COMMON Intersection • Empty result of COMMON inconsistency BACKTRACK