220 likes | 242 Views
GORDIAN offers efficient composite key discovery for data modeling, integration, and indexing. It scales well and uses cube computation and sampling for key identification.
E N D
GORDIAN: Efficient and Scalable Discovery of Composite Keys Yannis Sismanis Peter J. Haas Berthold Reinwald Paul Brown IBM Almaden Research Center VLDB 2006
Motivation • Key discovery is of fundamental importance for • Data Modeling • Data Integration • Anomaly Detection • Query Formulation & Optimization • Indexing • Key information is often incomplete/unknown to the DBMS because: • Represents an unknown “constraint/dependency” inherent to the domain • Keys arise fortuitously from statistical properties of the data • Exploited by the application without the DBMS explicitly knowing • Cost reasons force the DBA not to explicitly identify/enforce it • Computationally Hard • Finding the minimum (least number of attributes) key: NP-Hard • Finding all the keys: #P-Hard VLDB 2006
GORDIAN in a Nutshell • Good “typical-case” behavior on real-world datasets • Artificial and unrepresentative datasets used for worst-case performance • Scales well with the number of attributes • For Zipfian data time complexity is polynomial on number of attributes • Basic idea • Cube computation • Many optimizations since we do not deal with storing/indexing or even fully computing the cube • Interleave the computation with the discovery of non-keys • A non-key can be discovered by looking at a subset of the entities • A non-key cannot be invalidated as more entities are examined • Since any subset of the attributes of a non-key is also a non-key apply “Apriori”-like pruning techniques • Compute the complement yielding the desired keys • Sampling extensions for the discovery of high-quality approximate keys VLDB 2006
Keys & Non-Keys • <EmpNo>, <Last Name,Phone>, <First Name,Phone>,… are keys • <First Name,Last Name>, <First Name>,… are non-keys • A non-key Kcovers a non-key K’ iff K’K • A set of non-keys {K1,K2,…} is non-redundant iff KiKj ,i≠j VLDB 2006
Cube Operator / NoKeys … • If any count >1 then the corresponding projection is a non-key • Computing the whole cube and then start looking for non-keys is very inefficient VLDB 2006
segments … Cube Slices/Segments • A selection on the cube returns a “slice” • i.e. FirstName=‘Michael’ • A slice consists of “segments” • GORDIAN is based on a slice-by-slice computation … Slice FirstName=‘Michael’ VLDB 2006
Slice Subsumption & Non-keys Slice F : First Name = ‘Michael’ … … Slice L : Last Name=‘Thompson’ • A slice is L “subsumed” by another slice F (LF) iff all the segments of L contain only tuples that can be extracted from the tuples of F by just projecting out some attributes • Subsumption Lemma: If LF then every non-key of L is redundant to some non-key of F • <LastName> is redundant to <FirstName,LastName> VLDB 2006
GORDIAN Overview VLDB 2006
Prefix tree VLDB 2006
Slice-by-Slice Computation • Slice corresponds to a path from the root to a node • I.e. FirstName=“Michael” and LastName=“Thompson” corresponds to the path (1),(2) • Segments of the slice are computing by “merging” sub-trees (next slide) • Slice-by-Slice computation = Systematic Traversal of the prefix Tree • Discovers subsumed slices VLDB 2006
(M1) 1 10 1 50 Compute segments by “merging” • Slice FirstName=‘Michael’ and LastName=‘Thompson’ • Path (1),(2) to node (3) • Compute segment <EmpNo> by merging the children trees of (3) (1) Michael First Name [0] (2) Thompson Last Name [1] (3) 3478 6791 Phone [2] (4) (5) 1 10 1 50 Emp No [3] VLDB 2006
After merging nodes (4) and (5): (M1) 10 1 50 1 After merging nodes (4), (5) and (7): (M3) 10 1 50 1 90 1 After merging nodes (2) and (8): (M4) Thompson Spencer Kwan (3) (9) (6) After merging nodes (3) and (6): After merging nodes (3),(6) and (9): (M5) 3478 6791 5237 (M2) 3478 6791 5237 (M6) 10 1 20 1 (5) (7) (4) (5) (7) After merging nodes (M6), (5) and (7): <First Name> (M7) 10 1 20 1 50 1 90 1 GORDIAN in action (1) Michael Sally First Name [0] (2) (8) Thompson Spencer Kwan Last Name [1] (3) (6) (9) 3478 6791 5237 3478 Phone [2] (4) (5) (7) (10) 1 1 1 10 1 50 90 20 Emp No [3] <First Name, Last Name> <Phone> VLDB 2006
Subsumption in Prefix-Trees • Subsumption appears because of • Correlations • (Ref1), (Ref2) have been previously traversed and point to subsumed slices • Sparsity • A root-to-node path points to a sparse area with a small number of tuples (a) Correlation (b) Sparsity VLDB 2006
Complexity • Zipfian Datasets • Singleton pruning only due to sparsity • No correlations (conservative approach) • Time complexity: Where:s the number of non-redundant non-keysd the number of attributesT the number entities andθ the skew of the data VLDB 2006
Sampling • A key for the whole dataset is a key in any sample • No missing keys • False “positives” • Discovery of approximate keys • Strength of an approximate key = • Tight Lower Bound on Strength : where:N is the sample sizeDv the distinct count VLDB 2006
Experiment Setup • GORDIAN implemented as a UDF for DB2 v8.2 • Performance comparison w.r.t • Number of Entities • Number of Attributes • Pruning Evaluation • Accuracy Evaluation (Approximate keys) VLDB 2006
Time Comparison VLDB 2006
Attribute Scalability VLDB 2006
Pruning Effect VLDB 2006
Accuracy Evaluation (Sampling) VLDB 2006
Conclusions • GORDIAN has excellent “typical-case” behavior on real-world datasets • Very good scalability w.r.t the number of attributes • For Zipfian data time complexity is polynomial on number of attributes • Innovation • Formulate as a Cube computation problem • Interleave the computation with the discovery of non-keys • Novel Subsumption Pruning • Extra Apriori-like pruning • Sampling extensions for the discovery of high-quality approximate keys VLDB 2006
Questions? VLDB 2006