220 likes | 241 Views
GORDIAN: Efficient and Scalable Discovery of Composite Keys. Yannis Sismanis Peter J. Haas Berthold Reinwald Paul Brown IBM Almaden Research Center. Motivation. Key discovery is of fundamental importance for Data Modeling Data Integration Anomaly Detection
E N D
GORDIAN: Efficient and Scalable Discovery of Composite Keys Yannis Sismanis Peter J. Haas Berthold Reinwald Paul Brown IBM Almaden Research Center VLDB 2006
Motivation • Key discovery is of fundamental importance for • Data Modeling • Data Integration • Anomaly Detection • Query Formulation & Optimization • Indexing • Key information is often incomplete/unknown to the DBMS because: • Represents an unknown “constraint/dependency” inherent to the domain • Keys arise fortuitously from statistical properties of the data • Exploited by the application without the DBMS explicitly knowing • Cost reasons force the DBA not to explicitly identify/enforce it • Computationally Hard • Finding the minimum (least number of attributes) key: NP-Hard • Finding all the keys: #P-Hard VLDB 2006
GORDIAN in a Nutshell • Good “typical-case” behavior on real-world datasets • Artificial and unrepresentative datasets used for worst-case performance • Scales well with the number of attributes • For Zipfian data time complexity is polynomial on number of attributes • Basic idea • Cube computation • Many optimizations since we do not deal with storing/indexing or even fully computing the cube • Interleave the computation with the discovery of non-keys • A non-key can be discovered by looking at a subset of the entities • A non-key cannot be invalidated as more entities are examined • Since any subset of the attributes of a non-key is also a non-key apply “Apriori”-like pruning techniques • Compute the complement yielding the desired keys • Sampling extensions for the discovery of high-quality approximate keys VLDB 2006
Keys & Non-Keys • <EmpNo>, <Last Name,Phone>, <First Name,Phone>,… are keys • <First Name,Last Name>, <First Name>,… are non-keys • A non-key Kcovers a non-key K’ iff K’K • A set of non-keys {K1,K2,…} is non-redundant iff KiKj ,i≠j VLDB 2006
Cube Operator / NoKeys … • If any count >1 then the corresponding projection is a non-key • Computing the whole cube and then start looking for non-keys is very inefficient VLDB 2006
segments … Cube Slices/Segments • A selection on the cube returns a “slice” • i.e. FirstName=‘Michael’ • A slice consists of “segments” • GORDIAN is based on a slice-by-slice computation … Slice FirstName=‘Michael’ VLDB 2006
Slice Subsumption & Non-keys Slice F : First Name = ‘Michael’ … … Slice L : Last Name=‘Thompson’ • A slice is L “subsumed” by another slice F (LF) iff all the segments of L contain only tuples that can be extracted from the tuples of F by just projecting out some attributes • Subsumption Lemma: If LF then every non-key of L is redundant to some non-key of F • <LastName> is redundant to <FirstName,LastName> VLDB 2006
GORDIAN Overview VLDB 2006
Prefix tree VLDB 2006
Slice-by-Slice Computation • Slice corresponds to a path from the root to a node • I.e. FirstName=“Michael” and LastName=“Thompson” corresponds to the path (1),(2) • Segments of the slice are computing by “merging” sub-trees (next slide) • Slice-by-Slice computation = Systematic Traversal of the prefix Tree • Discovers subsumed slices VLDB 2006
(M1) 1 10 1 50 Compute segments by “merging” • Slice FirstName=‘Michael’ and LastName=‘Thompson’ • Path (1),(2) to node (3) • Compute segment <EmpNo> by merging the children trees of (3) (1) Michael First Name [0] (2) Thompson Last Name [1] (3) 3478 6791 Phone [2] (4) (5) 1 10 1 50 Emp No [3] VLDB 2006
After merging nodes (4) and (5): (M1) 10 1 50 1 After merging nodes (4), (5) and (7): (M3) 10 1 50 1 90 1 After merging nodes (2) and (8): (M4) Thompson Spencer Kwan (3) (9) (6) After merging nodes (3) and (6): After merging nodes (3),(6) and (9): (M5) 3478 6791 5237 (M2) 3478 6791 5237 (M6) 10 1 20 1 (5) (7) (4) (5) (7) After merging nodes (M6), (5) and (7): <First Name> (M7) 10 1 20 1 50 1 90 1 GORDIAN in action (1) Michael Sally First Name [0] (2) (8) Thompson Spencer Kwan Last Name [1] (3) (6) (9) 3478 6791 5237 3478 Phone [2] (4) (5) (7) (10) 1 1 1 10 1 50 90 20 Emp No [3] <First Name, Last Name> <Phone> VLDB 2006
Subsumption in Prefix-Trees • Subsumption appears because of • Correlations • (Ref1), (Ref2) have been previously traversed and point to subsumed slices • Sparsity • A root-to-node path points to a sparse area with a small number of tuples (a) Correlation (b) Sparsity VLDB 2006
Complexity • Zipfian Datasets • Singleton pruning only due to sparsity • No correlations (conservative approach) • Time complexity: Where:s the number of non-redundant non-keysd the number of attributesT the number entities andθ the skew of the data VLDB 2006
Sampling • A key for the whole dataset is a key in any sample • No missing keys • False “positives” • Discovery of approximate keys • Strength of an approximate key = • Tight Lower Bound on Strength : where:N is the sample sizeDv the distinct count VLDB 2006
Experiment Setup • GORDIAN implemented as a UDF for DB2 v8.2 • Performance comparison w.r.t • Number of Entities • Number of Attributes • Pruning Evaluation • Accuracy Evaluation (Approximate keys) VLDB 2006
Time Comparison VLDB 2006
Attribute Scalability VLDB 2006
Pruning Effect VLDB 2006
Accuracy Evaluation (Sampling) VLDB 2006
Conclusions • GORDIAN has excellent “typical-case” behavior on real-world datasets • Very good scalability w.r.t the number of attributes • For Zipfian data time complexity is polynomial on number of attributes • Innovation • Formulate as a Cube computation problem • Interleave the computation with the discovery of non-keys • Novel Subsumption Pruning • Extra Apriori-like pruning • Sampling extensions for the discovery of high-quality approximate keys VLDB 2006
Questions? VLDB 2006