
GORDIAN: Efficient and Scalable Discovery of Composite Keys


Presentation Transcript


1. GORDIAN: Efficient and Scalable Discovery of Composite Keys
Yannis Sismanis, Peter J. Haas, Berthold Reinwald, Paul Brown
IBM Almaden Research Center
VLDB 2006

2. Motivation
• Key discovery is of fundamental importance for:
  • Data Modeling
  • Data Integration
  • Anomaly Detection
  • Query Formulation & Optimization
  • Indexing
• Key information is often incomplete/unknown to the DBMS because:
  • It represents an unknown constraint/dependency inherent to the domain
  • Keys arise fortuitously from statistical properties of the data
  • Keys are exploited by the application without the DBMS explicitly knowing
  • Cost reasons force the DBA not to explicitly identify/enforce them
• The problem is computationally hard:
  • Finding the minimum key (the one with the least number of attributes) is NP-hard
  • Finding all the keys is #P-hard

3. GORDIAN in a Nutshell
• Good "typical-case" behavior on real-world datasets
  • Worst-case performance arises only on artificial, unrepresentative datasets
• Scales well with the number of attributes
  • For Zipfian data, the time complexity is polynomial in the number of attributes
• Basic idea:
  • Formulate key discovery as a cube computation
    • Many optimizations are possible, since we never store, index, or even fully compute the cube
  • Interleave the computation with the discovery of non-keys
    • A non-key can be discovered by looking at a subset of the entities
    • A non-key cannot be invalidated as more entities are examined
  • Since any subset of the attributes of a non-key is also a non-key, apply "Apriori"-like pruning techniques
  • Compute the complement of the non-keys, yielding the desired keys (see the sketch below)
• Sampling extensions for the discovery of high-quality approximate keys
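The complement step can be pictured with a small brute-force sketch (illustrative only; GORDIAN's actual complementation is far more efficient than enumerating all attribute subsets, and the attribute names below come from the deck's running example):

```python
from itertools import combinations

def minimal_keys(attributes, max_non_keys):
    """Complement step, brute force: a candidate attribute set is a key
    iff it is not contained in any maximal non-key; keep only the
    minimal such sets by enumerating candidates in order of size."""
    keys = []
    for r in range(1, len(attributes) + 1):
        for cand in map(frozenset, combinations(sorted(attributes), r)):
            if any(cand <= nk for nk in max_non_keys):
                continue  # contained in a non-key, hence itself a non-key
            if any(key <= cand for key in keys):
                continue  # a subset is already a key, so cand is not minimal
            keys.append(cand)
    return keys

# Maximal non-keys consistent with slide 4's example:
max_nks = [frozenset({"FirstName", "LastName"}), frozenset({"Phone"})]
attrs = {"FirstName", "LastName", "Phone", "EmpNo"}
print(minimal_keys(attrs, max_nks))
# -> the keys <EmpNo>, <FirstName, Phone>, <LastName, Phone>
```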

  4. Keys & Non-Keys • <EmpNo>, <Last Name,Phone>, <First Name,Phone>,… are keys • <First Name,Last Name>, <First Name>,… are non-keys • A non-key Kcovers a non-key K’ iff K’K • A set of non-keys {K1,K2,…} is non-redundant iff KiKj ,i≠j VLDB 2006

5. Cube Operator / Non-Keys
• If any group-by count in the cube is > 1, then the corresponding projection is a non-key (see the sketch below)
• Computing the whole cube and only then looking for non-keys is very inefficient
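The count > 1 test amounts to checking a projection for duplicates; a minimal sketch on illustrative tuples (the column names and values mirror the deck's running example, but the concrete rows are assumptions):

```python
from collections import Counter

def is_non_key(rows, attrs):
    """A projection is a non-key iff some projected tuple occurs more
    than once, i.e., some group-by count in the cube exceeds 1."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return any(c > 1 for c in counts.values())

rows = [
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 3478, "EmpNo": 10},
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 6791, "EmpNo": 50},
    {"FirstName": "Sally",   "LastName": "Kwan",     "Phone": 3478, "EmpNo": 20},
]
assert is_non_key(rows, ["FirstName", "LastName"])  # count 2 for (Michael, Thompson)
assert is_non_key(rows, ["Phone"])                  # 3478 appears twice
assert not is_non_key(rows, ["EmpNo"])              # all counts are 1
```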

6. Cube Slices/Segments
• A selection on the cube returns a "slice", e.g., FirstName = 'Michael'
• A slice consists of "segments"
• GORDIAN is based on a slice-by-slice computation

7. Slice Subsumption & Non-Keys
[Figure: slice F (First Name = 'Michael') and slice L (Last Name = 'Thompson')]
• A slice L is "subsumed" by a slice F (written L ⊑ F) iff all the segments of L contain only tuples that can be extracted from the tuples of F by just projecting out some attributes
• Subsumption Lemma: if L ⊑ F, then every non-key of L is redundant to some non-key of F
  • E.g., <LastName> is redundant to <FirstName, LastName>

8. GORDIAN Overview

9. Prefix tree
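A minimal sketch of one way to build such a prefix tree over the example tuples (nested dicts; this representation is an assumption for illustration, not the paper's data structure):

```python
def build_prefix_tree(rows, attr_order):
    """Insert each tuple as a root-to-leaf path; tuples sharing a
    prefix of attribute values share the corresponding nodes."""
    root = {}
    for row in rows:
        node = root
        for attr in attr_order:
            node = node.setdefault(row[attr], {})
    return root

rows = [
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 3478, "EmpNo": 10},
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 6791, "EmpNo": 50},
]
tree = build_prefix_tree(rows, ["FirstName", "LastName", "Phone", "EmpNo"])
# -> {'Michael': {'Thompson': {3478: {10: {}}, 6791: {50: {}}}}}
```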

10. Slice-by-Slice Computation
• A slice corresponds to a path from the root to a node
  • E.g., FirstName = 'Michael' and LastName = 'Thompson' corresponds to the path (1), (2)
• The segments of the slice are computed by "merging" sub-trees (next slide)
• Slice-by-slice computation = systematic traversal of the prefix tree (see the sketch below)
• The traversal discovers subsumed slices
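A minimal sketch of the systematic traversal, enumerating slices as root-to-node paths (depth-first, over the nested-dict representation assumed in the prefix-tree sketch above):

```python
def traverse_slices(node, path=()):
    """Depth-first enumeration of slices: each root-to-node path is one
    slice; GORDIAN would compute segments (and check for subsumption)
    at each node before descending further."""
    yield path  # the slice selected by the attribute values on the path so far
    for value, child in node.items():
        yield from traverse_slices(child, path + (value,))

# Using `tree` from the prefix-tree sketch:
# list(traverse_slices(tree)) -> [(), ('Michael',), ('Michael', 'Thompson'), ...]
```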

11. Compute Segments by "Merging"
• Slice: FirstName = 'Michael' and LastName = 'Thompson'
  • Path (1), (2) to node (3)
• Compute the segment <EmpNo> by merging the children trees of node (3)
[Figure: prefix-tree path (1) Michael [First Name], (2) Thompson [Last Name], (3) with children (4) 3478 and (5) 6791 [Phone]; their EmpNo leaves 10 and 50, each with count 1, merge into segment (M1)]

12. GORDIAN in Action
[Figure: prefix tree over First Name {Michael, Sally}, Last Name {Thompson, Spencer, Kwan}, Phone {3478, 6791, 5237, 3478}, and EmpNo {10, 50, 90, 20}; successive merges (M1) through (M7) of sibling sub-trees compute the segments and expose the non-keys <First Name, Last Name>, <Phone>, and <First Name>]
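A minimal sketch of the merge operation on the (value, count) leaves shown in slides 11 and 12 (the Counter-based representation is an assumption for illustration):

```python
from collections import Counter

def merge_leaf_counts(leaf_maps):
    """Compute a segment by merging the (value -> count) leaf maps of
    sibling sub-trees; any merged count > 1 exposes a non-key."""
    merged = Counter()
    for leaves in leaf_maps:
        merged.update(leaves)
    return merged

# EmpNo leaves of nodes (4) and (5) from slide 11, each with count 1:
segment = merge_leaf_counts([Counter({10: 1}), Counter({50: 1})])
assert all(c == 1 for c in segment.values())  # <EmpNo> not refuted in this slice

# Merging at the Phone level across the whole tree would give the value
# 3478 a count of 2, exposing <Phone> as a non-key.
```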

13. Subsumption in Prefix-Trees
• Subsumption appears because of:
  • Correlations: previously traversed references (Ref1), (Ref2) point to subsumed slices
  • Sparsity: a root-to-node path points to a sparse area with a small number of tuples
[Figure: (a) Correlation, (b) Sparsity]

14. Complexity
• Zipfian datasets
• Only singleton pruning, due to sparsity
• No correlations (a conservative assumption)
• Time complexity: [formula omitted in the transcript], where s is the number of non-redundant non-keys, d the number of attributes, T the number of entities, and θ the skew of the data

15. Sampling
• A key for the whole dataset is a key in any sample
  • No missing keys
  • False "positives" are possible
• Discovery of approximate keys
  • Strength of an approximate key: [formula omitted in the transcript]
  • Tight lower bound on the strength: [formula omitted in the transcript], where N is the sample size and Dv the distinct count
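The slide's exact strength and lower-bound formulas did not survive extraction; as a sketch of the quantities involved, the estimator below is the naive point estimate Dv / N, an assumption for illustration rather than the paper's tight bound:

```python
def naive_strength_estimate(sample_rows, attrs):
    """Naive point estimate of an approximate key's strength on a
    sample: Dv / N, where Dv is the number of distinct projected
    values in the sample and N is the sample size. A value of 1.0
    means the projection is a key within the sample, which may still
    be a false positive for the full dataset."""
    n = len(sample_rows)
    dv = len({tuple(r[a] for a in attrs) for r in sample_rows})
    return dv / n

sample = [
    {"FirstName": "Michael", "LastName": "Thompson"},
    {"FirstName": "Michael", "LastName": "Spencer"},
    {"FirstName": "Sally",   "LastName": "Kwan"},
]
assert naive_strength_estimate(sample, ["FirstName"]) < 1.0   # non-key in the sample
assert naive_strength_estimate(sample, ["LastName"]) == 1.0   # key in this sample
```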

16. Experiment Setup
• GORDIAN implemented as a UDF for DB2 v8.2
• Performance comparison w.r.t.:
  • Number of entities
  • Number of attributes
• Pruning evaluation
• Accuracy evaluation (approximate keys)

17. Time Comparison

18. Attribute Scalability

19. Pruning Effect

20. Accuracy Evaluation (Sampling)

21. Conclusions
• GORDIAN has excellent "typical-case" behavior on real-world datasets
• Very good scalability w.r.t. the number of attributes
  • For Zipfian data, the time complexity is polynomial in the number of attributes
• Innovations:
  • Key discovery formulated as a cube computation problem
  • The computation interleaved with the discovery of non-keys
  • Novel subsumption pruning
  • Extra Apriori-like pruning
  • Sampling extensions for the discovery of high-quality approximate keys

22. Questions?
