
Extending Relational Database Functionality with Data Inconsistency Resolution Support

Ilya Pevzner (pevzner@cs.nyu.edu) and Arthur Goldberg (artg@cs.nyu.edu), Department of Computer Science, Courant Institute, New York University


Presentation Transcript


  1. Extending Relational Database Functionality with Data Inconsistency Resolution Support
     Ilya Pevzner, pevzner@cs.nyu.edu
     Arthur Goldberg, artg@cs.nyu.edu
     Department of Computer Science, Courant Institute, New York University

  2. Inconsistency
     • Databases often contain information about real-world objects
     • When the data is collected (or measured) and entered into the database, errors are introduced
     • When the same object is measured more than once, inconsistent data values may result
     Ilya Pevzner, Arthur Goldberg

  3. Object Identification
     • Identification of records describing the same real-world object
     • If key values are inconsistent, object identification is not trivial and its results are uncertain
     • Also known as approximate matching, duplicate detection, and record linkage
     • An area with multiple successful techniques and the topic of a KDD-2003 workshop

  4. Inconsistency Resolution Problem
     • Given what is known about the world, find the “best” estimates for the values of the inconsistent attributes
     • Possible sources of knowledge about the world:
        a) The system designer or expert
        b) The end user
        c) The data
     • Inconsistency resolution is also called merging
     • Existing research is almost exclusively on a) and b)
        • No systematic techniques
     • Our work concentrates on c)

  5. Matching/Merging Example
     • Match using ID (trivial)
     • Merge using standardization

  6. Merging Uncertainty
     • Sometimes it is possible, but non-trivial, to tell which attribute value is best
     • In other cases, the answer is uncertain

  7. Research goals
     • Develop merging methodologies that rely on the analysis of the data
     • Extend relational databases with
        • An integrated model for representing matching and merging uncertainties
        • Integrated support for various matching and merging methodologies

  8. Uncertainty in Relational Databases
     • Semantics of nulls
        • E.g. J. Biskup. A foundation of Codd’s relational maybe-operations. ACM TODS, 8(4), December 1983.
     • Fuzzy databases
        • E.g. K. V. S. V. N. Raju and Arun K. Majumdar. Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM TODS, 13(2), June 1988.
     • Probabilistic relations
        • E.g. E. Zimanyi and A. Pirotte. Imperfect Information in Relational Databases. In Uncertainty Management in Information Systems, A. Motro and P. Smets, Eds., Kluwer Publ., 1997.

  9. Probabilistic relations overview
     • Probabilistic relations model uncertainty by adding truth probabilities to classic relations
        • E.g. tuple X is in the relation with probability P[X]
     • Each probabilistic relation is associated with a set of classic relations representing “possible worlds”, in each of which the outcome of every probabilistic choice is fixed
        • E.g. the probabilistic relation in the example above has two possible worlds: one with tuple X and one without
     • Relational operations are defined through the associated classic relations
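The possible-worlds reading above can be sketched in a few lines of Python. This is only an illustration: it assumes that tuple memberships are independent of each other, which the slides do not state, and the function name `possible_worlds` and the probability 0.7 are hypothetical.

```python
from itertools import product

def possible_worlds(prob_relation):
    """Enumerate (world, probability) pairs for a relation given as
    {tuple: membership probability}. ASSUMPTION: memberships are independent."""
    items = list(prob_relation.items())
    worlds = []
    for choices in product([True, False], repeat=len(items)):
        # A world is the set of tuples that were "chosen" to be present.
        world = frozenset(t for (t, _), keep in zip(items, choices) if keep)
        p = 1.0
        for (_, w), keep in zip(items, choices):
            p *= w if keep else (1.0 - w)
        worlds.append((world, p))
    return worlds

# Hypothetical relation with a single uncertain tuple X, P[X] = 0.7.
relation = {("X",): 0.7}
worlds = possible_worlds(relation)
for world, p in worlds:
    print(sorted(world), round(p, 3))
```

With one uncertain tuple the sketch yields exactly the two worlds the slide describes: one containing X and one without it.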

  10. Zimanyi’s Type-1 Probabilistic Relation
      • Definition
         • A type-1 probabilistic relation is a relation R with a supplementary attribute w(R, t) added to each tuple t, indicating the probability that t belongs to R

  11. Zimanyi’s Type-1 Probabilistic Relation Example
      • Probabilistic relation
      • Possible worlds (assuming unique(ID1) and unique(ID2)):
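The tables for this slide are not part of the transcript, so the rows below are hypothetical. The sketch only shows how the constraints unique(ID1) and unique(ID2) could prune the candidate worlds of a relation of (ID1, ID2) pairs. It does not attempt to assign probabilities to the surviving worlds, since the slides do not say how that is done.

```python
from itertools import combinations

def consistent_worlds(pairs):
    """Return every subset of (id1, id2) pairs in which no ID1 value
    and no ID2 value occurs more than once."""
    worlds = []
    for size in range(len(pairs) + 1):
        for subset in combinations(pairs, size):
            id1s = [a for a, _ in subset]
            id2s = [b for _, b in subset]
            if len(set(id1s)) == len(id1s) and len(set(id2s)) == len(id2s):
                worlds.append(set(subset))
    return worlds

# Hypothetical uncertain matches: record 1 may pair with "a" or "b",
# record 2 may pair with "b".
pairs = [(1, "a"), (1, "b"), (2, "b")]
worlds = consistent_worlds(pairs)
for world in worlds:
    print(sorted(world))
```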

  12. Probabilistic matching
      • Example: matching by name
      • The way w(R, t) is computed depends on the matching methodology
      • One such methodology is ChoiceMaker™
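As a rough illustration of how a matcher might produce w(R, t) for name pairs, the sketch below uses the string-similarity ratio from Python's standard `difflib` module. This is a stand-in chosen for brevity, not ChoiceMaker's method, and a similarity ratio is not a calibrated probability. The function names and sample names are hypothetical.

```python
from difflib import SequenceMatcher

def name_match_weight(a, b):
    """Similarity in [0, 1] between two names, used here as a stand-in
    for the match weight w(R, t). Not a calibrated probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_by_name(left, right):
    """Pair every name in `left` with every name in `right` and attach a weight."""
    return [(l, r, name_match_weight(l, r)) for l in left for r in right]

# Hypothetical names from two sources.
for l, r, w in match_by_name(["Jim Smith"], ["Jim Smyth", "Ann Lee"]):
    print(l, "|", r, "|", round(w, 3))
```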

  13. Zimanyi’s Type-2 Probabilistic Relation
      • Definition
         • A generalized relation in which attribute values can be probabilistic sets

  14. Zimanyi’s Type-2 Probabilistic Relation Example
      • Probabilistic relation
      • Possible worlds
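The example tables are missing from the transcript, so the following sketch uses hypothetical data. It represents each attribute value of a type-2 tuple as a probabilistic set, written as a dict from candidate value to probability, and expands the tuple into its possible classic tuples. It assumes the choices for different attributes are independent, which is an assumption of this sketch rather than something the slides state.

```python
from itertools import product

def expand_tuple(prob_tuple):
    """Expand one type-2 tuple, given as {attribute: {value: probability}},
    into (classic_row, probability) pairs. ASSUMPTION: attributes are independent."""
    attrs = list(prob_tuple)
    alternatives = [list(prob_tuple[a].items()) for a in attrs]
    rows = []
    for combo in product(*alternatives):
        row = {a: value for a, (value, _) in zip(attrs, combo)}
        p = 1.0
        for _, w in combo:
            p *= w
        rows.append((row, p))
    return rows

# Hypothetical person: the SSN is certain, the name is uncertain.
person = {"SSN": {"123": 1.0}, "NAME": {"James": 0.8, "Jimmy": 0.2}}
worlds = expand_tuple(person)
for row, p in worlds:
    print(row, p)
```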

  15. Probabilistic Merging Example
      • Data sources
      • Query:
         • List all people with their correct name and social security number
      • Execution plan:
         • Join using SSN (UID)
         • Merge names

  16. Probabilistic Merging Example: Result
      MERGE

  17. Merging Methodologies
      • Ad-hoc techniques
         • Standardization
            • E.g. convert both Jim and Jimmy to James
         • Pre-defined rules
            • E.g. use gender to pick Andrea and not Andrew
      • Machine learning
         • Supervised (e.g. MaxEnt)
            • Use experts to manually merge some data, then use it to train and validate
         • Unsupervised (e.g. dependency-based)
            • E.g. mine the data for dependencies, then use the dependencies to pick the best estimates
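The two ad-hoc techniques can be sketched as follows. The nickname table, the name-to-gender table, and all function names are hypothetical and exist only to mirror the slide's Jim/Jimmy/James and Andrea/Andrew examples. A real system would need far larger lookup tables.

```python
# Hypothetical lookup table: nickname (lowercase) -> standard form.
NICKNAMES = {"jim": "James", "jimmy": "James"}

def standardize(name):
    """Map a known nickname to its standard form; leave other names unchanged."""
    return NICKNAMES.get(name.lower(), name)

def merge_by_standardization(values):
    """Return the single standard form if all values agree after
    standardization, otherwise None (the conflict remains unresolved)."""
    canonical = {standardize(v) for v in values}
    return canonical.pop() if len(canonical) == 1 else None

def pick_by_gender(candidates, gender, name_gender):
    """Pre-defined rule: keep the one candidate whose usual gender
    (per the hypothetical table `name_gender`) equals the record's gender."""
    matching = [c for c in candidates if name_gender.get(c) == gender]
    return matching[0] if len(matching) == 1 else None

print(merge_by_standardization(["Jim", "Jimmy"]))
print(pick_by_gender(["Andrea", "Andrew"], "F", {"Andrea": "F", "Andrew": "M"}))
```

Both helpers return None when they cannot decide, which leaves room for the uncertain cases to be handled by a probabilistic merger instead.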

  18. SQL Extensions
      • The MATCH predicate
         • Uses a specified matching methodology to determine whether specified tuples describe the same object
      • The MERGE function
         • Uses a specified merging methodology to provide estimates for the values of specified attributes
      • The PROB function
         • Provides access to probabilities in type-1 and type-2 probabilistic relations

  19. The MATCH Predicate
      • Can be used in the WHERE clause of a SELECT statement
      • Takes the name of the matcher module and the tuples to be tested
      • Returns true if the tuples match with a probability exceeding the matcher threshold; otherwise returns false
      • SELECT statements with MATCH produce type-1 probabilistic relations

  20. MATCH Example
      • Data source relations
      • Query
           SELECT S1.NAME, S1.SSN, S2.PHONE
           FROM S1, S2
           WHERE MATCH('NAME_MATCHER', S1.NAME, S2.NAME)
      • Result
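The behaviour of the MATCH query can be emulated in plain Python to make the semantics concrete. Everything here is an assumption made for illustration: the matcher registry, the token-overlap matcher, the threshold of 0.5, the sample rows, and the choice to expose the match probability as a column `w` (suggested by the statement that MATCH queries produce type-1 relations). This is not the authors' implementation.

```python
def token_overlap(a, b):
    """Hypothetical matcher: Jaccard overlap of lowercase name tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

# Hypothetical registry: matcher name -> (scoring function, threshold).
MATCHERS = {"NAME_MATCHER": (token_overlap, 0.5)}

def select_with_match(matcher_name, s1, s2):
    """Emulate SELECT S1.NAME, S1.SSN, S2.PHONE FROM S1, S2 WHERE MATCH(...).
    Rows whose score exceeds the threshold are kept, with the score as `w`."""
    score, threshold = MATCHERS[matcher_name]
    result = []
    for r1 in s1:
        for r2 in s2:
            w = score(r1["NAME"], r2["NAME"])
            if w > threshold:
                result.append({"NAME": r1["NAME"], "SSN": r1["SSN"],
                               "PHONE": r2["PHONE"], "w": w})
    return result

# Hypothetical source relations.
S1 = [{"NAME": "James A Smith", "SSN": "001"}]
S2 = [{"NAME": "james smith", "PHONE": "555-0100"},
      {"NAME": "Ann Lee", "PHONE": "555-0101"}]
rows = select_with_match("NAME_MATCHER", S1, S2)
print(rows)
```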

  21. The MERGE function
      • May appear in the SELECT list
      • Accepts two parameters
         • Merger name
         • Merge list
      • Returns a table of the form (v, wf), where v is a value and wf is the corresponding probability
      • SELECT statements with MERGE produce type-2 probabilistic relations

  22. MERGE Example
      • Data sources
      • Query
           SELECT S1.SSN, MERGE('NAME_MERGER', (S1.NAME, S2.NAME)) AS NAME
           FROM S1, S2
           WHERE S1.SSN = S2.SSN
      • Result
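A minimal Python emulation of the MERGE query follows. The merger used here simply weights each candidate value by its relative frequency in the merge list. That is a placeholder chosen for illustration, not one of the methodologies the slides propose. The registry, function names, and sample rows are hypothetical, and the join assumes SSN is unique within S2.

```python
from collections import Counter

def vote_merger(values):
    """Hypothetical merger: return a (v, wf) table in which wf is the
    relative frequency of value v among the candidates."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, c / total) for v, c in counts.most_common()]

# Hypothetical registry: merger name -> merging function.
MERGERS = {"NAME_MERGER": vote_merger}

def merge(merger_name, merge_list):
    """Emulate MERGE(merger name, merge list)."""
    return MERGERS[merger_name](merge_list)

def select_with_merge(s1, s2):
    """Emulate the query above: join on SSN, then merge the two NAME values.
    ASSUMPTION: SSN is unique within s2."""
    by_ssn = {r["SSN"]: r for r in s2}
    return [{"SSN": r1["SSN"],
             "NAME": merge("NAME_MERGER", (r1["NAME"], by_ssn[r1["SSN"]]["NAME"]))}
            for r1 in s1 if r1["SSN"] in by_ssn]

# Hypothetical source relations.
S1 = [{"SSN": "001", "NAME": "Jim"}, {"SSN": "002", "NAME": "Ann"}]
S2 = [{"SSN": "001", "NAME": "James"}, {"SSN": "002", "NAME": "Ann"}]
result = select_with_merge(S1, S2)
print(result)
```

The NAME column of each result row holds a (v, wf) table rather than a single value, which is what makes the output a type-2 probabilistic relation.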

  23. Query Processing Diagram

  24. Interfaces

  25. Validating with real-world data
      • MEDLINE data set
         • Affiliation fields:
            • E-mail, Organization, Address
         • Statistics:
            • 2,391,822 affiliations
            • 523,140 matched by e-mail address
            • 182,892 with US addresses
            • 32,505 non-identical duplicates
      • Looking for other interesting data sets
         • Errors
         • Dependencies
         • Duplicates
         • More distinct items
         • More fields
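The slide does not say how the MEDLINE counts were produced. One plausible way to find non-identical duplicates among records matched by e-mail address is sketched below, assuming each affiliation is an (e-mail, organization, address) triple and that e-mail comparison ignores case and surrounding whitespace. The function name and sample records are hypothetical.

```python
from collections import defaultdict

def non_identical_duplicates(records):
    """Group (email, organization, address) records by normalized e-mail and
    return only the groups whose remaining fields differ."""
    groups = defaultdict(set)
    for email, organization, address in records:
        groups[email.strip().lower()].add((organization, address))
    # Exact repeats collapse inside the set, so only real conflicts remain.
    return {email: sorted(variants)
            for email, variants in groups.items() if len(variants) > 1}

# Hypothetical affiliation records.
records = [
    ("A@x.org", "NYU", "NY"),
    ("a@x.org", "New York University", "NY"),
    ("b@y.org", "MIT", "MA"),
    ("b@y.org", "MIT", "MA"),
]
dups = non_identical_duplicates(records)
print(dups)
```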

  26. Future plans
      • Consider several data sets
      • Develop several merging methodologies
      • Evaluate using real data, looking at
         • Performance
         • Merge quality
         • Usability

  27. Questions?
