270 likes | 407 Views
Extending Relational Database Functionality with Data Inconsistency Resolution Support. Ilya Pevzner, pevzner@cs.nyu.edu Arthur Goldberg artg@cs.nyu.edu Department of Computer Science Courant Institute New York University. Inconsistency.
E N D
Extending Relational Database Functionality with Data Inconsistency Resolution Support Ilya Pevzner, pevzner@cs.nyu.edu Arthur Goldberg artg@cs.nyu.edu Department of Computer Science Courant Institute New York University
Inconsistency • Databases often contain information about real world objects • When the data is collected and entered in the database (or measured), errors are introduced • When the same object is measured more than once, inconsistent data values may result Ilya Pevzner, Arthur Goldberg
Object Identification • Identification of records describing the same real-world object • If key values are inconsistent values, object identification is not trivial and its results are uncertain • Also known as approximatematching, duplicate detection and record linkage • Area with multiple successful techniques, topic of KDD-2003 workshop Ilya Pevzner, Arthur Goldberg
Inconsistency Resolution Problem • Given what is known about the world, find the “best” estimates for values of the inconsistent attributes • Possible sources of the knowledge about the world: • The system designer or expert • The end user • The data • Inconsistency resolution is also called merging • Existing research is almost exclusively on a) and b) • No systematic techniques • Our work concentrates on c) Ilya Pevzner, Arthur Goldberg
Matching/Merging Example • Match using ID (trivial) • Merge using standardization Ilya Pevzner, Arthur Goldberg
Merging Uncertainty • Sometimes it is possible, but non-trivial, to tell which attribute value is best • In other cases, the answer is uncertain Ilya Pevzner, Arthur Goldberg
Research goals • Develop merging methodologies that rely on the analysis of the data • Extend relational databases with • Integrated model for representing matching and merging uncertainties • Integrated support for various matching and merging methodologies Ilya Pevzner, Arthur Goldberg
Uncertainty in Relational Databases • Semantics of Nulls • E.g. J. Biskup. A foundation of Codd’s relational maybe-operations. ACM TODS, 8(4), December 1993. • Fuzzy databases • E.g. K. V. S. V. N. Raju and Arun K. Majumdar. Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM TODS, 13(2), June 1988. • Probabilistic relations • E.g. E Zimanyi and A. Pirotte. Imperfect Information in Relational Databases. In Uncertainty Management in Information Systems, A. Motro and P. Smets, Eds., Kulwer Publ., 1997. Ilya Pevzner, Arthur Goldberg
Probabilistic relations overview • Probabilistic relations model uncertainty with truth probabilities added to classic relations • E.g. tuple X is in relation with probability P[X] • Each probabilistic relation is associated with a set of classic relations representing “possible worlds” where the collection of outcomes for each probabilistic choice is fixed • E.g. the probabilistic relation with the probabilistic choice in the above example will have two possible worlds – one with tuple X and one without • Relational operations are defined through the associated classic relations Ilya Pevzner, Arthur Goldberg
Zimanyi’sType-1 Probabilistic Relation • Definition • A type-1 probabilistic relation is a relation R with a supplementary attribute w(R, t) addedto each tuple t indicating the probability that a tuple t belongs to relation R Ilya Pevzner, Arthur Goldberg
Zimanyi’sType-1 Probabilistic Relation Example • Probabilistic relation • Possible worlds (assuming unique(ID1) and unique(ID2)): Ilya Pevzner, Arthur Goldberg
Probabilistic matching • Example: matching by name • The way w(R, t)is computed depends on the matching methodology • An example of such methodology is ChoiceMaker™ Ilya Pevzner, Arthur Goldberg
Zimanyi’sType-2 Probabilistic Relation • Definition • Generalized relation in which attribute values can be probabilistic sets Ilya Pevzner, Arthur Goldberg
Zimanyi’sType-2 Probabilistic Relation Example • Probabilistic relation • Possible worlds Ilya Pevzner, Arthur Goldberg
Probabilistic Merging Example • Data sources • Query: • List all people with the their correct name and social security number • Execution plan: • Join using SSN (UID) • Merge names Ilya Pevzner, Arthur Goldberg
Probabilistic Merging Example: Result MERGE Ilya Pevzner, Arthur Goldberg
Merging Methodologies • Ad-hoc techniques • Standardization • E.g. convert both Jim and Jimmy to James • Pre-defined rules • E.g. use gender to pick Andrea and not Andrew • Machine Learning • Supervised (e.g. MaxEnt) • Use experts to manually merge some data, use it to train and validate • Unsupervised (e.g. dependency-based) • E.g. Mine data for dependencies, use dependencies to pick the best estimates Ilya Pevzner, Arthur Goldberg
SQL Extensions • The MATCH predicate • Uses a specified matching methodology to determine if specified tuples describe the same object • The MERGE function • Uses a specified merging methodology to provide estimates for values of specified attributes • The PROB function • Provides access to probabilities in type-1 and type-2 probabilistic relations Ilya Pevzner, Arthur Goldberg
The MATCH Predicate • Can be used in the WHERE clause of SELECT statement • Takes the name of the matcher module and the tuples to be tested • Returns true if the tuples match with probability exceeding the matcher threshold. Otherwise, returns false • SELECT statements with MATCH produce type-1 probabilistic relations Ilya Pevzner, Arthur Goldberg
MATCH Example • Data source relations • Query SELECT S1.NAME, S1.SSN, S2.PHONE FROM S1,S2 WHERE MATCH(‘NAME_MATCHER’,S1.NAME,S2.NAME) • Result Ilya Pevzner, Arthur Goldberg
The MERGE function • May appear in SELECT list • Accepts two parameters • Merger name • Merge list • Returns a table of the form (v, wf) where v is a value and wfis the corresponding probability • SELECT statements with MERGE produce type-2 probabilistic relations Ilya Pevzner, Arthur Goldberg
MERGE Example • Data sources • Query SELECT S1.SSN, MERGE(‘NAME_MERGER’, (S1.NAME, S2.NAME)) AS NAME FROM S1, S2 WHERE S1.SSN=S2.SSN • Result Ilya Pevzner, Arthur Goldberg
Query Processing Diagram Ilya Pevzner, Arthur Goldberg
Interfaces Ilya Pevzner, Arthur Goldberg
Validating with real-world data • MEDLINE data set • Affiliation Fields: • E-mail, Organization, Address • Statistics: • 2,391,822 affiliations • 523,140 matched by e-mail address • 182,892 with US addresses • 32,505 non-identical duplicates • Looking for other interesting data sets • Errors • Dependencies • Duplicates • More distinct items • More Fields Ilya Pevzner, Arthur Goldberg
Future plans • Consider several data sets • Develop several merging methodologies • Evaluate using real data and looking at • Performance • Merge Quality • Usability Ilya Pevzner, Arthur Goldberg
Questions • ? Ilya Pevzner, Arthur Goldberg