Extending Relational Database Functionality with Data Inconsistency Resolution Support

Extending Relational Database Functionality with Data Inconsistency Resolution Support Ilya Pevzner, pevzner@cs.nyu.edu Arthur Goldberg artg@cs.nyu.edu Department of Computer Science Courant Institute New York University

Inconsistency • Databases often contain information about real world objects • When the data is collected and entered in the database (or measured), errors are introduced • When the same object is measured more than once, inconsistent data values may result Ilya Pevzner, Arthur Goldberg

Object Identification • Identification of records describing the same real-world object • If key values are inconsistent values, object identification is not trivial and its results are uncertain • Also known as approximatematching, duplicate detection and record linkage • Area with multiple successful techniques, topic of KDD-2003 workshop Ilya Pevzner, Arthur Goldberg

Inconsistency Resolution Problem • Given what is known about the world, find the “best” estimates for values of the inconsistent attributes • Possible sources of the knowledge about the world: • The system designer or expert • The end user • The data • Inconsistency resolution is also called merging • Existing research is almost exclusively on a) and b) • No systematic techniques • Our work concentrates on c) Ilya Pevzner, Arthur Goldberg

Matching/Merging Example • Match using ID (trivial) • Merge using standardization Ilya Pevzner, Arthur Goldberg

Merging Uncertainty • Sometimes it is possible, but non-trivial, to tell which attribute value is best • In other cases, the answer is uncertain Ilya Pevzner, Arthur Goldberg

Research goals • Develop merging methodologies that rely on the analysis of the data • Extend relational databases with • Integrated model for representing matching and merging uncertainties • Integrated support for various matching and merging methodologies Ilya Pevzner, Arthur Goldberg

Uncertainty in Relational Databases • Semantics of Nulls • E.g. J. Biskup. A foundation of Codd’s relational maybe-operations. ACM TODS, 8(4), December 1993. • Fuzzy databases • E.g. K. V. S. V. N. Raju and Arun K. Majumdar. Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM TODS, 13(2), June 1988. • Probabilistic relations • E.g. E Zimanyi and A. Pirotte. Imperfect Information in Relational Databases. In Uncertainty Management in Information Systems, A. Motro and P. Smets, Eds., Kulwer Publ., 1997. Ilya Pevzner, Arthur Goldberg

Probabilistic relations overview • Probabilistic relations model uncertainty with truth probabilities added to classic relations • E.g. tuple X is in relation with probability P[X] • Each probabilistic relation is associated with a set of classic relations representing “possible worlds” where the collection of outcomes for each probabilistic choice is fixed • E.g. the probabilistic relation with the probabilistic choice in the above example will have two possible worlds – one with tuple X and one without • Relational operations are defined through the associated classic relations Ilya Pevzner, Arthur Goldberg

Zimanyi’sType-1 Probabilistic Relation • Definition • A type-1 probabilistic relation is a relation R with a supplementary attribute w(R, t) addedto each tuple t indicating the probability that a tuple t belongs to relation R Ilya Pevzner, Arthur Goldberg

Zimanyi’sType-1 Probabilistic Relation Example • Probabilistic relation • Possible worlds (assuming unique(ID1) and unique(ID2)): Ilya Pevzner, Arthur Goldberg

Probabilistic matching • Example: matching by name • The way w(R, t)is computed depends on the matching methodology • An example of such methodology is ChoiceMaker™ Ilya Pevzner, Arthur Goldberg

Zimanyi’sType-2 Probabilistic Relation • Definition • Generalized relation in which attribute values can be probabilistic sets Ilya Pevzner, Arthur Goldberg

Zimanyi’sType-2 Probabilistic Relation Example • Probabilistic relation • Possible worlds Ilya Pevzner, Arthur Goldberg

Probabilistic Merging Example • Data sources • Query: • List all people with the their correct name and social security number • Execution plan: • Join using SSN (UID) • Merge names Ilya Pevzner, Arthur Goldberg

Probabilistic Merging Example: Result MERGE Ilya Pevzner, Arthur Goldberg

Merging Methodologies • Ad-hoc techniques • Standardization • E.g. convert both Jim and Jimmy to James • Pre-defined rules • E.g. use gender to pick Andrea and not Andrew • Machine Learning • Supervised (e.g. MaxEnt) • Use experts to manually merge some data, use it to train and validate • Unsupervised (e.g. dependency-based) • E.g. Mine data for dependencies, use dependencies to pick the best estimates Ilya Pevzner, Arthur Goldberg

SQL Extensions • The MATCH predicate • Uses a specified matching methodology to determine if specified tuples describe the same object • The MERGE function • Uses a specified merging methodology to provide estimates for values of specified attributes • The PROB function • Provides access to probabilities in type-1 and type-2 probabilistic relations Ilya Pevzner, Arthur Goldberg

The MATCH Predicate • Can be used in the WHERE clause of SELECT statement • Takes the name of the matcher module and the tuples to be tested • Returns true if the tuples match with probability exceeding the matcher threshold. Otherwise, returns false • SELECT statements with MATCH produce type-1 probabilistic relations Ilya Pevzner, Arthur Goldberg

MATCH Example • Data source relations • Query SELECT S1.NAME, S1.SSN, S2.PHONE FROM S1,S2 WHERE MATCH(‘NAME_MATCHER’,S1.NAME,S2.NAME) • Result Ilya Pevzner, Arthur Goldberg

The MERGE function • May appear in SELECT list • Accepts two parameters • Merger name • Merge list • Returns a table of the form (v, wf) where v is a value and wfis the corresponding probability • SELECT statements with MERGE produce type-2 probabilistic relations Ilya Pevzner, Arthur Goldberg

MERGE Example • Data sources • Query SELECT S1.SSN, MERGE(‘NAME_MERGER’, (S1.NAME, S2.NAME)) AS NAME FROM S1, S2 WHERE S1.SSN=S2.SSN • Result Ilya Pevzner, Arthur Goldberg

Query Processing Diagram Ilya Pevzner, Arthur Goldberg

Interfaces Ilya Pevzner, Arthur Goldberg

Validating with real-world data • MEDLINE data set • Affiliation Fields: • E-mail, Organization, Address • Statistics: • 2,391,822 affiliations • 523,140 matched by e-mail address • 182,892 with US addresses • 32,505 non-identical duplicates • Looking for other interesting data sets • Errors • Dependencies • Duplicates • More distinct items • More Fields Ilya Pevzner, Arthur Goldberg

Future plans • Consider several data sets • Develop several merging methodologies • Evaluate using real data and looking at • Performance • Merge Quality • Usability Ilya Pevzner, Arthur Goldberg

Questions • ? Ilya Pevzner, Arthur Goldberg

Extending Relational Database Functionality with Data Inconsistency Resolution Support

Extending Relational Database Functionality with Data Inconsistency Resolution Support

Presentation Transcript

The Relational Data Model and Relational Database Constraints

The Relational Data Model and Relational Database Constraints

Relational Database Design

Relational Database Design

Relational Database Systems

Relational Database

Extending Database Usability

Relational Database

The Relational Data Model and Relational Database Constraints

Relational Database

2004: Extending our Lead with New Functionality

Relational Database vs. Data Files

Erlang refactoring with relational database

Database Support for Temporal 3D Data: Extending the GeoToolKit

Relational Database

WP3.4 Reasoning with Inconsistency

Extending Relational Databases

Extending a Relational Database Intelligent storage of geospatial data John.Pickford@uk.ibm

Relational Database Internals

The Relational Data Model and Relational Database Constraints

Database Systems The Relational Data Model

The Relational Data Model and Relational Database Constraints