Improving Data Quality: Methodology and Techniques
Telcordia Contact: Paolo Missier, paolo@research.telcordia.com, February 2001
Telcordia Technologies Proprietary – Internal Use Only. This document contains proprietary information that shall be distributed, routed or made available only within Telcordia Technologies, except with written permission of Telcordia Technologies. An SAIC Company
Presentation outline • Rationale for Data Quality • Canonical horror stories plus some good numbers • Process and Root Cause Analysis for quality improvement • Data error types • Data Cleaning • The record matching problem: searching and matching • Distance functions • Approaches to record matching • Example: the Italian Ministry of Finance case study • Analysis toolkits • The INRIA extensible framework • The Telcordia extensible data analysis tool
Data Quality in the telecom industry • Large US phone companies suffer from various DQ-related problems: • Dispatching errors cost money: e.g. sending the wrong technician to fix a problem, or the right technician to the wrong place • It is estimated that a 1% reduction in dispatching errors would result in $3M savings/year • For at least one application, one telco threw away 40% of the data in the DB because it was too expensive to use or to clean • Customers may receive service before they are entered in the billing system, so they get months' worth of service for free • Bad provisioning leads to wasted resources (lines allocated to no one) and to allocation of the same lines to multiple customers • The same company cannot tell which wireline customers also subscribe to its own wireless service (the two customer DBs are not integrated)
Data Quality in the Italian Public Administration • Taxpayers’ residence data can be unreliable (out of date or incorrect): a sizable fraction of taxpayers cannot be reached by regular mail • Administrative offices may force input values that violate data constraints • Some taxpayers are assigned more than one SSN, resulting in multiple personae for tax purposes • Taxpayers can be misidentified in the DB because their personal data is incorrect • Part of the information provided by taxpayers in their tax returns is lost and unused once it flows into the system
Data Contains Errors Expected error rates by industry segment Source: Innovative Systems, 1999
The main DQ agenda • Top-level agenda: • Correct current known errors in the data: Data Cleaning • Techniques are borrowed from Statistics, Machine Learning, Text Retrieval, large Databases • Determine the causes of errors and prevent their recurrence: root cause analysis and process analysis • Specifically: • Assess the current state of the data, for various quality dimensions • Predict the future state of the data (what gets dirty, and how fast) • Define improvement goals (estimate return on investment (ROI) figures) • Identify priorities (which data gives the highest ROI after cleaning?) • E.g. "the wireline DB is dirty, but its data is less important than wireless…" • "Given a limited budget, cleaning wireless is simpler and more beneficial" • Define a data cleaning plan (methods, tools, extent) Ref.: P. Missier, G. Lalk, V. Verykios, F. Grillo, T. Lorusso, P. Angeletti, Improving Data Quality in Practice: A Case Study in the Italian Public Administration, Distributed and Parallel Databases Journal, Kluwer Academic Publishers, 2001, to appear
Data Quality starts from process analysis • "What processes manipulate which data?" • Without data flow analysis, data cleaning is only a temporary solution: new data will still be dirty and will contaminate the system • Enforcing Data Quality requires enforcing control over the data flows • Suppose data flows F1 and F2 carry logically related data: • When and where in the system do the flows meet? Which procedures are involved? • Which entity is responsible for which data? How is data supplied? • How do duplicates propagate? Are related entries cross-validated? • Is it too late to cross-reference the related data? Can they be matched earlier? How much would it cost?
Data Flow and Process Analysis • Rationale: • Unpredictability in Data Quality is a consequence of poor control and documentation over the processes that manipulate the data • Questions: • Who can update which data value? On which version? When do synchronization and validation occur? • Goals: • Identify the processes and systems that manipulate the data that is subject to quality control • Determine how that data flows through those processes and systems
Root Cause Analysis: Identify causes of poor data quality • Data-oriented process analysis. Ex.: • When data flows F1 and F2 meet at time t, mismatches about related data records are found. Can data be matched earlier? How much would it cost? • How many times is this address entered, and into how many systems? Are related entries cross-checked? • Mine data on current state of quality. Infer conclusions such as: • “most errors are due to manual data entry, and mostly come from processing centers with specific characteristics” • “systematic data currency errors in system B occur for data that is derived from system A” • “There is a strong correlation between this data being entered late, and the number of spelling errors it contains” • “Quality levels for subscribers’ address data have improved by x% on average after processes A,B,C were re-engineered”
Pub. Admin. System Workflow / System Transactions (diagram): real-world state changes (A → A') propagate, via official paperwork and administrative procedures, into state changes of the Public Administration Information System (multiple DBs), through direct and indirect triggers at times t1, t2. Real-world events propagate to the Information System through various processes.
Indicators • Quantities that describe the degree of correctness and currency of a population of data values, e.g. • % of values that are incorrect • % of values that are obsolete • When Computed: • at fixed intervals (scheduled monitoring) • on demand • when updates are available • Computed based on: sample or entire population • How: • aggregate over the validation state associated with the data
Indicators example • I1: % duplicate SSNs • I2: % obsolete residence data • I3: % truncated values in field xyz • I4: % null values in field xyz • I5: % gender values inconsistent with SSN • Given an attribute A with values in {a1, …, an}, indicators can also be aggregated on A (e.g. A = set of administrative regions), e.g. • I1,i: % duplicate SSNs for group ai, 1 ≤ i ≤ n
Quality Constraints • Constraints are predicates over the indicators, e.g. • I2 < Y at the end of each month • I5 < Z for values generated before Dec. 31, 1997 • … • Aggregated on A = {a1, …, an}: • I1,i < Xi with 1 ≤ i ≤ n, where Xi may differ for each ai • Parameters in the constraints are usually subject to negotiation between Data Provider and Data Customer • Trade-off: cost of constraint satisfaction vs. resulting data quality • more stringent constraints => more expensive, better quality
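The indicator and constraint machinery above can be sketched in a few lines. The record layout, indicator names, and thresholds below are illustrative, not taken from the actual system:

```python
# Sketch: compute quality indicators over a record population and check
# constraints of the form "indicator < limit". Field names are hypothetical.
from collections import Counter

def pct(count, total):
    """Indicator value as a percentage of the population."""
    return 100.0 * count / total if total else 0.0

def compute_indicators(records):
    total = len(records)
    ssn_counts = Counter(r["ssn"] for r in records if r["ssn"])
    duplicated = sum(c for c in ssn_counts.values() if c > 1)
    return {
        "I1_pct_duplicate_ssn": pct(duplicated, total),
        "I4_pct_null_address": pct(sum(1 for r in records if not r["address"]), total),
    }

def check_constraints(indicators, limits):
    """Each constraint is the predicate 'indicator < limit'."""
    return {name: indicators[name] < limit for name, limit in limits.items()}

records = [
    {"ssn": "A123", "address": "via Toti 5"},
    {"ssn": "A123", "address": None},          # duplicate SSN, null address
    {"ssn": "B456", "address": "piazza Roma 1"},
    {"ssn": "C789", "address": "corso Italia 9"},
]
ind = compute_indicators(records)
ok = check_constraints(ind, {"I1_pct_duplicate_ssn": 10.0,
                             "I4_pct_null_address": 50.0})
```

In practice the indicators would be computed per aggregation group (e.g. per administrative region) and on a schedule, as described above.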
Classification of data error types • Syntactic correctness: the value is legal and conforms to format constraints. Ex.: • a multi-part address (town, street, zip code, etc.) • internal consistency of a credit card number • misspelling errors • Semantic correctness: "how well does the data represent the true state of the world?" • Ex.: the residence data for a citizen is semantically correct if the address in the record corresponds, to some degree, to the address where the citizen currently resides • Format and Value Consistency: how many acceptable formats and spellings exist for the same data item (think of all the known abbreviations in street addresses)? • Accuracy: we can measure how much the format and spelling of a street name deviate from a defined standard • Value Completeness and Precision: ex.: how many optional elements are present for an address? • Currency: is the data obsolete? Does it represent the current state of the world?
Correcting data errors: Data Cleaning • Data Cleaning is the process of correcting data errors. • Different techniques are used to deal with each error type • Research work on data quality has traditionally focused on the record matching problem • The goal in the record matching problem is to identify records in the same or different data sources/databases that refer to the same real world entity, even if the records do not match completely. • Record matching techniques only address the non-temporal error types • Challenges: • Accuracy: need to quantify the errors incurred by a cleaning technique • Efficiency: algorithms must be suitable to handle large databases (millions of records)
Record Matching Across Data Sources (diagram): DB A (Name, Lastname, DOB, POB, Address, SSN) and DB B (Name, Lastname, Phone, Address, SSN). SSN supports a key match; the remaining shared fields support a non-key, approximate match.
The Record Matching Problem • Given a pair of records A and B in two data sources (could be the same source!), the matching algorithms attempt to determine if: • A and B represent the same entity: Match • A and B represent two different entities: Non-match • Within a single database, a duplicate is a record that represents the same entity as another record in the same database. • Common record identifiers such as names, addresses, and code numbers (SSN, object identifier), are the matching variables that are used to identify matches. Ref.: Verykios, V.S., Elmagarmid, A.K., Elfeky, M., Cochinwala, M., and Dalal, S., A Survey on Record Linkage Techniques, unpublished.
Steps in record matching • Data preparation • Normalize values: bring field values to a standard format, based on the consistency rules • Searching • Build pairs of records that are potential matches within the databases • Challenge: how to avoid building all n² potential pairs • Matching • Determine whether or not a pair of records is actually a match
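The data preparation step can be sketched as a small normalization pass. The abbreviation table below is illustrative; a real rule set is domain-specific:

```python
# Sketch: normalize a field value to a standard format (case, whitespace,
# abbreviations). The Italian street-type abbreviations are examples only.
ABBREVIATIONS = {"p.za": "piazza", "pza": "piazza", "v.": "via",
                 "c.so": "corso", "l.go": "largo"}

def normalize(field_value):
    """Lowercase, collapse whitespace, expand known abbreviations."""
    tokens = field_value.lower().split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)

normalize("P.ZA  Toti 5")   # standardized form of a street address
```

Normalizing before searching and matching makes the later distance computations far more reliable.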
Searching • How many record pairs do I need to consider? (Sample address data illustrating near-duplicate variants, e.g. "FONDO D ORTO 7" vs. "FONDO DORTO 9" vs. "FONDO D ORTO 9" vs. "FONDO DORTO 7", "TRAV. LATTARO 2" appearing twice, and "A.DE GASPERI 108" vs. "CSA ALCIDE DE GASPERI 105B")
Complexity reduction in Searching • The brute-force approach is quadratic in the number of records • Cluster based on the values of some grouping variables or some a priori knowledge. Ex.: if the "place of birth" field is known to be clean, build as many clusters as there are cities • Sorting according to a comparison key brings "similar" records close to each other • Blocking: only compare records that agree on the comparison key • Windowing: only compare records that lie within a window of given size after sorting • Complexity reduction incurs the risk of missing valid matches: metrics have been developed to evaluate the accuracy of a clustering or sorting technique
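Blocking and windowing (the sorted-neighborhood idea) can be sketched as pair generators. The zip-code key used in the example is an assumed grouping variable:

```python
# Sketch of the two search-space reduction strategies described above.
from itertools import combinations

def blocking_pairs(records, key):
    """Blocking: only compare records that agree on the comparison key."""
    blocks = {}
    for r in records:
        blocks.setdefault(key(r), []).append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

def windowing_pairs(records, key, w):
    """Windowing (sorted neighborhood): sort by the key, then compare each
    record only with the w - 1 records that follow it."""
    ordered = sorted(records, key=key)
    for i, r in enumerate(ordered):
        for s in ordered[i + 1 : i + w]:
            yield (r, s)

recs = [{"zip": "00100", "name": "a"}, {"zip": "00100", "name": "b"},
        {"zip": "20121", "name": "c"}, {"zip": "20121", "name": "d"}]
pairs_b = list(blocking_pairs(recs, lambda r: r["zip"]))
pairs_w = list(windowing_pairs(recs, lambda r: (r["zip"], r["name"]), 2))
```

On these four records, brute force would build six pairs; blocking on the zip code builds two, and a window of size 2 builds three. Both strategies can miss true matches whose keys disagree, which is exactly the accuracy risk noted above.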
Matching: field values distance functions • First, we define functions that measure the distance between two values. Different functions are used for different value types. • The distance function definition depends on the specific data domain: • Standard Strings Comparison functions: There are many ways to define the distance between two strings • Which one is more appropriate? • Other data types require ad hoc comparison functions: • Ex.: “Codice Fiscale”: what is the distance between two CFs?
Simple Examples of String Distance • Simple edit distance: • d(A, B) = number of single-character inserts/deletes required to map string A to string B • d("smith", "mitch") = 2 ("mitch" + 's' - 'c' = "smith") • d("mitch", "mitcheli") = 3 • Weighted edit distances: different characters have different weights: • if blanks only count 0.5, then d("lungotevere Massimo", "lungo tevere Masimo") = 1.5 • Different fields contribute different weights: d([piazza, Toti], [via, Toti]) < 6 because d("piazza", "via") < 6 in this context • Distance can also be a discrete function, e.g.: d(piazza, via) = X, d(via, strada) = Y, …
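The insert/delete-only edit distance used in the examples above can be sketched with a standard dynamic-programming table over prefixes:

```python
# Sketch: edit distance counting only single-character insertions and
# deletions (no substitutions), matching the slide's definition.
def indel_distance(a, b):
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]            # characters agree
            else:
                d[i][j] = 1 + min(d[i - 1][j],       # delete from a
                                  d[i][j - 1])       # insert from b
    return d[len(a)][len(b)]
```

Weighted variants replace the constant cost 1 with per-character or per-field weights, as the slide describes.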
Basic Field Matching Algorithm • Degree of match of two composite fields: • "number of matching atomic strings" / "average number of atomic strings" • atomic strings S1, S2 match if: • S1 = S2, or one is a prefix of the other (an abbreviation, e.g. "Univ" and "University") • Ex.: A = "Comput Sci Eng Dept University of California San Diego", B = "Department of Computer Science Univ Calif San Diego" • k = 7 strings in A match some string in B: Comput, Sci, University, of, California, San, and Diego • The overall matching score is k / ((|A| + |B|) / 2) = 7 / ((9 + 8) / 2) ≈ 0.8
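A minimal sketch of this field matching score, counting the tokens of one field that match (exactly or as prefix/abbreviation) some token of the other:

```python
# Sketch of the basic field matching score described above.
def atomic_match(s1, s2):
    """Atomic strings match if equal, or if one is a prefix of the other
    (an abbreviation, e.g. 'Univ' vs 'University')."""
    return s1 == s2 or s1.startswith(s2) or s2.startswith(s1)

def field_match_score(a, b):
    """Number of matching atomic strings over the average token count."""
    ta, tb = a.split(), b.split()
    matched = sum(1 for s1 in ta if any(atomic_match(s1, s2) for s2 in tb))
    return matched / ((len(ta) + len(tb)) / 2)

A = "Comput Sci Eng Dept University of California San Diego"
B = "Department of Computer Science Univ Calif San Diego"
score = field_match_score(A, B)   # 7 matches over an average of 8.5 tokens
```

Note this simple sketch counts matches in one direction only and can over-match on very short tokens; production implementations refine both points.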
Other string comparison functions • The type of string comparison used depends on the specific usage context (a table of four example functions is not reproduced here). Source: W. E. Winkler, Advanced Methods for Record Linkage, US Census Bureau Tech. Report RR94/05, 1994.
“Group distance” – example of correct matches Anagrafe Tributaria: Comparing Archivio Anagrafico with data from Comuni Anagrafe Tributaria: Comparing Archivio Anagrafico with Dichiarazioni
“Group distance” – correctly identified mismatches Anagrafe Tributaria: Archivio Anagrafico Comuni: group distance = 7
Other String comparison functions • Many other functions can be used to define the distance between two strings, e.g.: • Smith-Waterman: given two strings, the Smith-Waterman algorithm uses dynamic programming to find the lowest-cost series of changes that converts one string into the other • N-grams: letter n-grams, including trigrams, bigrams, and/or unigrams, have been used in a variety of ways in text recognition and spelling correction techniques • Jaro algorithm: a string comparison algorithm that accounts for insertions, deletions, and transpositions • The Russell Soundex code is a phonetic coding, based on the assignment of code digits that are the same for any phonetically similar group of consonants. It is used most frequently to bring together variant spellings of what are essentially the same names
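As an illustration of the phonetic approach, here is a simplified Russell Soundex encoder (it omits some edge rules of the full standard, such as the treatment of name prefixes):

```python
# Simplified Russell Soundex: keep the first letter, map consonants to
# digit classes, drop vowels, collapse adjacent duplicate codes, pad to 4.
def soundex(name):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":        # h and w do not separate duplicate codes
            prev = code
    return (result + "000")[:4]
```

Variant spellings receive the same code, which is exactly what makes Soundex useful as a clustering key in the searching phase.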
Comparison Vectors • For each pair of records identified in the search phase, we compare their field values using the distance functions • Example: • A = [CF1, Nome1, Cognome1, DataNascita1, LuogoNascita1, Sesso1] • B = [CF2, Nome2, Cognome2, DataNascita2, LuogoNascita2, Sesso2] • Compare: • CF1 with CF2 → d = 3.2 • Nome1 + Cognome1 with Nome2 + Cognome2 → d = 2.4 • DataNascita1 with DataNascita2 → d = 10 • LuogoNascita1 with LuogoNascita2 → d = 3.5 • Sesso1 with Sesso2 → d = 0 • The resulting vector of comparison values is called the comparison vector: • Final vector = [3.2, 2.4, 10, 3.5, 0]
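Building a comparison vector amounts to applying a per-field distance function to each field pair. The field names and distance functions below are illustrative:

```python
# Sketch: assemble a comparison vector from per-field distance functions.
def date_distance(d1, d2):
    """A crude date distance: count of differing character positions."""
    return sum(c1 != c2 for c1, c2 in zip(d1, d2))

def comparison_vector(rec_a, rec_b, comparators):
    """Apply each (field, distance_fn) pair and collect the distances."""
    return [fn(rec_a[f], rec_b[f]) for f, fn in comparators]

comparators = [
    ("nome",         lambda a, b: 0 if a == b else 1),
    ("data_nascita", date_distance),
    ("sesso",        lambda a, b: 0 if a == b else 1),
]
a = {"nome": "MARIO ROSSI", "data_nascita": "1961-03-07", "sesso": "M"}
b = {"nome": "MARIO ROSSI", "data_nascita": "1961-03-01", "sesso": "M"}
vec = comparison_vector(a, b, comparators)
```

In the real system each field would use an appropriate distance (edit distance for names, an ad hoc function for the Codice Fiscale, and so on).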
Using Comparison Vectors: Approaches to record matching • We now have a potentially very large collection of comparison vectors. • For each vector, we need to determine if it represents: a match, a non-match, or an undecided pair • This is a classification problem • Probabilistic approach: • Hypothesis testing: “pick the match status that minimizes cost or error” • Equational theory model / the merge-purge algorithm (S. Stolfo) • Introduce knowledge that can be used to make inferences about the match status of two records
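A minimal sketch of the classification step is a weighted score over the comparison vector with two thresholds, leaving a middle region undecided for clerical review. The weights and thresholds here are illustrative, not from a fitted model:

```python
# Sketch: classify a comparison vector as match / non-match / undecided.
def classify(vector, weights, lower, upper):
    """Small weighted distances indicate a match; large ones a non-match;
    scores between the thresholds are left undecided."""
    score = sum(w * d for w, d in zip(weights, vector))
    if score <= lower:
        return "match"
    if score >= upper:
        return "non-match"
    return "undecided"

classify([0, 1, 0], [1.0, 0.5, 2.0], lower=1.0, upper=3.0)
```

The probabilistic approaches mentioned above replace the hand-picked weights and thresholds with values chosen to minimize the expected cost of classification errors.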
Examples of equational rules
    /* RULE: compare-addresses */
    similar_addrs = compare_addresses(person1.stname, person2.stname);
    /* RULEs: closer-addresses-use-zips and closer-addresses-use-states */
    very_similar_addrs = (similar_addrs && similar_city &&
                          (similar_state || similar_zip));
    /* RULEs: same-ssn-and-address and same-name-and-address */
    if ((similar_ssns || similar_names) && very_similar_addrs) {
        merge_tuples(person1, person2);
        continue;
    }
Accuracy of merge/purge • Accuracy is measured as the % of duplicates detected on synthetic data and the % of false positives (chart not reproduced)
The 1999 Ministry of Finance study on the Quality of its data Specific Goals: • To assess the quality of addresses in DB Archivio Anagrafico with respect to: • currency (how many records do not reflect the taxpayer's real address) • correctness (how many addresses are spelled correctly) • To determine the better reference source for cleaning the Archivio Anagrafico • To detect unknown duplicate Codici Fiscali (CFs) in DB Archivio Anagrafico
Approach • Addresses: • sample AA (about 275,000 records) according to various criteria (mainly, availability of corresponding Comuni data) • match addresses against recent, official residence data (Comuni) for the same taxpayers (same CF) • match addresses against recent data from Dichiarazioni for the same taxpayers (same CF) • Duplicate CFs: • sample about 600,000 records from three Comuni (Bari, Carbonara di Bari, Chioggia) • from the population, a set of 34,244 records (5.71%) are known duplicates (“connessi”) • Detect patterns of anomalies that lead to known duplicate CFs, using “connessi” as the training set. Induce heuristics. • Process entire population automatically using the induced heuristics
General Classification approach: Supervised Learning • Described in M. Cochinwala, V. Kurien, G. Lalk, and D. Shasha, Efficient Data Reconciliation, Bellcore Research, February 1998 • Uses machine learning and statistical techniques to reduce the matching complexity for large datasets • Uses a machine learning algorithm (CART, a well-known decision tree inducer) to generate matching rules • The system generates rules with very high accuracy and low complexity; the automatically generated rules increased the accuracy of the matching process from 63% to 90% • Follow-up research work is still in progress
An Extensible Framework for Data Cleaning (INRIA, France) • Apply a set of transformations to the source data • Framework: a platform offering data manipulation services: • Data Transformation for standardization of attribute values • Multi-table Matching, Approximate Join, similarity functions, clustering • Duplicate Elimination • Implementation: • SQL-like macro-operators for: • Data Mapping (CREATE MAPPING) • Data Matching (CREATE MATCH) • Data Clustering (CREATE CLUSTER) • Data Decision (CREATE DECISION)
Key contributions • Open framework with extensible library of user-defined functions for • attribute value similarity matching • clustering strategy • Expressive SQL-like macro commands. • Cartesian product semantics for tuple matching • Optimizations: • mixed execution supported by regular SQL engine + homegrown extensions • improve upon the baseline Cartesian-product complexity of the matching for large datasets (>100,000 tuples) Ref.: H.Galhardas, D. Florescu, D. Shasha, E.Simon, An Extensible Framework for Data Cleaning, Procs. EDBT, 1999
Telcordia's Data Analysis Toolkit (Toolkit Demo)
Framework • Rules and solutions differ from domain to domain • Flexibility in resolution rules is essential • Preprocessing and cleaning varies with the nature of data • However, in most cases a common pattern emerges: • Data is handled in stages, • Each stage is responsible for one step of transformation (matching, merging, cleaning, etc.) and • Stages can be chained together to produce an overall data flow. • We define a framework that allows • Any number of custom-made processing blocks • Combination of those blocks in various ways
Approach • Data-flow architecture • The framework enforces a simple producer-consumer model between pairs of functional blocks • Workflow-style, graphical design of ad hoc analysis processes • Flexibility and extensibility: • Functional blocks are independent components that interoperate through a data-flow infrastructure • Specific user-defined matching, clustering, and other analysis algorithms are encapsulated and can be plugged in
Summary and conclusions • Data Quality improvement calls for a comprehensive methodology that includes • Process analysis and re-engineering • Data analysis and cleaning • Data cleaning and data analysis automation are still more an art than a science: • A variety of sophisticated techniques has been used to tackle the cleaning problem (record matching), borrowing from Statistics, Machine Learning, Text Retrieval • Current commercial tools are expensive and only solve part of the problem • Developing cleaning solutions that include re-engineering requires experience and domain expertise • The "open toolkit" approach is beginning to enter the DQ tools space
References • Verykios, V. S., Elmagarmid, A. K., Elfeky, M., Cochinwala, M., and Dalal, S., On the Completeness and Accuracy of the Record Matching Process, Proc. 2000 Conference on Information Quality, pp. 54-69, Boston, MA, October 2000. • Hernández, M. A. and Stolfo, S. J., The Merge/Purge Problem for Large Databases, Proc. 1995 ACM SIGMOD Conference, pp. 127-138, 1995. • Galhardas, H., Florescu, D., Shasha, D., and Simon, E., An Extensible Framework for Data Cleaning, Procs. EDBT, 1999. • Cochinwala, M., Kurien, V., Lalk, G., and Shasha, D., Efficient Data Reconciliation, Bellcore Research, February 1998. • Hylton, J. A., Identifying and Merging Related Bibliographic Records, Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1996. • Jaro, M. A., Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida, Journal of the American Statistical Association 84 (1989), no. 406, pp. 414-420. • Monge, A. E. and Elkan, C. P., The Field Matching Problem: Algorithms and Applications, Proc. Second International Conference on Knowledge Discovery and Data Mining, pp. 267-270, AAAI Press, 1996. • Winkler, W. E., Advanced Methods for Record Linking, Proceedings of the Section on Survey Research Methods (American Statistical Association), pp. 467-472, 1994. • Winkler, W. E., The State of Record Linkage and Current Research Problems, Proceedings of the Survey Methods Section, pp. 73-79, 1999.