1.41k likes | 1.7k Views
Data Cleaning and Transformation. Helena Galhardas DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative Information Systems”, Carlo Batini, Tiziana Catarci, Monica Scannapieco, 23rd International Conference on Conceptual Modelling (ER 2004) ). Agenda. Introduction
E N D
Data Cleaning and Transformation Helena Galhardas DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative Information Systems”, Carlo Batini, Tiziana Catarci, Monica Scannapieco, 23rd International Conference on Conceptual Modelling (ER 2004))
Agenda • Introduction • Data (Quality) Problems • Data Cleaning and Transformation • Data Quality • (Data Quality) Dimensions • Techniques • Object Identification • Data integration/fusion • Tools
Data Extraction Data Transformation Data Loading When materializing the integrated data (data warehousing)… SOURCE DATA TARGET DATA ... ... ETL: Extraction, Transformation and Loading 70% of the time in a datawarehousing project is spent with the ETL process
Why Data Cleaning and Transformation? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • e.g., occupation=“” noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field) • e.g., Salary=“-10” inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials) • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C” • e.g., discrepancy between duplicate records
Why Is Data Dirty? • Incomplete data comes from: • non available data value when collected • different criteria between the time when the data was collected and when it is analyzed. • human/hardware/software problems • Noisy data comes from: • data collection: faulty instruments • data entry: human or computer errors • data transmission • Inconsistent (and redundant) data comes from: • Different data sources, so non uniform naming conventions/data codes • Functional dependency and/or referential integrity violation
Why is Data Quality Important? Activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e., Cleaning and Transforming to get… High-quality data! • No quality data, no quality decisions! • Quality decisions must be based on good quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)
Data Quality: Multidimensional Concept • Completeness • Accuracy • Jhn vs. John • Currency • Residence (Permanent) Address: out-dated vs. up-to-dated • Consistency • ZIP Code and City consistent
Application contexts • Eliminate errors and duplicates within a single source • E.g., duplicates in a file of customers • Integrate data from different sources • E.g.,populating a DW from different operational data stores • Migrate data from a source schema into a different fixed target schema • E.g., discontinued application packages • Convert poorly structured data into structured data • E.g., processing data collected from the Web
yes semi totally yes Types of information systems Distribution CISs P2P Systems Monolithic DW Systems Autonomy no Heterogeneity
Types of data • [Dasu et al 2003] • Federated data, that come form different heterogeneous sources • Massive high dimensional data • Descriptive data • Longitudinal data, consisting in time series • Web data • [Batini et al 2003] • Elementary data • Aggregated data • Many authors • Structured • Semi structured • Unstructured • [Shankaranaian et al 2000] • Raw data • Component data • Information products • [Bouzenghoub et al 2004] • Stable, • Long term changing • Frequently changing
Research areas Research areas in DQ systems Models Methodologies Dimensions Measurement/Improvement Techniques Measurement/Improvement Tools and Frameworks EGov Scientific Data Web Data Application Domains …
Research issues related to DQ • Source Selection • Source Composition • Query Result Selection • Time Syncronization • … • Record Matching(deduplication) • Data Transformation • … • Conflict Resolution • Record Matching • … Data Quality Data Integration Data Cleaning Statistical Data Analysis Data Mining • Edit-imputation • Record Linkage • … Management Information Systems • Error Localization • DB profiling • Patterns in text strings • … Knowledge Representation • Assessment • Process Improvement • Tradeoff Cost/Optimization • … • Conflict Resolution • …
Research issues mainly addressed in the book Data Quality Data Integration Data Cleaning Statistical Data Analysis Data Mining Management Information Systems Knowledge Representation
Existing technology • Ad-hoc programs written in a programming language like C or Java or using an RDBMS proprietary language • Programs difficult to optimize and maintain • RDBMS mechanisms for guaranteeing integrity constraints • Do not address important data instance problems • Data transformation scripts using an ETL (Extraction-Transformation-Loading)ordata quality tool
Human Knowledge Human Knowledge Data Transformation Typical architecture of a DQ system TARGET DATA SOURCE DATA Data Extraction Data Transformation Data Loading ... ... Data Analysis Metadata Dictionaries Schema Integration
Traditional data quality dimensions • Accuracy • Completeness • Time-related dimensions: Currency, Timeliness, and Volatility • Consistency • Schema quality dimensions are also defined • Their definitions do not provide quantitative measures so one or more metrics have to be associated • For each metric, one or more measurement methods have to be provided regrading: (i) where the measurement is taken; (ii) what data are included; (iii) the measurement device; and (iv) the scale on which results are reported.
Accuracy • Closeness between a value v and a value v’, considered as the correct representation of the real-world phenomenon that v aims to represent. • Ex: for a person name “John”, v’=John is correct, v=Jhn is incorrect • Syntatic accuracy: closeness of a value v to the elements of the corresponding definition domain D • Ex: if v=Jack, even if v’=John , v is considered syntactically correct • Measured by means of comparison functions (e.g., edit distance) that returns a score • Semantic accuracy: closeness of the value v to the true value v’ • Measured with a <yes, no> or <correct, not correct> domain • Coincides with correctness • The corresponding true value has to be known
Ganularity of accuracy definition • Accuracy may refer to: • a single value of a relation attribute • an attribute or column • a relation • the whole database • Example of metric to quantify data accuracy: • Ratio calculated between accurate values and the total number of values.
Completeness • “The extent to which data are of sufficient breadth, depth, and scope for the task in hand.” • Three types: • Schema completeness: degree to which concepts and their properties are not missing from the schema • Column completeness: measure of the missing values for a specific property or column in a table. • Population completeness: evaluates missing values with respect to a reference population
Completeness of relational data • The completeness of a table characterizes the extent to which the table represents the real world. • Can be characterized wrt: • The presence/absence and meaning of null values • Ex: Person(name, surname, birthdate, email), if email is null may indicate the person has no mail (no incompleteness), email exists but is not known (incompletenss), is is not known whether Person has an email (incompleteness may not be the case) • Validity of open world assumption (OWA) or closed world assumption (CWA) • OWA: cannot state neither the truth or falsity of facts not represented in the tuples of a relation • CWA: only the values actually present in a relational table and no other values represent facts of the real world.
Metrics for quantifying completeness (1) • Model without null values with OWA • Need a reference relation r’ for a relation r, that contains all the tuples that satisfy the schema of r C(r) = |r|/|ref(r)| • Ex: according to a registry of Lisbon municipality, the number of citizens is 2 million. If a company stores data about Lisbon citizens for the purpose of its business and that number is 1,400,000 then C(r) = 0,7
Metrics for quantifying completeness (2) • Model with null values with CWA: specific definitions for different granularities: • Values: to capture the presence of null values for some fields of a tuple • Tuple: to characterize the completeness of a tuple wrt the values of all its fields: • Evaluates the % of specified values in the tuple wrt the total number of attributes of the tuple itself • Ex: Student(stID, name, surname, vote, examdate) • (6754, Mike, Collins, 29, 7/17/2004) vs (6578, Julliane, Merrals, NULL, 7/17/2004)
Metrics for quantifying completeness (3) • Attribute:to measure the number of null values of a specific attribute in a relation • Evaluates % of specified values in the column corresponding to the attribute wrt the total number of values that should have been specified. • Ex: For calculating the average of votes in Student, a notion of the completeness of Vote should be useful • Relations: to capture the number of null values of a specific attribute in a relation • Measures how much info is represented in the relation by evaluating the content of the info actually available wrt the maximum possible content, i.e., without null values.
Time-related dimensions • Currency: concerns how promptly data are updated • Ex: if the residential address of a person is updated (it corresponds to the address where the person lives) then it is updated • Volatility: characterizes the frequency with which data vary in time • EX: Birth dates (volatility zero) vs stock quotes (high degree of volatility) • Timeliness: expresses how current data are for the task in hand • Ex: The timetable for university courses can be current by containing the most recent data, but it cannot be timely if it is available only after the start of the classes.
Metrics of time-related dimensions • Last update metadata for currency • Straightforward for data types that change with a fixed frequency • Length of time that data remain valid for volatility • Currency + check that data are available before the planned usage time for timeliness
Consistency • Captures the violation of semantic rules defined over a set of data items, where data items can be tuples of relational tables or records in a file • Integrity constraints in relational data • Domain constraints, Key, inclusion and functional dependencies • Data edits: semantic rules in statistics
Evolution of dimensions • Traditional dimensions are Accuracy, Completeness, Timeliness, Consistency • With the advent of networks, sources increase dramatically, and data become often “found data”. • Federated data, where many disparate data are integrated, are highly valued • Data collection and analysis are frequently disconnected. • As a consequence we have to revisit the concept of DQ and new dimensions become fundamental.
Other dimensions • Interpretability: concerns the documentation and metadata that are available to correctly interpret the meaning and properties of data sources • Synchronization between different time series: concerns proper integration of data having different time stamps. • Accessibility: measures the ability of the user to access the data from his/her own culture, physical status/functions, and technologies availavle.
Techniques • Relevant activities in DQ • Techniques for record linkage/object identification/record matching • Techniques for data integration
Relevant activities in DQ • Record Linkage/Object identification/Entity identification/Record matching • Data integration • Schema matching • Instance conflict resolution • Source selection • Result merging • Quality composition • Error localization/Data Auditing • Data editing-imputation/Deviation detection • Profiling • Structure induction • Data correction/data cleaning/data scrubbing • Schema cleaning • Tradeoff/cost optimization
Record Linkage/Object identification/ Entity identification/Record matching • Given two tables or two sets of tables, representing two entities/objects of the real world, find and cluster all records in tables referring to the same entity/object instance.
Data integration (1) 1. Schema matching • Takes two schemas as input and produces a mapping between semantically correspondent elements of the two schemas
Data Integration (2) 2. Instance conflicts resolution & merging • Instancelevel conflicts can be of three types: • representation conflicts, e.g. dollar vs. euro • key equivalence conflicts, i.e. same real world objects with different identifiers • attribute value conflicts, i.e. instances corresponding to same real world objects and sharing an equivalent key, differ on other attributes
Data Integration (3) 3. Result merging: it derives from the combination of individual answers into one single answer returned to the user • For numerical data, it is often called fused answer and a single value is returned as an answer, potentially differing from each of the alternatives
Data Integration (4) 4. Source selection • Querying a multidatabase with different sources characterized by different qualities 5. Quality composition • Defines an algebra for composing data quality dimension values
Error localization/Data Auditing • Given one/two/n tables or groups of tables, and a group of integrity constraints/qualities (e.g. completeness, accuracy), find records that do not respect the constraints/qualities. • Data editing-imputation • Focus on integrity constraints • Deviation detection • data checking that marks deviations as possible data errors
Profiling • Evaluating statistical properties and intensional properties of tables and records • Structure induction of a structural description, i.e. “any form of regularity that can be found”
Data correction/data cleaning/data scrubbing • Given one/two/n tables or groups of tables, and a set of identified errors in records wrt to given qualities, generates probable corrections (deviation detection) and correct the records, in such a way that new records respect the qualities.
Schema cleaning • Transform the conceptual schema in order to achieve or optimize a given set of qualities (e.g. Readability, Normalization), while preserving other properties (e.g. equivalence of content)
Tradeoff/cost optimization • Tradeoff – When some desired qualities are conflicting (e.g. completeness and consistency), optimize such properties according to a given target • Cost – Given a cost model, optimize a given request of data quality according to a cost objective coherent with the cost model
Techniques for Record Linkage/Object identification/Entity identification/Record matching
Techniques for record linkage/object identification/record matching • Introduction to techniques • General strategies • Details on specific steps • Short profiles of techniques • [Detailed description] • [Comparison]
Id Name Type of activity City Address • The same business as represented in the three most important business data bases in central agencies in Italy: • CC Chambers of Commerce • INPS Social security • INAIL Accident Insurance An example
Crlo Batini Pscara Itly Europe Pescara Carlo Batini Pscara Pscara Italy Europe Record linkage and its evolution towards object identification in databases …. Record linkage First record and second record represent the same aspect of reality? Crlo Batini 55 Carlo Btini 54 Object identification in databases First group of records and second group of records represent the same aspect of reality?
Type of data considered • (Two) formatted tables • Homogeneous in common attributes • Heterogeneous • Format (Full name vs Acronym) • Semantics (e.g. Age vs Date of Birth) • Errors • (Two) groups of tables • Dimensional hierarchy • (Two) XML documents
<country> United States <city> New York </city> <city> Los Angeles </city> <lakes> <lake> Lake Michigan </lake> </lakes> </country> …and object identification in semistructured documents <country> <name> United States of America </name> <cities> New York, Los Angeles, Chicago </cities> <lakes> <name> Lake Michigan </name> </lakes> </country> and are the same object?