460 likes | 736 Views
Information Integration Lecture 13. Data Cleaning. Michael Kassoff Spring 2003. Big Picture. Problem Statement – What is the desired end result? (“Business requirements”)
E N D
Information Integration Lecture 13 Data Cleaning Michael Kassoff Spring 2003
Big Picture • Problem Statement – What is the desired end result? (“Business requirements”) • Data Location / Collection – Find some data to integrate, and/or make some yourself. Put the data in staging tables if desired and possible. • Data Cleaning - Get rid of bad data • Data Integration– See first 12 lectures of CS 246
Staging Tables • Local tables in a database in the schema of imported data and with ‘dirty data’ • May have more staging tables as intermediates between imported schema staging tables and local schema • Especially useful if importing data infrequently
Data Cleaning • That’s what this lecture is about! • Why is data cleaning needed? • How do we clean it?
Data is often of low quality • Why? • You didn’t collect it yourself! • It probably was created for some other use, and then you came along wanting to integrate it • People make mistakes (typos) • People are busy (“this is good enough”)
Problems with data • Some data have problems on their own • Other data is problematic only when you want to integrate it
Problems due to lack of structure • Nowadays, we have databases and XML • Before, everyone had their own way of structuring data, based on what was convenient for them • Even today, not everyone uses structured data • Convenience • Lack of time • Efficiency of representation
Case study: government agency data • What we want:
First problem What’s wrong here? 1'Dept. of Transportation'New York'NY 2'Dept. of Finance'New York'NY 3'Office of Veteran's Affairs'New York'NY • The separator is used in the data. • Easy to miss if you don’t check the # of columns when parsing each row.
Second problem What’s wrong here? 1,Dept. of Transportation,New York City,NY 2,Dept. of Finance,City of New York,NY 3,Office of Veteran's Affairs,New York,NY • We need standardization / naming conventions • Could we enforce this in a database? In XML?
Third problem What’s wrong here? 1,Dept. of Transportation,New York,NY ,Dept. of Finance,New York,NY 3,Office of Veteran's Affairs,New York,NY • A missing required field • Couldn’t occur in a database because of primary key constraint • Could fix this in XML with a REQUIRED tag
Fourth problem What’s wrong here? 1,Dept. of Transportation,New York,NY Two,Dept. of Finance,New York,NY Office of Veteran's Affairs,3,New York,NY • No data type constraints • Can’t fix in XML w/ DTDs. • XML Schema fixes this problem
Fifth Problem What’s potentially a problem here? 1,Dept. of Transportation,New York,NY 2,Dept. of Finance,New York,NY 3,Commission for the United Nations Consular Corps and Protocol,New York,NY • Field longer than documentation tells us. • XML DTDs cannot help, XML Schema can help
Sixth Problem What’s wrong here? 1,Dept. of Transportation,New York,NY 2,Dept. of Finance,New York,NY 2,Office of Veteran's Affairs,New York,NY • Primary key constraint violation • XML can help us if we use ID attribute
Seventh Problem What’s wrong here? 1,Dept. of Transportation,New York,NY 2,Dept. of Finance,New York,NY 3,Dept. of Finance,New York,NY • Redundancy! • XML can’t help here • Databases can help (use constraints) • Solution is to normalize data
Eighth problem What’s wrong here? Contract 1,Dept. of Transportation,New York,NY 2,Dept. of Finance,New York,NY 3,Office of Veteran's Affairs,New York,NY • Contractual obligations disallow you from using the data as you like
Flags • More subtle if they are in a database • Most frequently used flags: 0s and 9s • Signify ‘no data’ or ‘out of range’ • Be careful: sometimes 0 is a better value for ‘out of range’ than is null (Ex: trace elements)
Typos • How to catch them? • Pay someone to manually check all data • Sort data by frequency • ‘green’ is more frequent than ‘rgeen’ • Works well for categorical data • Use constraints • Use statistical techniques
Using Constraints to Catch Corrupt Data • Foreign key constraints work well if data possibilities can be pre-enumerated • Numerical constraints • Weight can’t be negative • People can’t have more than 2 parents • Women can’t bear 80 children
Using Statistical Techniques to Catch Corrupt Data • Check for outliers (the case of the 60 foot man) • Check for correlated outliers (“pregnant males”) • People can be male • People can be pregnant • People can’t be male AND pregnant
Hidden Semantics • Even in a structured document, not all data is explicit • Order may be important, even if there is no explicit ‘order tag’ • Whitespace may be important
Hidden Semantics Example: MYCIN rules • MYCIN is a program that uses a knowledge base of rules to diagnose a patient using backward chaining • The order of the rules is important – using the same rules in a different order produces different results • There is nothing in the rules themselves that suggests the importance of order
Hidden Semantics Example: Gene Ontology (GO) • Above is some data from the GO • The whitespace is important • What does the whitespace mean? I have no idea. I’d need to look at the documentation.
Misleading data • The table on the top is data as it is given to us • The table on the bottom is what the data “means” • Whitespace is used in a misleading way
Data that that is fine on its own, but becomes problematic when you want to integrate it
Formats • Not everyone uses the same format as you • Dates are especially problematic: • 12/19/77 • 12/19/1977 • 12-19-77 • 19/12/77 • Dec 19, 1977 • 19 December 1977 • 9 in Tevet, 5738
Data that Moves • Be careful of taking snapshots of a moving target • Let’s say you want to store the price of a skillet in France, and the price of a pot in Germany • You can’t store it all in the same currency (say, US$) because the exchange rate changes • Price in foreign currency stays the same • Must keep the data in foreign currency and use the current exchange rate to convert
Data at a different level of detail than you need • If it is at a finer level of detail, you can sometimes bin it • Example • I need age ranges of 20-30, 30-40, 40-50, etc. • Imported data contains birth date • No problem! Divide data into appropriate categories
Data at a different level of detail than you need (cont’d) • Sometimes you cannot bin it • Example • I need age ranges 20-30, 30-40, 40-50 etc. • Data is of age ranges 25-35, 35-45, etc. • What to do? • Ignore age ranges because you aren’t sure • Make educated guess based on imported data (e.g., assume that # people of age 25-35 are average # of people of age 20-30 & 30-40)
Conflicting Data • Information source #1 says that George lives in Texas • Information source #2 says that George lives in Washington, DC • What to do? • Use both (He lives in both places) • Use the most recently updated piece of info • Use the “most trusted” info • Flag row to be investigated further by hand • Use neither (We’d rather be incomplete than wrong)
Data Cleaning seems like a big pain. Can I avoid it somehow?
One Possibility: Do Nothing • Maybe you don’t want to clean up the data at all • Example: RiboWeb • RiboWeb consists of integrated data from scientific papers about ribosomal structure • Even if data is known to be incorrect, it is kept in the RiboWeb database • The source of the data is contained in the schema, allowing you to judge for yourself the likely quality of the data • Author • Journal name • Whether paper was reviewed by peers
5 Steps of Data Cleaning • Semi-structure • Standardize • Local consistency check • Global consistency check • Document
Example example adapted from Dealing with Dirty Data By Ralph Kimball DBMS, September 1996 Address entry from unstructured file: Ralph B and Julianne Kimball Ste. 11613150 Hiway 9Box 1234 Boulder CrkColo 95006
Semi-structuring Also known as parsing: Addressee First Name(1): RalphAddressee Middle Initial(1): BAddressee Last Name(1): KimballAddressee First Name(2): JulianneAddressee Last Name(2): KimballStreet Address Number: 13150Street Name: Hiway 9Suite Number: 116Post Office Box Number: 1234City: Boulder CrkState: ColoFive Digit Zip: 95006
Standardize • We should replace synonyms with one standard term • Hiway 9 Highway 9 • Boulder Crk Boulder Creek • Colo Colorado
Local consistency check Does each piece of data make sense on its own? • Boulder Creek and Zip Code 95006 are in California • State was listed as Colorado • 2/3 attributes point to California as the correct state. Change state to California.
Global consistency check • Find either Ralph Kimball or Julianne Kimball in other customer records and ensure that all of the elements of all the addresses are identical. • This differs from the sanity check because it looks at other pieces of data, not just the current one
Documenting Document the results of semi-structuring, standardizing, and consistency checking in metadata • Important for users of the integrated database • Important for doing future updates of the database