
Finding and Fixing Data Quality Problems NEARC Fall, 2010 Newport, RI


Presentation Transcript


  1. Finding and Fixing Data Quality Problems, NEARC Fall 2010, Newport, RI. Brian Hebert, Solutions Architect, www.scribekey.com

  2. Goal: Help You Improve Your Data • Provide definition for Data Quality • Consider Data Quality within context of several data integration scenarios • Suggest a framework and workflow for improving Data Quality • Review tools and techniques, independent of specific products and platforms • Help you plan and execute a Data Quality improvement project or program • Review take-aways and Q&A www.scribekey.com

  3. Essential Data Quality Components: Meaning, Structure, Contents. Data is well understood, well structured, and fully populated with the right values FOR END USE. Note: these fundamental elements of data quality overlap. www.scribekey.com

  4. Data Quality (DQ) Defined Meaning: Names and definitions of all layers and attributes are fully understood and clear for the end-user community (a.k.a. semantics). Structure: The appropriate database design is used, including attribute data types, lengths, formats, domains (lookup tables), and relationships. Contents: The actual data contents are fully populated with valid values and match the meaning and structure. Metadata: Meaning, Structure, and Contents are described in a Data Dictionary or a similar metadata artifact. www.scribekey.com

  5. Scenarios: DQ Improvement as Data Integration • You want to improve the data quality in a stand-alone, independent dataset: some aspect of meaning, structure, or contents can be improved (Source -> Target). • You want to combine multiple disparate datasets into a single representation: departments, organizations, systems, or functions are merging or need to share information (Source1 + Source2 -> Target). For both cases, many of the same tools and techniques can be used. In fact, in a divide-and-conquer approach, it is often beneficial to start with the single-dataset case. www.scribekey.com

  6. Typical Data Quality/Integration Situations Data is in different formats, schemas, or versions, but provides some of the same information. Examples: • You need to clean up a single existing dataset • 2 departments in a utility company: customer billing and outage management, central db and field operations • Merging 2 separate databases/systems: getting town CAMA data into GIS • Consolidating N datasets: MassGIS Parcel Database, CDC disease records from individual states • 2 city/state/federal organizations: Transportation and Emergency Management need a common view • Preparing for Enterprise Application Integration: wrapping legacy systems in XML web services www.scribekey.com

  7. Scenario 1 Case Study: Cleaning Up Facility Data • An organization maintains information on facility assets. • The information includes data describing basic location, facility type, size, function, and contact information. • The organization needs a decision support database. • The data has some quality issues. • The case is somewhat generic; it could apply to buildings, complexes, sub-stations, exchange centers, industrial plants, etc. • Idea: you may recognize some of these issues in your own data. www.scribekey.com

  8. Solution: Workflow Framework and Foundation. A data Integration Support Database plus a set of ordered tasks, run as iterative operations around a central RDB: Inventory, Collection, Data Profiling, Standardize & Map/Gap, Schema Generation, ETL, Validation, and Applications. www.scribekey.com

  9. Solution Support: Ordered Workflow Steps • Inventory: What do we have, where, who, etc.? • Collection: Get some samples. • Data Profiling and Assessment: Capture and analyze the meaning, structure, and content of the source(s). • Schema Generation: One or several steps to determine what the target data should look like. • Schema Gap and Map: What are the differences, and how do we get from A to B? • ETL: Physical implementation of getting from A to B: code, SQL, scripts, etc. • Validation: Do results match goals? • Applications: Test the data through applications and aggregations. • Repeat Processing for Updates: Swap in a newer version of data source A. www.scribekey.com

  10. Inventory: The Dublin Core (+) http://dublincore.org/documents/dces Question: How do you capture information on existing data? www.scribekey.com

  11. Multiple Data Description Sources for Inventory: website, metadata, documentation, email, people/SMEs, and the data itself. Gather info about data from a variety of sources. www.scribekey.com

  12. The Data Profile: Meaning, Structure, Contents. The Table Profile is helpful for getting a good overall idea of what's in a database. The Column Profile is helpful for getting a detailed understanding of database structure and contents. www.scribekey.com

  13. How Data Profiling Works (Data Profiling and Metadata Import). The Data Profiler reads source datasets (e.g., Roads with FGDC XML metadata, Parcels with a Data Dictionary, Buildings with no metadata beyond end users) and loads the results, along with imported XML metadata, into the Integration Support DB. • The profiler is an application that reads through data and gets names, structure, contents, patterns, and summary statistics. You can also learn about data through documentation and end users. www.scribekey.com

  14. Data Profiling: Table Information • The Table Profile gives a good overview of record counts, number of columns, nulls, general completeness, and a list of column names. • Very helpful for quickly getting an idea of what’s in a database and comparing sets of data which need to be integrated. www.scribekey.com

  15. Data Profiling: Column Information • The Columns Profile provides detailed information on the structure and contents of the fields in the database. • It provides the foundation for much of the integration work that follows. www.scribekey.com
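
As an illustration of the kind of queries a profiler runs under the hood, here is a minimal sketch in plain SQL; the Facilities table and its column names are hypothetical stand-ins for the case-study data.

    -- Table-level profile: record count, null counts, distinct values, value ranges.
    SELECT COUNT(*)                                         AS RecordCount,
           SUM(CASE WHEN Address IS NULL THEN 1 ELSE 0 END) AS AddressNulls,
           COUNT(DISTINCT FacilityType)                     AS FacilityTypeValues,
           MIN(NumEmployees)                                AS MinEmployees,
           MAX(NumEmployees)                                AS MaxEmployees
    FROM Facilities;

    -- Column-level profile: value frequencies, a candidate list domain.
    SELECT FacilityType, COUNT(*) AS Freq
    FROM Facilities
    GROUP BY FacilityType
    ORDER BY Freq DESC;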

  16. Data Profiling: Domain Information • Domains provide one of the most essential data quality control elements. • List Domains are lists of valid values for given database columns, e.g., standard state abbreviations MA, CT, RI, etc. for State • Range domains provide minimum and maximum values, primarily used for numeric data types. • Many domains can be discovered and/or generated from the profile. Question: Do you use profiling to get a concise summary of data details? www.scribekey.com
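
To make the idea concrete, a small sketch of list- and range-domain checks in plain SQL; the table, columns, and the New England state list below are illustrative assumptions, not part of the original deck.

    -- List domain check: rows whose State is not a valid abbreviation.
    SELECT FacilityId, State
    FROM Facilities
    WHERE State NOT IN ('MA', 'CT', 'RI', 'VT', 'NH', 'ME');

    -- Range domain check: rows whose numeric value falls outside the allowed range.
    SELECT FacilityId, NumEmployees
    FROM Facilities
    WHERE NumEmployees NOT BETWEEN 1 AND 10000;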

  17. Workshop Exercise: Example Profile www.scribekey.com

  18. Workshop Exercise: Column Profile Analysis • Are the column name and definition clear? • Are there other columns in other tables with the same kind of data and intent but a better name? • Is the data type appropriate? E.g., numeric data and dates can often be found in text fields. Is the length appropriate? • Is this a unique primary key? Does the number of values equal the number of records? Is it a foreign key? • Is this column being used? Is it empty? How many null values are there? What percent complete is the value set? • Is there a rule which can be used to validate data in this column, such as: • List Domain or Range Domain • Specific Format (regular expression) • Other rules, possibly involving other columns • Does the data indicate that the column should be split into another table using a 1->N parent-child relationship? The profile itself is generated automatically. The real value is in the results of the analysis: what needs to be done to the data? www.scribekey.com

  19. Data Profilers • http://www.talend.com/products-data-quality/talend-open-profiler.php • http://en.wikipedia.org/wiki/DataCleaner • http://weblogs.sqlteam.com/derekc/archive/2008/05/20/60603.aspx • http://www.dba-oracle.com/oracle_news/2005_12_29_Profiling_and_Cleansing_Data_using_OWB_Part1.htm • Request form at www.scribekey.com for profiler (shareware) Learn about and test profilers with your own data. www.scribekey.com

  20. Schema Generation Options • Use a data-driven approach: define the new schema as a more formally defined version of the meaning, structure, and contents of the source data. • Use an external, independent target schema. Sometimes this is a requirement. • In a divide-and-conquer approach, use the data-driven schema first as a staging schema. Improve the data in and of itself. Then consider ETL to a more formal, possibly standard, external target schema. • Use a hybrid, combining elements of both data-driven and external target schemas. www.scribekey.com

  21. Data Model Differences: Production vs. Decision Support. Production: normalized for referential integrity; queries are complex and slower performing; data is edited. Decision support: de-normalized for easily formed and faster performing queries; data is read-only. The data models and supporting tools used in data warehousing are significantly different from those found across the geospatial community. Geospatial data modelers tend to incorrectly use production models for decision support databases. www.scribekey.com

  22. Normalization • Normalization can get complicated: 1st, 2nd, and 3rd Normal Forms, Boyce-Codd, etc. • Some important basics: • Don't put multiple values in a single field • Don't grow a table column-wise by adding new columns over time • Have a primary key • However, you should de-normalize when designing read-only decision support databases to facilitate easy query formation and better performance. www.scribekey.com

  23. De-Normalization and Heavy Indexing Makes Queries Easier and Faster • 1 de-normalized table (FACILITIES): SELECT TYPE, LOCATION FROM FACILITIES • 3 normalized tables (FACILITIES, FACILITY_TYPES, LOCATIONS): SELECT FACILITY_TYPES.TYPE, LOCATIONS.LOCATION FROM (FACILITIES INNER JOIN FACILITY_TYPES ON FACILITIES.TYPE = FACILITY_TYPES.ID) INNER JOIN LOCATIONS ON FACILITIES.LOCATIONID = LOCATIONS.ID; • NAVTEQ SDC data is a good example: de-normalized (e.g., County Name and FIPS), highly indexed, very fast and easy to use. www.scribekey.com

  24. Distinguish between Decision Support and Production Database Models !!! Use both when necessary. Production stack: Presentation Layer, Business Logic Middle Tier (UML, OO language), Data Access Layer, OLTP Database. Decision support stack: Presentation Layer, no Middle Tier, Data Access Layer, OLAP Database. Production OLTP database solutions typically use a middle tier for representing higher-level business objects and rules. This middle tier is often designed using UML and implemented with an object-oriented programming language. Decision Support OLAP database solutions typically have no middle tier; they present and access data directly through the query language behind pivot tables and report generators. www.scribekey.com

  25. Standardization, Modeling, and Mapping: close the gap between the source data and the target schema. Real raw data: imperfect, inconsistent types, not normalized, no domains. Abstract schema: strong types, highly normalized, lots of domains, "perfect". The solution lies in closing this data/schema gap. Use real data to inform and keep your modeling efforts focused. www.scribekey.com

  26. Database Refactoring Approach • The design patterns approach (Gang of Four book) is great for new systems. • Innovation: Martin Fowler, Code Refactoring: fix what you have. • Agile database: Scott Ambler, Refactoring Databases. • You are not starting from scratch; you need to make modifications to something which is being used. • See the list of refactorings. www.scribekey.com

  27. Sidebar: Relationship Discovery and Schema Matching: Entities, Attributes, and Domain Values. Schema matching is necessary to discover and specify how categories, entities, attributes, and domains from one system map into another. Matching discovers relationships; mapping specifies transforms. These maps are stored in XML documents (FME, Pervasive, etc.). As with metadata, it can be useful to store these in an RDB as well. www.scribekey.com

  28. Schema Matching: Map & Gap • A Gap Analysis exercise can precede a full mapping to get a general idea on how two datasets relate. • Gap Analysis is always in a direction from one set of data elements to another. • Simple scores can be added to give an overall metric of how things relate. www.scribekey.com

  29. Sidebar: Schema Matching Entities Is Also Important. Examples: multiple hierarchical feature sets, node and edge networks, multiple geometric representations, multiple occurrences, multiple locations, polygon/centroid relationships. Well-documented schema matching information, as metadata, helps reduce and/or eliminate confusion for integration developers and end users. www.scribekey.com

  30. Mechanics of ETL: The Heart of the Matter • Change Case • Add Primary Key • Add Foreign Key • Add Constraints, Use RDB • Split Table to N->1 • Pivot • Merge Tables • Remove Duplicates • Remove 1 to 1 Tables • Fill in Missing Values • Remove Empty Fields • Remove Redundant Fields • Verify with 2nd source • Change Name • Change Definition • Add Definition • Change Type • Change Length • Trim • Use List Domain • Use Range Domain • Split • Merge • Reformat • Use a Default • Create View www.scribekey.com

  31. Use a Staging Data Store; Keep a Separate MR • Source: keep a definitive snapshot copy of the source; don't change it. • Staging: execute ETL in a staging data store. Expect multiple iterations; temporarily relaxed data types will be necessary. • MR: don't mix actual data with metadata repository information; keep separate databases. • Target: build the final target dataset from the staging data store. www.scribekey.com

  32. Choosing the Right Tool(s) • SQL • FME • ESRI Model Builder • Pervasive (formerly Data Junction) • Microsoft SQL Server Integration Services • Oracle Warehouse Builder • Talend • C#/VB.NET/OLE-DB • Java/JDBC • Scripts: VB, JS, Python, Perl • Business Objects, Informatica Make the best use of the skills you have on your team. DB vs. code situations and teams. Use a combination of methods. www.scribekey.com

  33. Sidebar: Use SQL Views to Simplify Read-Only Data • SQL Views provide a flexible and easy mechanism for de-normalizing data coming from production databases. • Views are typically what the end user in a database application sees. They hide the multi-table complexity lying underneath in the physical model. • Think of the analysis database as a user that doesn't need or want to see this underlying complexity. • Good approach for generating GIS parcel datasets from CAMA property records. • Can instantiate views as hard tables and update them on some regular basis with batch SQL (an SQL view over multiple linked tables). www.scribekey.com
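
A minimal sketch of this approach, assuming hypothetical CAMA-style tables (Parcels, Owners, Assessments): the view de-normalizes them, and a SELECT INTO (SQL Server style) instantiates it as a hard table for the read-only GIS dataset.

    -- De-normalizing view over normalized CAMA tables.
    CREATE VIEW ParcelFlatView AS
    SELECT p.ParcelId, p.MapLot, o.OwnerName, a.AssessedValue
    FROM Parcels p
    INNER JOIN Owners o      ON o.OwnerId  = p.OwnerId
    INNER JOIN Assessments a ON a.ParcelId = p.ParcelId;

    -- Instantiate the view as a hard table on a regular batch schedule.
    SELECT * INTO ParcelFlat FROM ParcelFlatView;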

  34. ETL: SQL Data Manipulation • left(Address, instr(Address, ' ')) as AddressNum • left(POCName, instr(POCName, ' ')) as FirstName • right(POCName, len(POCName)-instr(POCName, ' ')) as LastName • int(numEmployees) as NumPersonnel • right(Address, len(Address)-instr(Address, ' ')) as StreetName • '(' & [Area Code] & ')' & '-' & Tel as Telephone • iif(instr(Tel, 'x')>0, right(Tel, len(Tel)-instr(Tel, 'x')), null) as TelExtension • ucase(Town) as City • iif(len(Zip)=5, null, right(Zip,4)) AS Zip4 • iif(len(Zip)=5, Zip, left(Zip,5)) AS Zip5 www.scribekey.com
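
These expressions are in the Access/Jet dialect (iif, instr, ucase, len). As a sketch of how a few of them might be wrapped into a single staging load, with the target table name FacilitiesStaging assumed for illustration:

    SELECT left(Address, instr(Address, ' '))   AS AddressNum,
           ucase(Town)                          AS City,
           iif(len(Zip) = 5, Zip, left(Zip, 5)) AS Zip5
    INTO FacilitiesStaging
    FROM AcmeFac;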

  35. ETL: Look Up Tables • Clean and consistent domains are one of the most important things you can use to help improve data quality. • As an example, consider the use of a single central master street list for a state, town, utility, etc. • One approach is to collect all of the variations in a single look up table and match them with the appropriate value from the master table. www.scribekey.com
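
A hedged sketch of that approach in fairly portable SQL, assuming a StreetLookup table with Variation and MasterName columns and a FacStaging staging table:

    -- Replace known street-name variations with the master value.
    UPDATE FacStaging
    SET StreetName = (SELECT MasterName
                      FROM StreetLookup
                      WHERE StreetLookup.Variation = FacStaging.StreetName)
    WHERE StreetName IN (SELECT Variation FROM StreetLookup);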

  36. ETL: Domain and List Cleaning Tools • There are powerful tools available to help match variation and master values. • Example: SQL Server Integration Services Fuzzy Look Up and Fuzzy Grouping Tools: http://msdn.microsoft.com/en-us/library/ms137786.aspx • These can be used to create easily reusable batch processing tools. • These list cleansing tools are often unfamiliar to geospatial data teams. • The saved match lists are also valuable for processing new data sets. www.scribekey.com
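
Where those tools are not available, a first rough pass can be approximated with SOUNDEX (supported in SQL Server, MySQL, and Oracle); this is only a crude stand-in for the Fuzzy Lookup/Grouping tools above, and the table and column names are hypothetical.

    -- Pair street-name variations with master names that sound alike.
    SELECT s.StreetName AS Variation, m.MasterName
    FROM StagingStreets s
    INNER JOIN MasterStreets m ON SOUNDEX(s.StreetName) = SOUNDEX(m.MasterName)
    WHERE s.StreetName <> m.MasterName;

The candidate pairs still need human review before being added to the saved match list.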

  37. ETL: Regular Expressions • Regular Expressions provide a powerful means for validation and ETL, through sophisticated matching and replacement routines. • Example: the original Notes field has line feed control characters. We want to replace them with spaces. • Solution: Match "[\x00-\x1f]" Replace With: " " • Need code in C#, VB, Python, etc. newVal = Regex.Replace(origVal, regExpMatch, regExpReplace); www.scribekey.com

  38. ETL: Regular Expressions (cont.) Start and grow a list of regular expression match and replace values. Keep these in the Metadata Repository. Need code in C#, VB, Python, etc. www.scribekey.com

  39. ETL: Custom Code is Sometimes Required • There is no simple SQL solution. • The problem is more complicated than a simple name, type, or conditional value change. • You use your preferred programming language and write custom ETL. www.scribekey.com

  40. ETL: Extending SQL • Most dialects of SQL (Oracle, SQL Server, Access, etc.) allow you to develop custom functions in a language like C#, Java, etc. • For example, you can build a RegExpMatchAndReplace function and then use it in SQL. • You can also add table look up, scripting, etc. • Very powerful and flexible approach. • Example: UPDATE FACILITIES SET NOTES = REGEXP("[\x00-\x1f]", " "), FACILITY_TYPE = LOOKUP(THISVAL) www.scribekey.com

  41. ETL: Use Geo Processing to Fill In Attributes • Example: we want to fill in missing Zip4 values for Facility points from a polygon set. • This is particularly valuable for creating hierarchical roll-up/drill-down aggregation datasets. • Use the ArcToolbox Spatial Join. www.scribekey.com
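
The slide uses the ArcToolbox Spatial Join; as an alternative sketch, the same fill-in can be expressed in PostGIS-style SQL, with hypothetical facilities and zip4_polygons tables:

    -- Copy Zip4 from the containing polygon onto facility points that are missing it.
    UPDATE facilities f
    SET zip4 = z.zip4
    FROM zip4_polygons z
    WHERE f.zip4 IS NULL
      AND ST_Within(f.geom, z.geom);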

  42. ETL: Breaking Out N to 1 • This problem occurs very frequently when cleaning up datasets. • We have repeating columns to capture annual facility inspections. • This data should be pivoted and moved to another child table • We can use SQL and UNION capability to get what we want. • Can also reformat for consistency at the same time. www.scribekey.com

  43. ETL: Use of Union To Pivot and Break Out SELECT Id as FacilityId, 2000 as Year, iif(ucase(Insp2000) = 'Y' or ucase(Insp2000) = 'T', 'Yes', Insp2000) as Inspected from AcmeFac UNION SELECT Id as FacilityId, 2005 as Year, iif(ucase(Inspect05) = 'Y' or ucase(Inspect05) = 'T', 'Yes', Inspect05) as Inspected from AcmeFac UNION SELECT Id as FacilityId, 2010 as Year, iif(ucase(Insp_2010) = 'Y' or ucase(Insp_2010) = 'T', 'Yes', Insp_2010) as Inspected from AcmeFac www.scribekey.com

  44. ETL: One-Off vs. Repeatable Transforms • Relying on manual tweaks and adjustments to complete and fill in correct data values is problematic if you need to set up repeatable ETL processes. • It's much harder and more complicated to set up repeatable, ordered ETL routines, but well worth it if the data is going to be updated on an on-going basis. • Example: a dataset is cleaned with lots of SQL, scripts, manual tweaks, etc. When a newer dataset is made available, the same tasks need to be repeated, but the details and the order in which they were performed were not saved. • Suggestion: be very aware of whether you are doing ETL as a one-off vs. something that will have to be repeated, plan accordingly, and save your work. www.scribekey.com

  45. Bite the Bullet: Manual ETL • Sometimes the data problems are so tricky that you decide to do a manual clean up. • You could probably come up with code containing a very large number of conditional tests and solutions (a brain-teaser), but it would take longer than just cleaning the data by hand. • It depends on whether you are doing a one-off or need to build something for repeatable import. • This also applies to positioning geospatial features against a base map or ortho-image, e.g., after geocoding, for populating x,y attributes. www.scribekey.com

  46. Checking Against External Sources • In some cases the only way to fill in a missing value is to contact the original source of the data. • This can be highly granular and time consuming. • You need to decide how important it is to have 100% complete data. • This can be a case of diminishing returns. • One of the only ways to actually ensure that a list of feature values is complete is by checking against an external source. • In this case, the data in and of itself does not necessarily provide a definitive version of the truth. • You cannot tell what may be missing or what may be incorrectly included in the list. • Get redundant external information whenever it's available. www.scribekey.com

  47. Storing ETL Details in the Metadata Repository • The combination of the MR and code based tools provides a very flexible and powerful environment for improving data quality. • Many actions and parameters can be stored in the MR including LookUp, RegExp, SQL Select Clauses, and even runtime evaluation routines for code and scripts. • Example: Translator makes use of configurable ETL Map and Action table found in the Metadata Repository www.scribekey.com
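
A minimal sketch of what such an ETL map/action table might look like; the structure and sample rows are hypothetical illustrations, not ScribeKey's actual schema.

    CREATE TABLE EtlMap (
        MapId        INT PRIMARY KEY,
        SourceTable  VARCHAR(64),
        SourceColumn VARCHAR(64),
        TargetTable  VARCHAR(64),
        TargetColumn VARCHAR(64),
        EtlAction    VARCHAR(32),   -- e.g. TRIM, UCASE, LOOKUP, REGEXP
        ActionParams VARCHAR(255)   -- e.g. lookup table name or regex pattern
    );

    INSERT INTO EtlMap VALUES (1, 'AcmeFac', 'Town',  'Facilities', 'City',  'UCASE',  NULL);
    INSERT INTO EtlMap VALUES (2, 'AcmeFac', 'Notes', 'Facilities', 'Notes', 'REGEXP', '[\x00-\x1f]');

A translator can then read these rows at run time and apply each action in order, so the ETL logic lives in data rather than in one-off scripts.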

  48. ETL: Staging Table Design and SQL Command Types • Option 1: keep separate Source and Target tables; requires joins (INSERT, UPDATE, SELECT). • Option 2: merge Source and Target into a single staging table; harder to create, and column names must be unique, but no joins are needed (UPDATE, then SELECT out). • Decide what kind of complexity is preferred. • Can also split attributes and geometry (a size factor) and use keys: initial insert, then updates, then recombine. • Build a single large SQL statement for creating a view or hard table from the results. www.scribekey.com
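
A sketch of the merged-staging pattern in SQL Server-style SQL, with hypothetical table and column names (run the steps as separate batches): copy the source snapshot, add target columns alongside the source columns, populate them in place, then select the finished columns out.

    -- 1. Copy the source snapshot into the staging table.
    SELECT * INTO FacStaging FROM AcmeFac;

    -- 2. Add target columns next to the source columns.
    ALTER TABLE FacStaging ADD City VARCHAR(50);

    -- 3. Populate the target columns in place; no joins needed.
    UPDATE FacStaging SET City = UPPER(Town);

    -- 4. Select the target columns out into the final table.
    SELECT Id AS FacilityId, City
    INTO FacilitiesTarget
    FROM FacStaging;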

  49. ETL: Loop Design: Row, Column, Value Processing Order • Row wise: an SQL error fails the entire row, and no values are changed (UPDATE, INSERT, SELECT INTO). • Column wise: a single UPDATE statement against all columns in the table; fast performance, making use of the database engine. • Cell wise: an UPDATE handles each Row.Column value individually; high granularity and control, slower performance. www.scribekey.com
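
For example (hypothetical staging table and values), a column-wise pass is one set-based UPDATE, while a cell-wise fix touches a single row/column value addressed by its key:

    -- Column-wise: one UPDATE statement cleans whole columns at once.
    UPDATE FacStaging
    SET City  = UPPER(City),
        State = UPPER(State);

    -- Cell-wise: one UPDATE per individual value, driven by the record key.
    UPDATE FacStaging
    SET City = 'NEWPORT'
    WHERE Id = 1042;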

  50. Finding Errors and Validation • Run a post-ETL profile to see before and after data. • A contents checker takes table and column names and their rules, looks at the actual values, and checks them against List Domains, Range Domains, and Regular Expressions. • Output describing data problems is written to an audit table and is used to help find and fix errors. Audit info has the table, a key to look up, the column name, the problem, and the value. Question: How do you validate data after it has been transformed? www.scribekey.com
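
A hedged sketch of how such a checker might log list-domain violations to the audit table; the AuditProblems and StateDomain tables and their columns are assumptions for illustration, following the fields named above (table, key, column, problem, value).

    -- Record every staging row whose State value is not in the list domain.
    INSERT INTO AuditProblems (TableName, RecordKey, ColumnName, Problem, BadValue)
    SELECT 'FacStaging', Id, 'State', 'Value not in list domain', State
    FROM FacStaging
    WHERE State NOT IN (SELECT Abbrev FROM StateDomain);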
