Data Profiling and Quality Issues

Data Profiling and Quality Issues Alka Vaidya NIBM

Data Profiling • “Begin at the beginning” regardless of the state of the information within the enterprise • It is a fundamental step that should begin every data-driven initiative • A proactive approach to understanding the data • It discovers the data present in your organization and the characteristics of the data • It also gives you insight into your business processes and refine them over time

Data Profiling Defined • A process whereby one examines the data available in an existing database and collects statistics and information about that data. The purpose of these statistics may be to: • find out whether existing data can easily be used for other purposes • give metrics on data quality including whether the data conforms to company standards • assess the risk involved in integrating data for new applications • track data quality • assess whether metadata accurately describes the actual values in the source database • understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can incur time delays and project cost overruns. • have an enterprise view of all data, for applications such as Data Warehousing

Data Profiling Techniques • Structure Discovery • Does your data match corresponding metadata? • Does the data adhere to proper uniqueness and null value constraints? • Data Discovery • Are the data values complete, accurate and unambiguous? • Relationship Discovery • Does the data adhere to specified required key relationships across columns and tables? Are there inferred relationships across columns, tables?

Structure Discovery • Validation with Metadata • If data and metadata do not conform to each other, it may have far-reaching implications. E.g. Consider a table with 10M records, a particular field is char(255) and the longest field is 200 characters in length, you are wasting approximately 550MB of disk space • Missing values in fields that should not have missing values can cause joins to fail

Pattern Matching • Typically, it determines if the data values in a field are in the expected format. It can validate that the values in the field are consistent across the data source • It will also tell you if a field is all numeric, if a field has consistent lengths • E.g. Phone Numbers

Basic Statistics • One can learn a lot by reviewing a basic statistics about the data, especially numeric data • Reviewing statistics such as min/max/s.d. etc. can give you insight into validity of data

Content Discovery • This can help in validating rules and assessing data completeness • Contents discovery techniques include • Standardization • Frequency counts and outlier detection • Business Rule Validation

Relationship Discovery • This technique highlights potential key relationships across tables • It helps you understand how data sources interact with other data sources • It point out key violations

Data Profiling and Quality Issues