A Framework for Exploring Data Quality in a Large Data System Willard Hom Institute on Research & Statistics April 8, 2004
Problem: How can we plan to explore data quality in a large data system? Basic Response: Match the needs for data quality exploration with your resources/contexts for doing so.
Note on Topic Coverage • Focus on traditional MIS data (numeric and string variables). • Not covered are other data forms (audio, visual, GIS, and narrative text/reports). • MIS staff and expertise are obviously critical, so this talk focuses mostly on contributions researchers can make.
Reasons to Explore Data Quality • Meet professional duty as researchers. • Help us to judge the types of analysis that we can do with a data system. • Prevent others from misusing data. • Facilitate the improvement of data. • Counter frequent myths about your data. • Help justify the agency’s mission & funding (esp. when data is necessary for funding).
Why Exploring Data Quality Is Hard for Organizations • Inexperience in the topic (and lack of expertise). • Concrete added expense but no clear value. • Sunk costs (incl. pressure for a time series). • Finding errors does not guarantee that perfect data will result. • Political & administrative sensitivity. • Not usually mandated.
Some Dimensions of Data Quality • Accuracy.* • Completeness.* • Consistency. • Currency. • Accessibility (Usability). (* Areas where researchers or statisticians can contribute most effectively.)
Accuracy • Closeness to “true” value of a variable. • Unbiased (absence of systematic error). • Level of precision (or “coarseness” of the recorded values).
Completeness • The degree of coverage (in terms of “cases” and of “variables”) for the analysis of a “target population.”* • No, or very few, missing values where true values exist. (* Also raises issues relating to longitudinal studies and explanatory modeling.)
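As a rough illustration of the second bullet, the share of missing values per variable could be tallied as below. This is a minimal sketch: the record layout (a list of dicts), the variable names, and the set of missing-value codes are all assumptions, and it measures item-level missingness only, not coverage of the target population.

```python
# Hypothetical missing-value codes; a real MIS would document its own.
MISSING_CODES = (None, "", "NA")


def missing_rates(records, variables):
    """Return {variable: share of records whose value is missing}."""
    records = list(records)
    n = len(records)
    rates = {}
    for var in variables:
        # A key that is absent from a record counts as missing too.
        n_missing = sum(1 for rec in records if rec.get(var) in MISSING_CODES)
        rates[var] = n_missing / n if n else float("nan")
    return rates


if __name__ == "__main__":
    sample = [
        {"goal": "transfer", "units": 12},
        {"goal": "", "units": None},
        {"goal": "job skills"},
        {"goal": "transfer", "units": 3},
    ]
    print(missing_rates(sample, ["goal", "units"]))
```

A high rate for a variable is only a prompt for follow-up; it does not say whether a true value existed to be recorded.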
Consistency • Equivalence of “instrumentation” and formatting across time, space, and other “batches” of the data collection/management environment.
Currency • Minimal lag time between occurrence of new phenomenon and the availability of data values in the system to represent the new phenomenon. • Minimal lag time between discovery of errors and the correction of those errors in historical data.
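One way to put a number on currency is to summarize the lag between the date an event occurred and the date its record became available. The sketch below assumes both dates are recorded for each case, which many systems do not guarantee; the function and field names are hypothetical.

```python
from datetime import date
from statistics import median


def lag_days(date_pairs):
    """Days between occurrence and availability for each (occurred, available) pair."""
    return [(available - occurred).days for occurred, available in date_pairs]


def lag_summary(date_pairs):
    """Median and maximum lag in days; a long maximum points to stragglers."""
    lags = lag_days(date_pairs)
    return {"median_days": median(lags), "max_days": max(lags)}


if __name__ == "__main__":
    pairs = [
        (date(2004, 1, 10), date(2004, 2, 9)),
        (date(2004, 1, 15), date(2004, 1, 25)),
        (date(2004, 3, 1), date(2004, 4, 30)),
    ]
    print(lag_summary(pairs))
```

The same summary applies to the second bullet if the pairs hold the date an error was discovered and the date it was corrected.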
Accessibility/Usability • Ease of manipulation by target users (file format, record format, field format, system compatibility) • Clarity of metadata for proper data analysis. • Breadth of access (authority for use).
Some Factors in Choosing Which Variables to Explore • Risk from errors in a variable. • Ease of error detection for a variable. • Ease of error correction for a variable. • Cost of error detection for a variable. • Cost of error correction for a variable.
Two Basic Data Exploration Tracks • Data editing and testing. • Process analysis. • Both are important to use.
Data Editing/Testing • Screen for allowable range of values. • Screen for outliers (univariate or multivariate). • Statistical quality control methods. • Screen within a record or across records. • These can prevent some error and can detect some error, but they rarely find root causes.
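The first two screens can be sketched in a few lines. The range limits are hypothetical, and the outlier rule shown is the modified z-score based on the median absolute deviation (constant 0.6745, cutoff 3.5), as described in Iglewicz & Hoaglin (1993); it is one univariate screen among many, not a recommendation from the talk.

```python
from statistics import median


def out_of_range(values, low, high):
    """Indices of values outside the allowable range [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


def modified_z_outliers(values, threshold=3.5):
    """Indices whose modified z-score exceeds the threshold in absolute value."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


if __name__ == "__main__":
    units_attempted = [12, 15, 9, 14, 13, 95, 12, -3]
    print(out_of_range(units_attempted, low=0, high=30))
    print(modified_z_outliers(units_attempted))
```

Flagged cases are candidates for review rather than confirmed errors, which is the point of the caveats on the next slide.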
Some Caveats for Editing/Testing • Some outliers are true values while some inliers are not. (Error detection is complicated.) • Testing depends upon the analyst’s ability to use some “gold standard” in a comparison. • Variables with restricted range of measurement or on a categorical scale present a different challenge than outliers for variables using an interval scale (with no range restriction or truncation).
Process Analysis • Analyze each step in the data’s history ---which includes the initial data generating step and all ensuing steps in the data processing---right down to the user of a final report or analysis.
Some Caveats for Process Analysis • This is a multi-disciplinary concept (needing at least MIS staff, social scientists, and subject matter experts). • This can cost far more (in time and resources) than the editing/testing track. • The metadata factor is important here. • It is critical for finding the root cause of data error.
Administrative Issues in Data Error • Publish data so that data originators can correct errors in the system (a feedback loop)---another benefit of data “usability.” • Consider the incentives that data creators or intermediaries have to bias data on purpose (so alter the incentives or monitor closely). • Consider factors that can motivate the production of more accurate and complete data (show that data get used and how the costs of error will hit them)---especially when lack of effort is the cause.
How Researchers Can Add Value • Models in social psychology and economics to understand a data generation/processing system. • Field observation (and interviews) of data collection process. • Verbal protocol methods for process actors. • Experimentation to develop improvements. • Statistical tools for sampling (incl. audit sampling), outliers, odd data patterns, control charts, and validity/reliability studies.
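To make the control-chart item concrete, here is a minimal sketch of a p-chart with conventional three-sigma limits applied to the error rate of successive data batches. The batch counts are invented for illustration, and the error counts are assumed to come from an audit sample of each batch.

```python
from math import sqrt


def p_chart_flags(error_counts, batch_sizes, sigmas=3.0):
    """Indices of batches whose error proportion falls outside the control limits."""
    p_bar = sum(error_counts) / sum(batch_sizes)  # pooled error proportion
    flagged = []
    for i, (errors, n) in enumerate(zip(error_counts, batch_sizes)):
        sigma = sqrt(p_bar * (1 - p_bar) / n)
        lower = max(0.0, p_bar - sigmas * sigma)
        upper = min(1.0, p_bar + sigmas * sigma)
        p = errors / n
        if p < lower or p > upper:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    errors = [10, 12, 9, 11, 40]   # audited records found in error, per batch
    sizes = [500, 500, 500, 500, 500]
    print(p_chart_flags(errors, sizes))
```

A flagged batch signals that something changed in the process; finding out what changed is the job of process analysis.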
Some Examples of Exploration • Determining the number of community college (CC) students in an academic year with a bachelor’s degree.* • Validating the students’ self-reported goals for CC enrollment (at time of initial registration).* • Checking the accuracy of a flag for first-time CC student.** (* Hypothetical example. ** Actual historical example.)
Determining the number of CC students in an academic year with a bachelor’s degree. • A differential fee for CC students who have a BA/BS could motivate students to misreport the prior attainment of a BA/BS. • A partial test for potential reporting bias could use databases of higher ed. enrollment to check for degree status in a random sample of CC students. • We could use the sample proportion (in lieu of the “population” proportion in the MIS) if the MIS proportion lies outside the sample’s 95% confidence interval.
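The decision rule in the last bullet can be sketched as follows. The interval used here is the simple normal-approximation (Wald) interval, which is an assumption on our part since the slide does not name a method, and the sample figures are invented.

```python
from math import sqrt


def wald_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a sample proportion."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def mis_within_interval(mis_proportion, successes, n):
    """True if the MIS proportion lies inside the sample's interval."""
    low, high = wald_interval(successes, n)
    return low <= mis_proportion <= high


if __name__ == "__main__":
    # Hypothetical audit: 48 of 400 sampled students verified as holding a BA/BS.
    print(wald_interval(48, 400))
    print(mis_within_interval(0.07, 48, 400))
```

For small samples or proportions near 0 or 1, the Wald interval is known to perform poorly, so another interval (for example, Wilson's) may be a better choice.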
Validating the students’ self-reported goals for CC enrollment (at time of initial registration). • Student-reported goals may lack validity if students give the question no cognitive effort. • A sample of CC students could be re-interviewed (phone or face-to-face) to check the reliability of the initial response (a case of test-retest reliability). • A qualitative evaluation could use field observation and/or post-survey de-briefing. • Researchers could use the verbal protocol method to understand the ways that students interpret the question.
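For a categorical item such as enrollment goal, test-retest reliability from the re-interview could be summarized with raw agreement and Cohen's kappa, which adjusts for agreement expected by chance. This is a sketch under the assumption that the original and re-interview responses are paired by student; the goal categories are made up.

```python
from collections import Counter


def percent_agreement(first, second):
    """Share of students giving the same answer both times."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohen_kappa(first, second):
    """Chance-corrected agreement between two paired sets of responses."""
    n = len(first)
    observed = percent_agreement(first, second)
    counts_1, counts_2 = Counter(first), Counter(second)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / n ** 2
    if expected == 1:
        return float("nan")  # everyone gave one identical answer; kappa undefined
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    at_registration = ["transfer", "transfer", "job", "job", "personal",
                       "transfer", "job", "personal", "transfer", "job"]
    at_reinterview = ["transfer", "job", "job", "job", "personal",
                      "transfer", "transfer", "personal", "transfer", "job"]
    print(percent_agreement(at_registration, at_reinterview))
    print(cohen_kappa(at_registration, at_reinterview))
```

Low agreement may reflect real changes in goals between the two interviews as well as unreliable reporting, so the time gap matters when reading the result.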
Checking the accuracy of a flag for first-time CC student. • CC students mark their status as “first-time CC students” or some other category, but field staff noted apparent reporting errors. • The state-wide MIS has records of CC enrollment by individual student over a span of years. • Programmers checked each new cohort of CC students for any prior CC enrollment.
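The cohort check in the last bullet might look like the sketch below. It is not the programmers' actual code: the record layout and identifiers are hypothetical, and it assumes that student identifiers are unique and stable across years and that the history file covers every prior CC enrollment.

```python
def check_first_time_flag(cohort, prior_enrollment_ids):
    """Compare each student's self-reported flag with the enrollment history.

    cohort: iterable of (student_id, reported_first_time) pairs.
    prior_enrollment_ids: IDs with any CC enrollment before this term.
    """
    prior = set(prior_enrollment_ids)
    wrongly_flagged = []   # reported first-time, but history shows prior enrollment
    wrongly_unflagged = [] # reported not first-time, but no prior enrollment found
    for student_id, reported_first_time in cohort:
        has_history = student_id in prior
        if reported_first_time and has_history:
            wrongly_flagged.append(student_id)
        elif not reported_first_time and not has_history:
            wrongly_unflagged.append(student_id)
    return wrongly_flagged, wrongly_unflagged


if __name__ == "__main__":
    new_cohort = [("S001", True), ("S002", True), ("S003", False), ("S004", False)]
    history = ["S002", "S003", "S900"]
    print(check_first_time_flag(new_cohort, history))
```

The second list deserves caution: a student with no record in the state-wide file may still have enrolled before the file's span of years began.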
Some Pitfalls in Data Exploration • If cases lack unique identifiers, you can’t use another data source to cross-check for data agreement. • Even if cross-referencing indicates disagreement in data values, we may not know which source, if either of them, has the correct values. • To find coverage errors (target population errors), you need alternate data and a statistical analysis of population profiles (because total N’s may agree). • Survey data demand special methods such as re-interviewing and instrument validation.
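The coverage pitfall can be illustrated with a chi-square goodness-of-fit statistic that compares the profile of cases in the system with benchmark proportions from an alternate source. The age groups, counts, and benchmark shares are invented; obtaining a trustworthy benchmark is the hard part and is simply assumed here.

```python
def chi_square_profile(system_counts, benchmark_proportions):
    """Chi-square statistic comparing system counts to a benchmark profile.

    Both arguments are dicts keyed by the same categories; the benchmark
    proportions are assumed to sum to 1 and to be greater than zero.
    """
    total = sum(system_counts.values())
    statistic = 0.0
    for category, proportion in benchmark_proportions.items():
        expected = total * proportion
        observed = system_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    # Total N matches the benchmark total, yet the age profile does not.
    counts = {"under 25": 600, "25-39": 300, "40+": 100}
    benchmark = {"under 25": 0.5, "25-39": 0.3, "40+": 0.2}
    print(chi_square_profile(counts, benchmark))
```

With three categories there are two degrees of freedom, for which the 5% critical value is about 5.99, so a statistic well above that suggests the system under-covers or over-covers some groups even though the totals agree.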
Rule of Thumb 1: Do data exploration as near to the data generating step as possible; this will help in achieving correct data.
Reasons for RoT 1: • As time and distance from the data source increase, the chances of getting correct data decrease. • As more time passes and the data move further into a data system, the risk that bad data will spread grows, making correction more difficult.
Rule of Thumb 2: It’s impossible to achieve perfect data: seek to find the levels of quality that are critical.
Reasons for RoT 2: • In large data systems, some loss of quality is inevitable. • Usually, we cannot afford to achieve perfect data. • Usually, we can achieve analytical goals with less-than-perfect data.
Rule of Thumb 3: With limited resources, we will need to trade off breadth for depth in data exploration.
Reasons for RoT 3: • In-depth data exploration takes time and expertise—which agencies usually have in limited supply.
Rule of Thumb 4: Expertise in statistical analysis and research in the relevant subject area are indispensable to effective data quality exploration (in concert with MIS staff).
Reasons for RoT 4: • Staff who have only MIS expertise (with no expertise in statistical analysis or the relevant research topic): 1. cannot fully understand the level of data quality (accuracy and completeness) needed, and 2. cannot use the various statistical methods to detect potentially erroneous data.
How Critical Is Data Quality for Your System? • Are the fates of clients dependent on data quality? • Is program funding or program evaluation directly linked to your data? • Does your job depend upon data quality? (If your data are poor, will it be outsourced?)
Can You Document Your Current Data Quality? • Do you have a system in place that prevents data errors? • Do you have a system in place that measures data quality (and detects error)? • How rigorous are your steps to prevent error and measures of data quality?
Is There A Credibility Gap? • Do analysts/decision-makers downplay the reports or conclusions that are based on your data system? • Do analysts/decision-makers prefer alternate data sources when they draw their conclusions?
What Are Your Capacities? • Does your agency have close control over the data system (that is, “cradle-to-grave”)? • Does your agency have researchers with the skill/education to explore system data quality? • Does your agency have MIS staff with skill/education to explore system data quality? • Does your agency have time, staff availability, and funds to undertake data quality exploration? • Is the data system a stable, long-term operation?
Is There Management Support? • Does management want a short-term solution---basically a “defensive” agenda---just find ways to rebut criticisms of your data’s quality? • Does management want a long-term solution to data quality issues---a comprehensive strategy to prevent error and to raise quality?
P.S. Some Factors in Initial Data Quality Problems • Researchers may not have an active role in the design of data systems (an administrative and political issue)---if the agency has qualified researchers at all. • Researchers may not have enough time, tools, or special training to help plan valuable outputs that a proposed data system could deliver. • Analytical needs change but systems often do not adapt well to emerging needs or environmental changes.
“Nutshell” Bibliography • Dasu, T. & T. Johnson. (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons: New York. • Iglewicz, B. & D.C. Hoaglin. (1993). How to Detect and Handle Outliers. American Society for Quality Control: Milwaukee, Wisconsin.
“Nutshell” Bibliography (cont.) • Naus, J.I. (1975). Data Quality Control and Editing. Marcel Dekker: New York. • Olson, J.E. (2003). Data Quality: The Accuracy Dimension. Morgan Kaufmann: San Francisco. • Redman, T.C. (1992). Data Quality: Management and Technology. Bantam: New York.
Willard Hom • Director of Research & Planning Unit, Chancellor’s Office, California Community Colleges, 1102 Q Street, Sacramento, CA 95814-6511 • E-mail: whom@cccco.edu • Phone: (916) 327-5887