Towards a Process Oriented View on Statistical Data Quality Michaela Denk, Wilfried Grossmann
Contents • Approaches Towards Data Quality • Example: Data Integration • A Generic Statistical Workflow Model • Quality Assessment • Conclusions Grossmann, Denk
Approaches Towards Data Quality • The usual approach towards data quality is the Reporting View • Define a number of so-called quality dimensions and evaluate the final product according to criteria for these dimensions • Some frequently used dimensions: • Accuracy, Relevance, Accessibility, Timeliness, Coherence, Comparability, ...
Approaches Towards Data Quality • These dimensions are often broken down into sub-dimensions • Example Accuracy: • Sampling Effects, Representativity, Over-Coverage, Under-Coverage, Missing Values, Imputation Error, ... • Such an approach is fine as long as data production follows a predefined scheme with limited degrees of freedom
Approaches Towards Data Quality • If there are several different options for data production, such an approach is probably not the best one • Compare the ideas of Total Quality Management (TQM) in industrial production: • Systematic treatment of the influence of different production steps on the quality of the final product • We need a Processing View on data quality: how is data quality influenced by production?
Approaches Towards Data Quality • How can we arrive at a Processing View on data quality? • We need a statistical workflow model • We have to organize the processing information necessary for quality assessment in an appropriate way • Compare the (older) ideas of B. Sundgren on the capture of metadata
Approaches Towards Data Quality • We also need functions for assessing quality • Output_Quality = F(Input_Quality, Processing_Quality) • Such functions have to be applied according to • The object we are interested in, e.g. a variable, a population, or a classification • The quality aspect we are interested in
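The idea Output_Quality = F(Input_Quality, Processing_Quality) can be sketched in a few lines. The multiplicative combination rule below is an assumption chosen purely for illustration (the slides do not fix a concrete F); in practice a separate F would be needed per object and per quality aspect.

```python
def output_quality(input_quality: float, processing_quality: float) -> float:
    """Hypothetical quality assessment function F.

    Both scores are assumed to lie in [0, 1]; a processing step is assumed
    only to preserve or degrade quality, so a simple product is used here.
    """
    return input_quality * processing_quality

# Thread quality through a two-step workflow: it degrades at each step.
q = 0.95                              # quality of the raw input (assumed)
for step_quality in (0.98, 0.90):     # quality of each processing step
    q = output_quality(q, step_quality)
```

Any monotone combination rule would fit the same interface; the point is that quality is propagated along the workflow rather than assessed only on the final product.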
Example: Data Integration • Data integration occurs frequently in statistical data production, in particular when data are produced from administrative sources • It uses a number of operations usually understood as data pre-processing • Basic goal: combine information from two or more already existing data sets
Example: Data Integration • Example of a Data Integration Dataflow: Input → Integration → Post-alignment
Example: Data Integration • Top-level task description • Match the datasets according to the matching key • Align V1 (gender) • Align V2 (status)
Example: Data Integration • Details, decisions to be made • Are the datasets appropriate? • Quality of the matching keys • Quality of the data sources • Method for identification of matches? • Method for handling ambiguities in V1 (gender)? • Method for imputation of V2 (status)? • How is quality measured? • At the level of a summary measure? • At the level of a specific variable? • At the level of individual records?
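The top-level task above (match on a key, align V1, impute V2) can be sketched as follows. All field names and the conflict-handling rules are hypothetical, chosen only to make the decision points concrete; real record linkage would use far more elaborate matching and imputation methods.

```python
def integrate(src_a, src_b, key="id"):
    """Toy integration of two record lists on a common matching key.

    Aligns V1 (gender) by keeping source A's value while flagging
    conflicts, and takes V2 (status) from source B, marking it for
    imputation when absent. A deliberate simplification.
    """
    by_key = {rec[key]: rec for rec in src_b}
    matched, unmatched = [], []
    for rec in src_a:
        other = by_key.get(rec[key])
        if other is None:
            unmatched.append(rec)            # no partner record found
            continue
        merged = dict(rec)
        # Align V1 (gender): flag records where the two sources disagree.
        merged["gender_conflict"] = rec.get("gender") != other.get("gender")
        # Align V2 (status): flag records where imputation is needed.
        merged["status"] = other.get("status")
        merged["status_imputed"] = merged["status"] is None
        matched.append(merged)
    return matched, unmatched

a = [{"id": 1, "gender": "f"}, {"id": 2, "gender": "m"}, {"id": 3, "gender": "f"}]
b = [{"id": 1, "gender": "f", "status": "employed"},
     {"id": 2, "gender": "f", "status": None}]
matched, unmatched = integrate(a, b)
```

Even this toy version surfaces the questions from the slide: the conflict and imputation flags are exactly the record-level quality information a later assessment step would consume.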
Example: Data Integration • There are no generally accepted standard tools and methods for answering such questions • We will probably have to compare a number of alternative approaches • Apply the generic format to different datasets • Try different statistical methods and models • Use different methods for quality assessment • Traditional formulas • Simulation-based evaluation • Assessment by means of strategic surveys
Example: Data Integration • Conclusion • Different statistical methods may be an essential part of data production and quality assessment • There is no longer a clear distinction between “objective” data collection and statistical analysis • Statistics generates added value beyond (administrative) accounting and IT
A Generic Statistical Workflow Model • Statistical Workflow: a mixture of • Business Workflow (process-oriented) • Scientific Workflow (data-oriented) • Quality evaluation is the main control element of the process • We have to consider the workflow at two levels • Meta-level (control of the process) • Data-level (production of data)
A Generic Statistical Workflow Model • Building blocks of the workflow model • Transformations (Basic data operations) • Process components (Tasks) defined by: • Task definition • Pre-Alignment • Feasibility Check • Main Transformation • Post-Alignment • Quality Evaluation • Workflow (Sequence of Process components)
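The building blocks listed above can be sketched as a generic process component whose stages run in the stated order, with quality evaluation feeding the meta-level. The class and function names are illustrative, not part of the model itself.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ProcessComponent:
    """One task in the statistical workflow (illustrative structure only).

    Each stage is a plain function data -> data; quality_evaluation
    returns a score the meta-level can use to control the process.
    """
    name: str
    pre_alignment: Callable[[Any], Any] = lambda d: d
    feasibility_check: Callable[[Any], bool] = lambda d: True
    main_transformation: Callable[[Any], Any] = lambda d: d
    post_alignment: Callable[[Any], Any] = lambda d: d
    quality_evaluation: Callable[[Any], float] = lambda d: 1.0

    def run(self, data):
        data = self.pre_alignment(data)
        if not self.feasibility_check(data):
            raise ValueError(f"{self.name}: input not feasible")
        data = self.post_alignment(self.main_transformation(data))
        return data, self.quality_evaluation(data)

# A workflow is a sequence of process components, threaded at the data level,
# with per-component quality scores collected at the meta-level.
def run_workflow(components, data):
    scores = {}
    for comp in components:
        data, scores[comp.name] = comp.run(data)
    return data, scores
```

The two levels from the previous slide show up directly: `data` flows through the components (data-level), while `scores` accumulates the information the meta-level needs to control the process.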
A Generic Statistical Workflow Model • Example of a Data Integration Component Workflow (diagram)
A Generic Statistical Workflow Model • In order to understand how statistics influences the boxes and data quality, let us zoom into the box for post-alignment
Quality Assessment • For quality assessment we need a detailed description of the changes in meta-information during the dataflow
Quality Assessment • Example of a meta-information flow in data integration • Details for the register-based census in the presentation of Fiedler/Lenk in Session 26 (Thursday)
Quality Assessment • Example: • Assessment of the accuracy of variables V1 (gender) and V2 (status) in the example
Quality Assessment • V1 (Gender) • Input • Coincidence of the matching keys in both datasets • Agreement of the variable gender in both datasets • Beliefs about the quality of the variable in both sources • Accuracy Assessment • Models developed in decision analysis (the calculus of belief networks) seem appropriate • Alternatively, we can use a strategic sample to check whether our prior beliefs are correct and our decision rule is confirmed by statistical arguments
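A toy version of such a belief-based decision rule for V1 might look as follows. The rule (trust the source with the higher prior belief, report a confidence for the decision) is a deliberate simplification of the belief-network calculus mentioned above; the weights and the combination formula are assumptions for illustration only.

```python
def resolve_conflict(value_a, value_b, belief_a, belief_b):
    """Toy decision rule for a conflicting variable (e.g. V1, gender).

    belief_a / belief_b in [0, 1] encode prior trust in each source.
    If the sources agree, the agreement raises our confidence; if they
    disagree, the more trusted source wins with proportional confidence.
    """
    if value_a == value_b:
        # Both sources agree: value is wrong only if both sources erred.
        return value_a, 1.0 - (1.0 - belief_a) * (1.0 - belief_b)
    total = belief_a + belief_b
    if belief_a >= belief_b:
        return value_a, belief_a / total
    return value_b, belief_b / total
```

The confidence values produced this way are exactly what a strategic sample could validate: compare the rule's decisions against verified ground truth for the sampled records.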
Quality Assessment • V2 (Status) • Input • Coincidence of the matching keys in both datasets • Reliability of the model used for imputation • Measurement technique for the quality of the imputation • Accuracy Assessment • In this case we can apply traditional statistical techniques such as the false classification rate, ROC curves, or simulation
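The false classification rate mentioned above is straightforward to compute in a simulation-style check: hold back known values of V2, impute them, and count disagreements. The data below are invented for illustration.

```python
def false_classification_rate(true_values, imputed_values):
    """Share of imputed values that disagree with the held-back truth
    (a traditional accuracy measure for an imputed variable such as V2)."""
    pairs = list(zip(true_values, imputed_values))
    return sum(t != p for t, p in pairs) / len(pairs)

# Simulation-style evaluation: compare imputations against held-back truth.
truth   = ["emp", "unemp", "emp", "emp",   "unemp"]
imputed = ["emp", "emp",   "emp", "unemp", "unemp"]
rate = false_classification_rate(truth, imputed)   # 2 of 5 disagree
```

For a non-binary status variable, a confusion matrix per category (or a ROC curve for each one-vs-rest decision) would give a more detailed picture than this single rate.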
Conclusions • We have presented a model that allows tighter coupling of quality assessment to the data production process • Such a model seems useful when data production has more degrees of freedom • What data should be used? • What techniques should be used? • The approach allows identification of the different factors influencing quality
Conclusions • It allows the formulation of precise questions about possible alternatives and defines new issues for research in statistical data quality • Hopefully it helps to better understand the added value generated by statistics
Thank you for your attention