
Towards a Process Oriented View on Statistical Data Quality

Explore new approaches towards data quality assessment through a process-oriented view. Understand the impact of different production steps on final product quality using a statistical workflow model. Learn about data integration techniques and quality assessment functions. Discover the blend of business and scientific workflows for effective quality evaluation.



Presentation Transcript


  1. Towards a Process Oriented View on Statistical Data Quality Michaela Denk, Wilfried Grossmann

  2. Contents • Approaches Towards Data Quality • Example Data Integration • A Generic Statistical Workflow Model • Quality Assessment • Conclusions Grossmann, Denk

  3. Approaches Towards Data Quality • The usual approach towards data quality is the Reporting View • Define a number of so-called quality dimensions and evaluate the final product according to criteria for these dimensions • Some frequently used dimensions: • Accuracy, Relevance, Accessibility, Timeliness, Coherence, Comparability, ...

  4. Approaches Towards Data Quality • These dimensions are often broken down into sub-dimensions • Example Accuracy: • Sampling Effects, Representativity, Over-Coverage, Under-Coverage, Missing Values, Imputation Error, ... • Such an approach is fine as long as the production of data follows a predefined scheme with limited degrees of freedom

  5. Approaches Towards Data Quality • If there are several different options for data production, such an approach is probably not the best one • Compare the ideas of Total Quality Management (TQM) in industrial production: • Systematic treatment of the influence of different production steps on the quality of the final product • We need a Processing View on data quality: How is data quality influenced by production?

  6. Approaches Towards Data Quality • How can we arrive at a Processing View on data quality? • We need a statistical workflow model • We have to organize the processing information necessary for quality assessment in an appropriate way • Compare (old) ideas of B. Sundgren about the capture of metadata

  7. Approaches Towards Data Quality • We need functions for assessing quality • Output_Quality = F(Input_Quality, Processing_Quality) • Such functions have to be applied according to • The object we are interested in, e.g. a variable, a population, or a classification • The quality aspect we are interested in
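
The relation Output_Quality = F(Input_Quality, Processing_Quality) can be illustrated with a minimal sketch. The slides do not specify F; the version below assumes that quality is a score in [0, 1] (for example, the share of correct values) and that input errors and processing errors are independent, so the scores multiply. The `Quality` record and the `output_quality` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quality:
    """Hypothetical quality record for one object and one quality aspect."""
    obj: str      # the object of interest, e.g. a variable or a population
    aspect: str   # the quality aspect of interest, e.g. "accuracy"
    score: float  # assumed to lie in [0, 1]


def output_quality(input_q: Quality, processing_score: float) -> Quality:
    """One illustrative F: multiply scores, assuming independent errors."""
    if not 0.0 <= processing_score <= 1.0:
        raise ValueError("processing_score must lie in [0, 1]")
    return Quality(input_q.obj, input_q.aspect, input_q.score * processing_score)


# Toy numbers, chosen only for illustration.
gender_in = Quality("variable:gender", "accuracy", 0.98)
gender_out = output_quality(gender_in, 0.95)
print(gender_out)
```

Other choices of F (for example, a minimum or a model-based estimate) fit the same signature; which one is appropriate depends on the object and the quality aspect in question.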

  8. Example Data Integration • Data integration occurs frequently in statistical data production, in particular when data are produced from administrative sources • It uses a number of operations usually understood as data pre-processing • Basic goal: Combine information from two or more already existing data sets

  9. Example Data Integration • Example of a data integration dataflow • Input → Integration → Post-alignment

  10. Example Data Integration • Top level task description • Match the datasets according to the matching key • Align V1 (gender) • Align V2 (status)
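
The three top-level tasks can be sketched in a few lines of plain Python. Everything in this sketch is an assumption made for illustration: the toy records, the key format, the rule that a gender conflict is left unresolved, and the placeholder value used to fill a missing status. The slides do not prescribe any of these choices.

```python
# Two hypothetical sources keyed by a person identifier (toy data).
source_a = {
    "p1": {"gender": "F", "status": "employed"},
    "p2": {"gender": "M", "status": None},
    "p3": {"gender": "F", "status": "retired"},
}
source_b = {
    "p1": {"gender": "F"},
    "p2": {"gender": "F"},  # conflicts with source_a
    "p4": {"gender": "M"},
}


def integrate(a, b, default_status="unknown"):
    """Match on the shared key, then align V1 (gender) and V2 (status)."""
    result = {}
    for key in sorted(a.keys() & b.keys()):       # matching step
        g_a, g_b = a[key]["gender"], b[key]["gender"]
        gender = g_a if g_a == g_b else None       # conflict left unresolved
        status = a[key]["status"]
        imputed = status is None
        if imputed:
            status = default_status                # placeholder imputation
        result[key] = {
            "gender": gender,
            "gender_conflict": g_a != g_b,
            "status": status,
            "status_imputed": imputed,
        }
    return result


integrated = integrate(source_a, source_b)
for key, record in integrated.items():
    print(key, record)
```

Keeping the `gender_conflict` and `status_imputed` flags on each record preserves the processing information that a later quality assessment would need.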

  11. Example Data Integration • Details, Decisions to be made • Are datasets appropriate? • Quality of matching keys • Quality of data sources • Method for identification of matches? • Method for handling ambiguities in V1 (Gender)? • Method for imputation of V2 (Status)? • How is quality measured? • At level of a summary measure? • At level of a specific variable? • At level of individual records?
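
The three measurement levels named above (individual records, a specific variable, a summary measure) can be made concrete with a small sketch. It assumes that each integrated record carries boolean problem flags; the flag names and the toy records are hypothetical, and the chosen indicators are only one of many possible definitions.

```python
# Hypothetical integrated records with per-record problem flags (toy data).
records = [
    {"id": "p1", "gender_conflict": False, "status_imputed": False},
    {"id": "p2", "gender_conflict": True, "status_imputed": True},
    {"id": "p3", "gender_conflict": False, "status_imputed": True},
    {"id": "p4", "gender_conflict": False, "status_imputed": False},
]
FLAGS = ("gender_conflict", "status_imputed")

# Record level: number of flags raised for each record.
record_level = {r["id"]: sum(r[f] for f in FLAGS) for r in records}

# Variable level: share of records flagged for each variable.
variable_level = {f: sum(r[f] for r in records) / len(records) for f in FLAGS}

# Summary level: share of records without any flag.
summary = sum(1 for n in record_level.values() if n == 0) / len(records)

print(record_level)
print(variable_level)
print(summary)
```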

  12. Example Data Integration • There are no generally accepted standard tools and methods for answering such questions • Probably we have to compare a number of alternative approaches • Apply the generic format for different datasets • Try different statistical methods and models • Use different methods for quality assessment • Traditional formulas • Simulation-based evaluation • Assessment by using strategic surveys

  13. Example Data Integration • Conclusion • Different statistical methods may be an essential part of data production and quality assessment • There is no longer such a clear distinction between “objective” data collection and statistical analysis • Statistics generates added value beyond (administrative) accounting and IT

  14. A Generic Statistical Workflow Model • Statistical Workflow: A mixture of • Business Workflow (Process oriented) • Scientific Workflow (Data oriented) • Quality evaluation is the main control element of the process • We have to consider the workflow at two levels • Meta-level (Control of the process) • Data-level (Production of data)

  15. A Generic Statistical Workflow Model • Building blocks of the workflow model • Transformations (Basic data operations) • Process components (Tasks) defined by: • Task definition • Pre-Alignment • Feasibility Check • Main Transformation • Post-Alignment • Quality Evaluation • Workflow (Sequence of Process components)
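
The building blocks listed above can be sketched as a small data structure. This is a minimal illustration and not the authors' model: the class `ProcessComponent`, the function `run_workflow`, and the toy component are hypothetical, and each step is reduced to a plain function on a list of records.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Data = List[Dict[str, Any]]


@dataclass
class ProcessComponent:
    """One task: the steps named on the slide, each as a plain function."""
    task: str                           # task definition (here only a label)
    pre_align: Callable[[Data], Data]
    feasible: Callable[[Data], bool]    # feasibility check
    transform: Callable[[Data], Data]   # main transformation
    post_align: Callable[[Data], Data]
    evaluate: Callable[[Data], float]   # quality evaluation

    def run(self, data: Data) -> Tuple[Data, float]:
        data = self.pre_align(data)
        if not self.feasible(data):
            raise ValueError(f"task {self.task!r} is not feasible on this input")
        data = self.post_align(self.transform(data))
        return data, self.evaluate(data)


def run_workflow(components: List[ProcessComponent], data: Data):
    """Workflow = sequence of components; quality is recorded per task."""
    report = {}
    for component in components:
        data, report[component.task] = component.run(data)
    return data, report


def identity(d: Data) -> Data:
    return d


# Toy component: fill missing status values, then report the share of
# records that did not need the placeholder.
complete_status = ProcessComponent(
    task="complete_status",
    pre_align=identity,
    feasible=lambda d: len(d) > 0,
    transform=lambda d: [dict(r, status=r["status"] or "unknown") for r in d],
    post_align=identity,
    evaluate=lambda d: sum(r["status"] != "unknown" for r in d) / len(d),
)

data, report = run_workflow(
    [complete_status], [{"status": "employed"}, {"status": None}]
)
print(data)
print(report)
```

The `report` dictionary plays the role of the meta-level here: it carries the quality evaluation of each task, while `data` carries the data-level result.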

  16. A Generic Statistical Workflow Model • Example of a data integration component workflow

  17. A Generic Statistical Workflow Model • In order to understand how statistics influences the boxes and data quality, let us zoom into the box for post-alignment

  18. Quality Assessment • For quality assessment we need a detailed description of the changes in meta-information during the dataflow

  19. Quality Assessment • Example of meta-information flow in data integration • Details for the register-based census are given in the presentation of Fiedler/Lenk in Session 26 (Thursday)

  20. Quality Assessment • Example: • Assessment of the accuracy of variables V1 (Gender) and V2 (Status) in the example

  21. Quality Assessment • V1 (Gender) • Input • Coincidence of matching keys in both datasets • Matching of the variable Gender in both datasets • Beliefs about the quality of the variable in both sources • Accuracy Assessment • It seems that models developed in decision analysis (calculus from belief networks) are appropriate • Alternatively, we can use a strategic sample to check whether our prior beliefs are correct and our decision rule is confirmed by statistical arguments
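
How beliefs about source quality can be turned into an accuracy statement for V1 is sketched below with a plain application of Bayes' rule. This is a much simpler stand-in for the belief-network calculus mentioned on the slide. It assumes a two-category coding ("F"/"M"), sources that err independently, and known per-source accuracies; the function name and all numbers are hypothetical.

```python
def posterior_female(report_a, report_b, acc_a, acc_b, prior_f=0.5):
    """Posterior probability that the true value is "F", given both reports.

    acc_a and acc_b are the assumed probabilities that each source reports
    the true value; errors are assumed to be independent across sources.
    """
    def likelihood(true_value):
        l_a = acc_a if report_a == true_value else 1 - acc_a
        l_b = acc_b if report_b == true_value else 1 - acc_b
        return l_a * l_b

    numerator = prior_f * likelihood("F")
    denominator = numerator + (1 - prior_f) * likelihood("M")
    return numerator / denominator


# Conflicting reports: the more reliable source dominates the posterior.
p_conflict = posterior_female("F", "M", acc_a=0.99, acc_b=0.90)
# Agreeing reports: the posterior exceeds either source's accuracy.
p_agree = posterior_female("F", "F", acc_a=0.99, acc_b=0.90)
print(p_conflict, p_agree)
```

A decision rule such as "accept F if the posterior exceeds 0.9" could then be checked against a strategic sample, as the slide suggests.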

  22. Quality Assessment • V2 (Status): • Input • Coincidence of matching keys in both datasets • Reliability of the model used for imputation • Measurement technique for quality of imputation • Accuracy Assessment • In this case we can apply traditional statistical techniques such as the false classification rate, the ROC curve, and simulation
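
A false classification rate for an imputation model can be estimated by hiding known values, imputing them, and comparing the results with the truth. The sketch below assumes a deliberately simple imputation rule (the most frequent status within a hypothetical auxiliary variable, age group) and a deterministic split of toy records into a fitting part and a masked part; none of this comes from the slides.

```python
from collections import Counter, defaultdict

# Toy records: (age_group, status). Both variables are hypothetical.
records = [
    ("young", "student"),
    ("young", "student"),
    ("young", "employed"),
    ("old", "retired"),
    ("old", "retired"),
    ("old", "employed"),
    ("young", "student"),
    ("old", "retired"),
]


def fit_mode_imputer(train):
    """Return the most frequent status per age group."""
    counts = defaultdict(Counter)
    for group, status in train:
        counts[group][status] += 1
    return {g: c.most_common(1)[0][0] for g, c in counts.items()}


def false_classification_rate(true_values, predicted):
    """Share of imputed values that differ from the known true value."""
    wrong = sum(t != p for t, p in zip(true_values, predicted))
    return wrong / len(true_values)


# Deterministic split: mask the status of every second record.
masked = records[0::2]
train = records[1::2]

model = fit_mode_imputer(train)
imputed = [model[group] for group, _ in masked]
rate = false_classification_rate([status for _, status in masked], imputed)
print(model)
print(rate)
```

Repeating this with many random masks instead of one fixed split gives a simulation-based estimate of the same rate together with its variability.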

  23. Conclusions • We have presented a model that allows a tighter coupling of quality assessment to the data production process • Such a model seems useful if data production has more degrees of freedom • What data should be used? • What techniques should be used? • The approach allows identification of the different factors influencing quality

  24. Conclusions • It allows the formulation of precise questions about possible alternatives and defines new issues for research in statistical data quality • Hopefully, it helps to better understand the added value generated by statistics

  25. Thank you for your attention
