Data Quality Assessment and Measurement Laura Sebastian-Coleman, Ph.D., IQCP Optum Data Management EDW April 2014 – AM5 April 28, 8:30 – 11:45
Agenda • Welcome and Thank You for attending! • Agenda • Introductory materials • Abstract • Information about Optum and about me • Presentation sections will follow the outline in the abstract (details in a moment…) • Challenges of measuring data quality • DQ Assessment in context • Initial Assessment Deep Dive • Defining DQ Requirements • Using measurement for improvement • Discussion / Questions • Ground Rules • I will try to stick to the agenda. • But the purpose of being here is to learn from each other, so questions are welcome at any point. • I will balance between the two.
Abstract: Data Quality Assessment and Measurement • Experts agree that to improve data quality, you must be able to measure data quality. But determining what and how to measure is often challenging. The purpose of this tutorial is to provide participants with a comprehensive and adaptable approach to data quality assessment. • The challenges of measuring data quality and how to address them. • DQ assessment in context: Understand the goals and measurement activities and deliverables associated with initial assessment, in-line measurement and control, and periodic reassessment of data. Review a template for capturing results of data analysis from these processes. • Initial Assessment: Review an approach to initial assessment that allows capture of important observations about the condition of data. • Defining DQ requirements: Learn how to define measurable characteristics of data and establish requirements for data quality. Review a template designed to solicit and document clear expectations related to specific dimensions of quality. • Using measurement for improvement: Share examples of measurements that contribute to the ongoing improvement of data quality.
About Optum • Optum is a leading information and technology-enabled health services business dedicated to helping make the health system work better for everyone. • With more than 35,000 people worldwide, Optum delivers intelligent, integrated solutions that modernize the health system and help to improve overall population health. • Optum solutions and services are used at nearly every point in the health care system, from provider selection to diagnosis and treatment, and from network management, administration and payments to the innovation of better medications, therapies and procedures. • Optum clients and partners include those who promote wellness, treat patients, pay for care, conduct research and develop, manage and deliver medications. • With them, Optum is helping to improve the delivery, quality and cost effectiveness of health care.
About me • 10+ years of experience in data quality in the health care industry • Have worked in banking, manufacturing, distribution, commercial insurance, and academia. These experiences have influenced my understanding of data, quality, and measurement. • Published Measuring Data Quality for Ongoing Improvement (2013). • Influences on my thinking about data: • The challenge of how to measure data quality. Addressing this challenge, I have focused on the concept of measurement itself. Any problem of measurement is a microcosm of the general challenge of data definition and collection. • The demands of data warehousing, specifically integrating data from different sources, processing it so that it is prepared for consumption, and helping make it understandable • My thinking about data governance has been influenced by my position within an IT organization. • DAMA says governance is a business function. But I think IT needs to step up as well. • IT takes care of data. Technical and non-technical people would be better off if we all recognized IT as data stewards and if IT acted responsibly to steward data. • The quality of data (esp. in large data assets) depends on data management practices, which are IT’s responsibility. (It depends on other things, too, but data management is critical.) • Complex systems require monitoring and control to detect unexpected changes.
Overview: Challenges of Measuring Data Quality • Lack of consensus about the meaning of key concepts. Specifically, • Data • Data Quality • Measurement/Assessment • The only way to address a lack of consensus about meaning is to propose definitions and work toward consensus. In the next few slides, we will go into depth about the meaning of these terms. • To start: Sometimes the term data quality is used to refer both to the condition of the data and to the activities necessary to support the production of high quality data. I separate these into • The quality of the data / the condition of data • Data quality activities: those required to produce and sustain high quality data • Lack of clear goals and deliverables for the data assessment process • These we will discuss in detail in DQ Assessment in Context. • Lack of a methodology for defining “requirements”, “expectations” and other criteria for the quality of data. These criteria are necessary for measurement. • This challenge we will discuss in detail in Defining Data Quality Requirements.
Assumptions about Data and Data Quality • In today’s world, data is both valuable and complex. • The processes and systems that produce data are also complex. • Many organizations struggle to get value out of their data because • They do not understand their data very well. • They do not trust the systems that produce it. • They think the quality of their data is poor – though they can rarely quantify data quality. • Poor data quality is not solely a technology problem – but we often • Blame technology for the condition of data • And jump to the conclusion that tools can solve DQ problems. They don’t. • Technology is required to manage data and to automate DQ measurement – without automation, comprehensive measurement is not possible. There’s too much data. • Data is something people create. • It does not just exist out in the world to be collected or gathered. • To understand data requires understanding how data is created. • Poor data quality results from a combination of factors related to processes, communications, and systems within and between organizations.
Assumptions, continued…. • Given the importance of data in most organizations, ALL employees have a stewardship role, just as all employees have an obligation not to waste other resources. • Given how embedded data production is in non-technical processes, ALL employees contribute to the condition of data. • Raising awareness of how they contribute will help improve the quality of data. • Sustaining high quality data requires data management, not just technology management. • Data management, like all forms of management, includes knowing what resources you have and using those resources to reach goals and meet objectives. • Technology should be a servant, not a master; a means, not an end; a tail, not a dog. • Producing high quality data requires a combination of technical and business skills (including management skills), knowledge, and vision. • No one can do it alone. • Better data does not happen by magic. It takes work. • People make data. People can make better data. • Why don’t they?
What we want data to be Reasonable Reliable Rational Ready to use A bit technical, but basically comprehensible
How data sometimes seems Powerful. Packed with knowledge. But threatening. And ambiguous. And for those reasons, Interesting… And, of course, Somewhat magical Still…It is difficult to tell whose side data is on; whether it is good or evil.
What data seems to be turning into Big Data is BIG – Monstrous, even. And also powerful & threatening. Moving faster than we can control. Neither rational nor ready to use. And yet … a potential weapon. If only it would behave.
Definition: Data • Data’s Latin root is dare, “to give”; data is the plural of its past participle, datum, and means “something given.” In math and engineering, the terms data and givens are used interchangeably. • The New Oxford American Dictionary (NOAD) defines data as “facts and statistics collected together for reference or analysis.” • ASQ defines data as “A set of collected facts” and identifies two kinds of numerical data: “measured or variable data … and counted or attribute data.” • ISO defines data as “re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing” (ISO 11179). • Observations about the concept of data • Data tries to tell the truth about the world (“facts”) • Data is formal – it has a shape • Data’s function is representational • Data is often about quantities, measurements, and other numeric representations (“facts”) • Things are done with data: reference, analysis, interpretation, processing
Data • Data: Abstract representations of selected characteristics of real-world objects, events, and concepts, expressed and understood through explicitly definable conventions related to their meaning, collection, and storage. • Each piece is important • abstract representations – Not “reality” itself. • of selected characteristics of real-world objects, events, and concepts – Not every characteristic. • expressed through explicitly definable conventions related to their meaning, collection, and storage – Defined in ways that encode meaning. Choices about how to encode are influenced by the ways that data will be created, used, stored, and accessed. • and understood through these conventions – Interpreted through de-coding. • These concepts are clearly at work in systems of measurement and in the scientific concept of data – something that you plan for (designing an experiment) and test for both veracity (are the measurements correct) and purpose (are the measurements telling me what I need to know).
Definition: Data Quality • Data Quality / Quality of Data: • The level of quality of data represents the degree to which data meets the expectations of data consumers, based on their intended use of the data. • Data also serves a semiotic function (it serves as a sign of something other than itself). So data quality is also directly related to the perception of how well data effects (brings about) this representation. • Observations: • High-quality data meets expectations for use and for representational effectiveness to a greater degree than low-quality data. • Assessing the quality of data requires understanding those expectations and determining the degree to which the data meets them. Assessment requires understanding • The concepts the data represents • The processes that created data • The systems through which the data is created • The known and potential uses of the data
Data Quality Activities • The data quality practitioner’s primary function is to help an organization improve and sustain the quality of its data so that it gets optimal value from its data. • Activities that improve and sustain data quality include: • Defining / documenting quality requirements for data • Measuring data to determine the degree to which data meets these requirements • Identifying and remediating root causes of data quality issues • Monitoring the quality of data in order to help sustain quality • Partnering with business process owners and technology owners to improve the production, storage, and use of an organization’s data • Advocating for and modeling a culture committed to quality • Assessment of the condition of data and ongoing measurement of that condition are central to the purpose of a data quality program.
Definition: Measurement • Measurement: The process of measurement is the act of ascertaining the size, amount, or degree of something. • Measuring always involves comparison. Measurements are the results of comparison. • Measurement most often includes a means to quantify the comparison. • Observation: Measurement is both simple and complex. • Simple because we do it all the time and our brains are hard-wired to understand unknown parts of our world in terms of things we know. • Complex because, for those things we have not measured before, we often do not have a basis for comparison, the tools to execute the comparison, or the knowledge to evaluate the results. • If you don’t believe me, imagine trying to understand “temperature” in a world without thermometers. • Measuring the quality of data is perceived as complex or difficult, because we often do not know what we can or should compare data against.
Assessment goes further than measurement Assessment is not just about comparison…it’s about drawing conclusions. Drawing conclusions depends on understanding implications and how to act on them.
Definition: Assessment • Assessment is the process of evaluating or estimating the nature, ability, or quality of a thing. • Data quality assessment is the process of evaluating data to identify errors and understand their implications (Maydanchik, 2007). • Observations about assessment • Like measurement, assessment requires comparison. • Further, assessment implies drawing a conclusion about (that is, evaluating) the object of the assessment, whereas measurement does not always imply this. • But as with data quality measurement, with data assessment, we do not always know what we are comparing data against. For example, how do we know what is wrong? What counts as an “error”?
Measurement/Assessment Measurement is knowing that the temperature outside is 30 degrees F below zero. Assessment is knowing that it’s cold outside. You can act on the implications of an assessment: Get a coat! Or, better yet, stay inside.
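The thermometer example can be sketched in a few lines of Python. This is only an illustration: the function name and the 32 °F cutoff for "cold" are assumptions made here, not something the slide specifies.

```python
def assess_temperature(degrees_f: float, cold_below_f: float = 32.0) -> str:
    """Turn a measurement into a conclusion by comparing it to an expectation.

    cold_below_f is an assumed threshold; any real assessment needs an
    agreed expectation to compare against.
    """
    return "cold" if degrees_f < cold_below_f else "not cold"


measurement = -30.0                            # measurement: 30 degrees F below zero
assessment = assess_temperature(measurement)   # assessment: a conclusion you can act on
```

The measurement alone is just a number; the conclusion only appears once it is compared with a stated expectation.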
Benefits of Measurement • Objective, repeatable way of characterizing the condition of the thing being measured. • For measurement to work, people must understand the meaning of the measurement. • A beginning point for change / improvement of the thing that needs improvement. • A means of confirming improvement has taken place.
Overview: DQ Assessment in Context • Goals: • Understand the goals and measurement activities and deliverables associated with • Initial assessment • In-line measurement and control • Periodic reassessment of data • Review a template for capturing results of data analysis from these processes. Order of information • Challenges of data quality assessment • Overview of the DQAF: Data Quality Assessment Framework • What the DQAF is • The Data Quality dimensions it includes • Relation of DQAF measurement types to data quality dimensions and to specific measurements • Objects of measurement and the data quality lifecycle • Context diagrams and deliverables • Template review
Data Quality Assessment • Ideally, data quality assessment enables you to describe the condition of data in relation to particular expectations, requirements, or purposes in order to draw a conclusion about whether it is suitable for those expectations, requirements, or purposes. • A big challenge: Few organizations articulate expectations related to the expected condition or quality of data. So at the beginning of an assessment process, these expectations may not be known or fully understood. The assessment process includes uncovering and defining expectations. • We envision the process as linear…. • But in most cases, it is iterative and sometimes requires multiple iterations….
Data Quality Assessment • Data assessment includes evaluation of how effectively data represents the objects, events, and concepts it is designed to represent. • If you cannot understand how the data works, it will appear to be of poor quality. • Data Assessment is usually conducted in relation to a set of dimensions of quality that can be used to guide the process, esp. in the absence of clear expectations: • How complete the data is • How well it conforms to defined rules for validity, integrity, and consistency • How it adheres to defined expectations for presentation • Deliverables from an assessment include observations, implications, and recommendations. • Observations: What you see • Implications: What it means • Recommendations: What to do about it
DQAF– Data Quality Assessment Framework • A descriptive taxonomy of measurement types designed to help people measure the quality of their data and use measurement results to manage data. • Conceptual and technology-independent (i.e., it is not a tool) • Generic – it can be applied to any data • Initially defined in 2009 by a multi-disciplinary team from Optum and UHC seeking to establish an effective approach for ongoing measurement of data quality. Basis for Measuring Data Quality for Ongoing Improvement. • Focuses on objective characteristics of data within five quality dimensions: • Completeness • Timeliness • Validity • Consistency • Integrity • Defines measurement types that • Measure characteristics important to most uses of data (i.e., related to the basic meaning of the data) • Represent a reasonable level of IT stewardship of data. That is, types that enable data management.
Using the DQAF • The intention of the DQAF was to provide a comprehensive description of ways to measure. I will describe it this way. • But it does not have to be applied comprehensively. • It can be applied to one attribute or rule. • The goal is to implement an optimal set of specific measurements in a specific system (i.e., implementing all the types should never be the goal of any system). • Implementing an optimal set of specific measurements requires: • Understanding the criticality and risk of data within a system. • Associating critical data with measurement types. • Building the types that will best serve the system by • Providing data consumers a level of assurance that data is sound based on defined expectations • Providing data management teams information that confirms that data moves through the system in expected condition
Using the DQAF • The different kinds of assessment are related to each other. • Initial assessment drives the process by separating data that meets expectations from data that does not and helping identify at risk and critical data for ongoing measurement. • Monitoring and periodic measurement identify data that may cease to meet expectations and data for which there are improvement opportunities. • The concept of data quality dimensions provides the initial organizing principle behind the DQAF: Data Quality Dimension: A data quality dimension is a general, measurable category for a distinctive characteristic (quality) possessed by data. Data quality dimensions function in the way that length, width, and height function to express the size of a physical object. They allow understanding of quality in relation to a scale and in relation to other data measured against the same scale. Data quality dimensions can be used to define expectations (the standards against which to measure) for the quality of a desired dataset, as well as to measure the condition of an existing dataset. Dimensions provide an understanding of why we measure. For example, to understand the level of completeness, validity, and integrity of data.
DQAF Terminology • Measurement Type: • Within the DQAF, a measurement type is a subcategory of a dimension of data quality that allows for a repeatable pattern of measurement to be executed against any data that fits the criteria required by the type, regardless of specific data content. • The measurement results of a particular measurement type can be stored in the same data structure regardless of the data content. • Measurement types describe how measurements are taken, including what data to collect, what comparisons to make, and how to identify anomalies. For example, all measurements of validity can be executed in the same way. Regardless of specific content, validity measurements include collection of data and comparison of values to a specified domain. • Specific Metric: • A specific metric describes particular data that is measured and the way in which it is measured. • Specific metrics describe what is measured. For example, a metric to measure the validity of procedure codes on a medical claim table. Or one to measure the validity of ZIP codes on a customer address table.
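To make the distinction between a measurement type and a specific metric concrete, here is a minimal Python sketch. It is not part of the DQAF itself: the names `ValidityResult` and `measure_validity`, the sample values, and the choice to report 100% for an empty input are all assumptions made for illustration. One function plays the role of the validity measurement type, and the two calls at the end play the role of specific metrics whose results share one structure.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class ValidityResult:
    """One result structure for every validity metric, whatever the content."""
    metric_name: str
    total_count: int
    invalid_count: int

    @property
    def percent_valid(self) -> float:
        if self.total_count == 0:
            return 100.0  # assumption: nothing to check means nothing invalid
        return 100.0 * (self.total_count - self.invalid_count) / self.total_count


def measure_validity(metric_name: str, values: Iterable[str], domain: Set[str]) -> ValidityResult:
    """The measurement type: collect values and compare them to a specified domain."""
    collected = list(values)
    invalid = sum(1 for value in collected if value not in domain)
    return ValidityResult(metric_name, len(collected), invalid)


# Two specific metrics (sample data is made up) using the same measurement type.
zip_result = measure_validity(
    "customer_address.zip_code",
    ["06103", "55343", "ABCDE", "06103"],
    {"06103", "55343"},
)
proc_result = measure_validity(
    "medical_claim.procedure_code",
    ["99213", "99214"],
    {"99213", "99214", "99215"},
)
```

Because both results have the same shape, they could be stored in one results table and compared over time, which is the point of defining the type separately from the metric.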
DQAF Dimension Definitions • Completeness: Completeness is a dimension of data quality. As used in the DQAF, completeness implies having all the necessary or appropriate parts; being entire, finished, total. A data set is complete to the degree that it contains required attributes and a sufficient number of records, and to the degree attributes are populated in accord with data consumer expectations. For data to be complete, at least three conditions must be met: the data set must be defined so that it includes all the attributes desired (width); the data set must contain the desired amount of data (depth); and the attributes must be populated to the extent desired (density). Each of these secondary dimensions of completeness would be measured differently. • Timeliness: Timeliness is a dimension of data quality related to the availability and currency of data. As used in the DQAF, timeliness is associated with data delivery, availability, and processing. Timeliness is the degree to which data conforms to a schedule for being updated and made available. For data to be timely, it must be delivered according to schedule. • Validity: Validity is a dimension of data quality, defined as the degree to which data conforms to stated rules. As used in the DQAF, validity is differentiated from both accuracy and correctness. Validity is the degree to which data conform to a set of business rules, sometimes expressed as a standard or represented within a defined data domain. • Consistency: Consistency is a dimension of data quality. As used in the DQAF, consistency can be thought of as the absence of variety or change. Consistency is the degree to which data conform to an equivalent set of data, usually a set produced under similar conditions or a set produced by the same process over time. • Integrity: Integrity is a dimension of data quality. As used in the DQAF, integrity refers to the state of being whole and undivided or the condition of being unified. Integrity is the degree to which data conform to data relationship rules (as defined by the data model) that are intended to ensure the complete, consistent, and valid presentation of data representing the same concepts. Integrity represents the internal consistency of a data set.
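The three secondary dimensions of completeness (width, depth, density) can each be measured with a different calculation. The sketch below is one possible reading, not the DQAF's own specification: the function name, the treatment of `None` and empty strings as "not populated", and the sample records are assumptions.

```python
def measure_completeness(records, required_attributes, expected_min_rows):
    """Measure width, depth, and density for a list of dict records."""
    # Width: are all the desired attributes defined in the data set?
    present = set().union(*(record.keys() for record in records))
    missing_attributes = [a for a in required_attributes if a not in present]

    # Depth: does the data set contain the desired amount of data?
    depth_met = len(records) >= expected_min_rows

    # Density: to what extent is each attribute populated?
    # Assumption: None and "" both count as not populated.
    density = {}
    for attribute in required_attributes:
        populated = sum(1 for r in records if r.get(attribute) not in (None, ""))
        density[attribute] = populated / len(records) if records else 0.0

    return {
        "missing_attributes": missing_attributes,
        "depth_met": depth_met,
        "density": density,
    }


sample = [
    {"id": 1, "zip": "06103"},
    {"id": 2, "zip": ""},
    {"id": 3, "zip": "55343"},
    {"id": 4, "zip": None},
]
completeness = measure_completeness(sample, ["id", "zip", "state"], expected_min_rows=3)
```

Whether a density of 0.5 for `zip` is acceptable depends on data consumer expectations; the code only produces the measurement.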
DQAF Terminology • Assessment Category: • In the DQAF, an assessment category is a way of grouping measurement types based on where in the data life cycle the assessment is likely to be taken. • Assessment categories pertain to both the frequency of the measurement (periodic or in-line) and the type of assessment involved (control, measurement, assessment). • They include: initial assessment, process control, in-line measurement, periodic measurement, and periodic assessment. • Measurement (or Assessment) Activities: • Measurement activities describe the goals and related actions associated with work carried out within an assessment category. Measurement activities differ depending on when, within the data lifecycle, they are carried out and against which objects of measurement. • Measurement activities correspond closely with DQAF measurement types. • Object of Measurement: • In the DQAF, objects of measurement are groupings of measurement types based on whether types focus on process or content, or on a particular part of a process (e.g., receipt of data) or kind of content (e.g., the data model). • Content-related objects of measurement include: The data model, content based on row counts, content of amount fields, date content, aggregated date content, summarized content, cross-table content (row counts, aggregated dates, amount fields, chronology), overall database content. • Process-related objects of measurement include: Receipt of data, condition of data upon receipt, adherence to schedule, data processing
Functions in Assessment: Collect, Calculate, Compare, Conclude Use DQAF dimensions to help with this process and measurement types to help with this process
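The four functions can be sketched as steps in a single Python function. This is a hypothetical example, not a DQAF measurement type as published: it checks one load's row count against past loads, and the three-standard-deviation tolerance, the function name, and the sample history are assumptions. It needs at least two past counts, because `statistics.stdev` requires them.

```python
from statistics import mean, stdev


def assess_row_count(current_count, past_counts, tolerance=3.0):
    """Collect, calculate, compare, conclude for one row-count measurement."""
    # Collect: the inputs are the current count and the history of past counts.
    # Calculate: summarize the history.
    average = mean(past_counts)
    spread = stdev(past_counts)
    # Compare: how far is the current count from what history suggests?
    deviation = abs(current_count - average)
    # Conclude: turn the comparison into something a person can act on.
    # Assumption: more than `tolerance` standard deviations is worth a look.
    conclusion = "as expected" if deviation <= tolerance * spread else "investigate"
    return {"average": average, "deviation": deviation, "conclusion": conclusion}


history = [1000, 1020, 980, 1010, 990]
normal_load = assess_row_count(1005, history)
short_load = assess_row_count(600, history)
```

The first three steps are measurement; only the last step is assessment, and it depends entirely on the expectation encoded in the tolerance.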
Results of Data Assessment • The following three slides associate deliverables from each of the measurement activities. • Through these deliverables…. • Metadata is produced, including: • Expectations related to the quality of data, based on dimensions of quality • Objective description of the condition of data compared to those expectations • Documentation of the relation of data’s condition to processes and systems – rules, risks, relationships • Data and process improvement opportunities can be identified and quantified, so that decisions can be made about which ones to address.
Initial Assessment: Capturing Observations and Conclusions about the Condition of Data
Deep Dive on Initial Assessment: Data Analysis Results Template • One of the challenges in data quality measurement is a lack of clear goals and deliverables for the data assessment process. I hope the preceding materials can help you clarify your goals for any measurement activities within an assessment. • Data Analysis Template should help you formulate your deliverable. • Components • Analysis Protocol Checklist • Observation Sheet • Supporting components – purpose and usage, content overview, definitions of terms, etc. • Summarized analysis questions • Show template now…
Analysis Protocol Checklist • A tool to enable analysts to execute data profiling in a consistent way. • Describes the actions that should be taken during any data analysis sequence. • Includes prompts and questions that help guide analysts in discovering potential risks within source data. • Although the list includes a set of discrete actions that can be described individually, many of these can be executed simultaneously; for example, when reviewing the cardinality of a small set of valid values, analysts can and should be assessing the reasonability of the distribution of values. • The checklist ensures that nothing is missed when data is profiled.
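As a small illustration of doing two checklist actions at once, the Python sketch below computes the cardinality of a column and the distribution of its values in one pass. The function name and the sample values are made up for this example; judging whether the distribution is reasonable still falls to the analyst.

```python
from collections import Counter


def profile_column(values):
    """Return the cardinality and the value distribution of one column."""
    counts = Counter(values)
    total = len(values)
    distribution = {
        value: count / total for value, count in counts.most_common()
    } if total else {}
    return {"cardinality": len(counts), "distribution": distribution}


# Hypothetical column with a small set of valid values.
gender_codes = ["M", "F", "F", "M", "U", "F", "F", "F"]
profile = profile_column(gender_codes)
```

Here the cardinality (3) says the column has few distinct values, and the distribution shows how the records split across them, which is what an analyst would review for reasonability.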
Observation List • Designed to capture discrete, specific observations for knowledge sharing purposes. • Observations can be made at the column, table, file, or source level. • Observations will be used to inform other people about the condition of data and will be repurposed as metadata. Observations should be formulated with these ends in mind. • Each observation is recorded and associated with a relevancy category, so that its importance is understood and can be confirmed.
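One way to picture an observation record is as a small structured object that can later be repurposed as metadata. The Python sketch below is an assumption about shape only: the field names, the relevancy value "high", and the sample text are hypothetical, and the actual template may organize this differently. The four levels come from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    COLUMN = "column"
    TABLE = "table"
    FILE = "file"
    SOURCE = "source"


@dataclass(frozen=True)
class Observation:
    level: Level            # where the observation was made
    object_name: str        # the column, table, file, or source observed
    observation: str        # what you see
    relevancy: str          # hypothetical category label, e.g. "high"
    implication: str = ""   # what it means
    recommendation: str = ""  # what to do about it


def to_metadata(obs: Observation) -> dict:
    """Flatten an observation into a plain dict suitable for a metadata store."""
    return {
        "level": obs.level.value,
        "object_name": obs.object_name,
        "observation": obs.observation,
        "relevancy": obs.relevancy,
        "implication": obs.implication,
        "recommendation": obs.recommendation,
    }


example = Observation(
    level=Level.COLUMN,
    object_name="customer_address.zip_code",
    observation="25% of sampled values are not in the ZIP code domain.",
    relevancy="high",
)
metadata_row = to_metadata(example)
```

Keeping each observation discrete and attached to a named object is what makes it reusable by people who were not part of the original analysis.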