Beyond Accuracy: What Data Quality Means to Data Consumers
CMPT 455/826 - Week 1, Day 1 (based on R.Y. Wang & D.M. Strong)
Sept - Dec 2009 - w1d1
Basic Premise
• Many databases are not error-free
• Data quality problems go beyond accuracy to include other aspects such as completeness and accessibility
Data Quality
• The authors define "data quality" as
  • data that are fit for use by data consumers
• Challenge: isn’t that “quality data” rather than data quality?
• There are many problems with the use and misuse of the term “quality” (discussed further on the next slide)
Quality
• Unfortunately, "quality" is a word that has many meanings depending on a person's perspective.
• When “quality” is used as a noun, it refers to some attribute or feature of a thing without regard to any evaluation of whether that attribute is good or bad. Systems may be described in terms of an infinite number of noun qualities.
• When “quality” is used as an adjective, it refers to a favorable evaluation of the thing to which it refers. There are an infinite number of bases for evaluating adjectival qualities. Despite all being favorable, some of the types of adjectival quality do not have an objective basis. The quality of a given object may not be quantifiable without relating it to the quality of some other object.
Data Quality Dimensions
• The authors define a "data quality dimension" as
  • a set of data quality attributes
  • that represent a single aspect or construct
  • of data quality.
• This may include some data quality attributes, such as:
  • accuracy, timeliness, precision, reliability, currency, completeness, relevance, accessibility, and interpretability
• Please note that the “data quality dimension” is something beyond other important data dimensions we are already familiar with:
  • “data value” – the data that is actually stored in our database
  • “data format” – the data structure that is used by our database to store the data value
Dimensions and Attributes
• Opportunity: The authors assume that the reader understands the distinction between dimensions and attributes.
• Attributes:
  • are defined in the data definition of a database
  • contain identifiable components of the data in the database
  • are something that you should all be used to (before taking this class)
Dimensions and Attributes
• Dimensions:
  • help us to organize the data
    • by organizing data based on general concepts
    • e.g. location, customers, products, finances, time
  • help us recognize similar purposes for the data
    • by involving / combining different attributes of data
    • NOTE: different attributes may have different granularities, e.g. a “location” dimension can include attributes: city, province, country
    • NOTE: some attributes may work in combination, e.g. first and last names
  • may have further characteristics such as:
    • their own data about the data (which is referred to as “meta-data”)
    • their own particular structure, ordering, and/or (sub)dimensions
    • potentially sharing data/attributes with other dimensions
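As a rough illustration of how a dimension combines attributes of different granularities, here is a minimal Python sketch. The `Dimension` class, its `roll_up` method, and the sample record are all hypothetical names invented for this example; only the "location" dimension and its attributes (city, province, country) come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """A named grouping of attributes, ordered from finest to coarsest granularity."""
    name: str
    attributes: tuple  # attribute names, finest granularity first

    def roll_up(self, record: dict, level: str) -> tuple:
        """Return the record's values from `level` up to the coarsest attribute."""
        start = self.attributes.index(level)
        return tuple(record[a] for a in self.attributes[start:])


# Hypothetical "location" dimension built from the attributes named on the slide.
location = Dimension("location", ("city", "province", "country"))

# Invented sample record; "sales" is an attribute outside the location dimension.
record = {"city": "Saskatoon", "province": "Saskatchewan",
          "country": "Canada", "sales": 120}

print(location.roll_up(record, "city"))      # finest granularity
print(location.roll_up(record, "province"))  # coarser granularity
```

The same record can be viewed at city level or at province level without changing the stored attributes, which is one way to read the slide's note about granularity.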
Dimensions and Attributes
• Dimensions are a/the MAJOR FOCUS of this course
  • so if they are not clear yet, don’t worry
  • but if they are not clear by the end of the course, then you should worry
• Now back to our consideration of this introductory paper
  • in this consideration, please note all the different possible concepts that we should consider along with the data itself
Hypothesis
• Their preliminary conceptual framework for data quality:
  • The data must be accessible to the data consumer.
    • For example, the consumer knows how to retrieve the data.
  • The consumer must be able to interpret the data.
    • For example, the data are not represented in a foreign language.
  • The data must be relevant to the consumer.
    • For example, data are relevant and timely for use by the data consumer in the decision-making process.
  • The consumer must find the data accurate.
    • For example, the data are correct, objective, and come from reputable sources.
Quality Framework
• Challenge: The authors did not research major relevant quality frameworks.
• ISO 9126-1 Software engineering – Software product quality – Quality characteristics and sub-characteristics
  • “categorizes the attributes of software quality into six characteristics, which are further subdivided”:
    • Functionality
    • Reliability
    • Usability
    • Efficiency
    • Maintainability
    • Portability
Functionality
• “the capability of the software to provide functions which meet stated and implied needs when the software is used under specified conditions.” [ISO 9126-1]
• includes:
  • suitability, which evaluates how system functions meet the needs of user tasks
  • accuracy, which evaluates the achievement of the right results
  • interoperability, which evaluates interactions with other systems
  • security, which evaluates the ability of the system to withstand unauthorized accesses and modifications
Reliability
• “the capability of the software to maintain the level of performance of the system when used under specified conditions”. [ISO 9126-1]
• includes:
  • maturity, which evaluates the ability of the system to avoid failures, regardless of any faults it has
  • fault tolerance, which evaluates the capability of the system to maintain a suitable level of performance in spite of faults or other difficulties
  • recoverability, which evaluates the ability of the system to recover its data and performance after a failure
Usability
• “the capability of the software to be understood, learned, used and liked by the user, when used under specified conditions”. [ISO 9126-1]
• includes:
  • understandability, which evaluates the ability of users to understand how, when, and where to use the system
  • learnability, which evaluates the ability (including the effort required) for users to learn how to use the system
  • operability, which evaluates the ability of the product to be used and controlled by the user
  • attractiveness, which evaluates the ability of the product to be “liked” by users
Efficiency
• “the capability of the software to provide the required performance, relative to the amount of resources used, under stated conditions”. [ISO 9126-1]
• includes:
  • time behaviour, which evaluates the appropriateness of response and processing times of the system
  • resource utilization, which evaluates the use of resources in performing system functions
Maintainability
• “the capability of the software to be modified”. [ISO 9126-1]
• includes:
  • analysability, which evaluates the ability to identify problems in the system
  • changeability, which evaluates the ability to implement modifications to the system
  • stability, which evaluates the ability to minimize undesired side effects of modifications
  • testability, which evaluates the ability to validate modified software
Portability
• “the capability of software to be transferred from one environment to another”. [ISO 9126-1]
• includes:
  • adaptability, which evaluates the ability to modify software via features rather than reprogramming to meet the needs of different environments
  • installability, which evaluates the ability to install software in a given environment
  • co-existence, which evaluates the ability of the software to share common resources with other installed software
  • replaceability, which evaluates the ability of software to replace other software
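The six characteristics and the sub-characteristics listed on the preceding slides form a simple two-level taxonomy. A minimal Python sketch of it follows; the dictionary holds only the sub-characteristics named on these slides (not necessarily the complete list in the standard), and the names `ISO_9126_1` and `characteristic_of` are hypothetical, chosen for this example.

```python
# Two-level taxonomy as presented on the slides: characteristic -> sub-characteristics.
ISO_9126_1 = {
    "functionality": ["suitability", "accuracy", "interoperability", "security"],
    "reliability": ["maturity", "fault tolerance", "recoverability"],
    "usability": ["understandability", "learnability", "operability",
                  "attractiveness"],
    "efficiency": ["time behaviour", "resource utilization"],
    "maintainability": ["analysability", "changeability", "stability",
                        "testability"],
    "portability": ["adaptability", "installability", "co-existence",
                    "replaceability"],
}


def characteristic_of(sub_characteristic: str) -> str:
    """Look up which top-level characteristic a sub-characteristic belongs to."""
    for characteristic, subs in ISO_9126_1.items():
        if sub_characteristic in subs:
            return characteristic
    raise KeyError(sub_characteristic)


print(characteristic_of("accuracy"))
print(characteristic_of("recoverability"))
```

Note that "accuracy" appears here as a sub-characteristic of functionality, whereas the authors' hypothesised framework treats accuracy as one of its four top-level components.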
Quality characteristics
• The “Quality characteristics and sub-characteristics” of ISO 9126-1
  • are a number of sub-dimensions of the data quality dimension
• So are the various “data quality attributes” of the authors
  • (accuracy, timeliness, precision, reliability, currency, completeness, relevance, accessibility, and interpretability)
• A “dimension” only becomes an attribute when it is recorded with the data
  • (as metadata that can be used computationally)
• It is important to try to be precise in what we are saying
  • That way we can help clarify all these concepts
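One way to read the point about a dimension becoming an attribute: timeliness stays an abstract concern until something like a timestamp is stored with the value, at which point it can be computed. The Python sketch below assumes exactly that; `StockLevel`, `recorded_at`, `is_timely`, and the sample values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class StockLevel:
    """A data value plus one quality-related attribute stored alongside it."""
    item: str
    quantity: int
    recorded_at: datetime  # metadata: when the value was captured


def is_timely(level: StockLevel, now: datetime, max_age: timedelta) -> bool:
    """True if the value is recent enough for the consumer's purpose."""
    return now - level.recorded_at <= max_age


# Invented sample data with fixed timestamps so the result is reproducible.
level = StockLevel("widget", 40, recorded_at=datetime(2009, 9, 8, 9, 0))
now = datetime(2009, 9, 8, 15, 0)  # six hours after the value was recorded

print(is_timely(level, now, max_age=timedelta(hours=24)))  # e.g. a daily report
print(is_timely(level, now, max_age=timedelta(hours=1)))   # e.g. live ordering
```

The same stored value is timely for one consumer and stale for another, which fits the idea that quality depends on fitness for use by a particular data consumer.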
On being precise
• English is a very imprecise language
  • and it is very possible for different people to have different expectations of the same concept
  • e.g. ISO 9241-11 has a very different definition of “usability” from ISO 9126-1
    • (guess which one I use more regularly)
• Most people expect data to be precise
  • There are problems when it is not what we expect
  • Given a weather forecast for a high of 30, think how a Canadian and an American will dress
  • But given a forecast for 30°F, how will they dress?
• Sometimes we need metadata to help interpret data
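The weather example can be sketched in a few lines of Python: a bare 30 is ambiguous, while a value stored with its unit can be interpreted by any consumer. The `Temperature` class and its `unit` field are hypothetical names for this illustration; the conversion is the standard Fahrenheit-to-Celsius formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Temperature:
    """A temperature value stored together with its unit (the metadata)."""
    value: float
    unit: str  # "C" for Celsius or "F" for Fahrenheit

    def to_celsius(self) -> float:
        """Interpret the value using its unit and return degrees Celsius."""
        if self.unit == "C":
            return self.value
        if self.unit == "F":
            return (self.value - 32) * 5 / 9
        raise ValueError(f"unknown unit: {self.unit!r}")


# The same number, two very different days.
print(round(Temperature(30, "C").to_celsius(), 1))  # a hot summer day
print(round(Temperature(30, "F").to_celsius(), 1))  # just below freezing
```

Without the `unit` field, neither reader could tell which of the two interpretations was intended.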
Their “Research”
• 1st survey identified 179 attributes
• 2nd survey was analyzed by factor analysis to group attributes into 20 “intermediate dimensions”
• Then they moved these 20 into the 4 components of their hypothesised framework
• Finally they revised the names of 2 of their 4 framework components
So why this paper?
• Not because of its (dubious) research methodology
  • Where “research data” is forced into a preconceived hypothesis
  • Where quality attributes are investigated out of any specific context
• This paper
  • Identifies many different concerns regarding information
    • Including the need to contextualize it
  • Demonstrates that we need to develop approaches to help
    • Design for quality data (whatever that means)
    • Identify the qualities that are important to our users
    • Justify (and then evaluate) our efforts at achieving quality
  • Provides a basis for examples of challenges and opportunities
What about future papers?
• All of the papers for this course
  • have some good points and some failings (like we all do)
  • are designed to make you think
  • can help you to develop better data / information / knowledge systems
• But none of the papers
  • have all the answers – or –
  • are a how-to cookbook
• So we have to work to figure out how to apply them