Data Quality on the Semantic Web Behshid Behkamal Date: 1388/11/14
Table Of Contents
• Data Quality Definition
• Data Quality Dimensions
• Data Quality Model
• Quality of Linked Data

Some definitions
• data quality: degree to which the characteristics of data satisfy stated and implied needs when used under specified conditions
• data quality characteristic: category of data quality attributes that bears on data quality
• data quality measure: variable to which a value is assigned as the result of measurement of a data quality characteristic

Classification of Data Quality Problems
• Data Quality Problem
  - Single Source Problem
    - Schema Related
    - Instance Specific
  - Multi Source Problem
    - Schema Related
    - Instance Specific
• Each category covers problems at the level of: Attribute, Record, Record Type, Source

Single Source Problem - Schema Related
Multi Source Problem - Schema Related
Measuring Data Quality in Data Warehousing – 2001 [1]
Data Quality Dimensions – 2003 [2]
• Task independent: reflect states of the data without the contextual knowledge of the application, and can be applied to any data set, regardless of the tasks at hand.
• Task dependent: include the organization's business rules, company and government regulations, and constraints provided by the database administrator; they are developed in specific application contexts.

Task Dependent / Task Independent
Dimensions of Data Quality – 2005 [3]
• Process: dimensions of DQ related to the generation, assembly, description and maintenance of data
  - Reliability (with several sub-dimensions), Metadata, Security and Confidentiality
• Data: dimensions of DQ specifically associated with the data themselves
  - Record/table level: Accuracy, Completeness, Consistency and Validity
  - Database level: Identifiability and Joinability
• User: dimensions of DQ related to use and users
  - Accessibility, Interpretability, Relevance and Timeliness

Dimensions of Data Quality – 2006 [4]
• Depth of Data Quality
  - Accuracy
  - Completeness
  - Validity
  - Currentness
• Width of Data Quality
  - Consistency
  - Integration

Dimensions of Data Quality – 2008 [5]
• User Base: Consistent representation, Interpretability, Ease of understanding, Concise representation, Timeliness, Completeness, Value-added, Relevance, Appropriate, Meaningfulness, Lack of confusion, Arrangement, Readable, Reasonable
• System: Data Deficiency, Design Deficiencies, Operation Deficiencies
• Inherent IQ: Accuracy, Cost, Objectivity, Believability, Reputation, Accessibility, Correctness, Unambiguous, Consistency
• Intuitive: Precision, Reliability, Freedom from bias

ISO/IEC 25012 Data Quality Model – 2008 [6]
The ISO/IEC 25012 data quality model categorizes quality attributes into fifteen characteristics considered from two points of view:
• Inherent data quality refers to the data itself, in particular to:
  - data domain values and possible restrictions
  - relationships of data values
  - metadata
• System dependent data quality depends on the technological domain in which data are used:
  - computer systems' components such as hardware devices (precision)
  - computer system software (recoverability)
  - other software (portability)

Inherent data quality
From the inherent point of view, data quality refers to the data itself, in particular to:
• data domain values and possible restrictions (e.g. business rules governing the quality required for the characteristic in a given application);
• relationships of data values (e.g. consistency);
• metadata.

System dependent data quality
System dependent data quality refers to the degree to which data quality is reached and preserved within a computer system when data is used under specified conditions. From this point of view data quality depends on the technological domain in which data are used; it is achieved by the capabilities of:
• computer systems' components such as hardware devices (e.g. to make data available or to obtain the required precision),
• computer system software (e.g. backup software to achieve recoverability),
• other software (e.g. migration tools to achieve portability).

1. Accuracy
• The degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.
• Syntactic accuracy
• Semantic accuracy
• Measurement Function A/B, with alternative definitions of A and B:
  - A: records in which all attributes are accurate; B: total records in a dataset
  - A: number of records with the specified field syntactically accurate; B: number of records
  - A: attribute values that are accurate; B: records × attributes

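The record-level and attribute-level accuracy ratios can be sketched as follows. This is a minimal illustration, assuming a reference ("true") value is available for every field; the record layout and the function names are hypothetical and not part of any cited model.

```python
from typing import Dict, List

Record = Dict[str, object]


def record_accuracy(records: List[Record], truth: List[Record]) -> float:
    """A/B with A = records whose attributes all match the reference, B = total records."""
    if not records:
        return 0.0
    accurate = sum(1 for rec, ref in zip(records, truth) if rec == ref)
    return accurate / len(records)


def attribute_accuracy(records: List[Record], truth: List[Record]) -> float:
    """A/B with A = accurate attribute values, B = records x attributes."""
    total = sum(len(ref) for ref in truth)
    if total == 0:
        return 0.0
    accurate = sum(
        1
        for rec, ref in zip(records, truth)
        for key, value in ref.items()
        if rec.get(key) == value
    )
    return accurate / total


# Hypothetical data: the second record has a wrong age.
observed = [
    {"name": "Ada", "age": 36},
    {"name": "Alan", "age": 14},
]
reference = [
    {"name": "Ada", "age": 36},
    {"name": "Alan", "age": 41},
]

print(record_accuracy(observed, reference))     # 0.5
print(attribute_accuracy(observed, reference))  # 0.75
```

The two ratios differ because a single wrong field makes the whole record inaccurate at record level but costs only one cell at attribute level.
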
2. Completeness
• The degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.
• Measurement Function A/B, with alternative definitions of A and B:
  - A: records with no missing attribute; B: total records in a dataset
  - A: number of data required for the particular context in the data file; B: number of data in the specified particular context of intended use
  - A: attribute fields containing values; B: records × attributes

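A small sketch of the record-level and attribute-level completeness ratios. It assumes that `None` and blank strings both count as missing, which is a choice made here for illustration; the names and sample rows are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    # Assumption: None and blank strings both count as missing values.
    return value is None or value.strip() == ""


def record_completeness(records: List[Record], expected: List[str]) -> float:
    """A/B with A = records with no missing attribute, B = total records."""
    if not records:
        return 0.0
    complete = sum(
        1
        for rec in records
        if not any(is_missing(rec.get(attr)) for attr in expected)
    )
    return complete / len(records)


def attribute_completeness(records: List[Record], expected: List[str]) -> float:
    """A/B with A = attribute fields containing values, B = records x attributes."""
    total = len(records) * len(expected)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for rec in records
        for attr in expected
        if not is_missing(rec.get(attr))
    )
    return filled / total


rows = [
    {"name": "Ada", "email": "ada@example.org"},
    {"name": "Alan", "email": ""},        # blank value
    {"name": "Grace"},                    # attribute absent
    {"name": "Edsger", "email": "ewd@example.org"},
]
expected_attributes = ["name", "email"]

print(record_completeness(rows, expected_attributes))     # 0.5
print(attribute_completeness(rows, expected_attributes))  # 0.75
```
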
3. Consistency
• The degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use.
• A particular case of inconsistency is represented by synonyms: a dictionary of terms used to define data could be useful to avoid it.
• EXAMPLE: An employee's birth date cannot be later than his “recruitment date”.

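The birth date example can be expressed as a rule check over records. This is only a sketch: the field names (`id`, `birth_date`, `recruitment_date`) and the sample staff list are assumptions made for illustration.

```python
from datetime import date
from typing import Dict, List


def birth_after_recruitment(employees: List[Dict]) -> List[str]:
    """Return the ids of employees violating birth_date <= recruitment_date."""
    return [
        emp["id"]
        for emp in employees
        if emp["birth_date"] > emp["recruitment_date"]
    ]


staff = [
    {"id": "e1", "birth_date": date(1980, 5, 1), "recruitment_date": date(2005, 9, 1)},
    {"id": "e2", "birth_date": date(2010, 1, 1), "recruitment_date": date(2001, 3, 15)},
]

print(birth_after_recruitment(staff))  # ['e2']
```
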
4. Credibility (validity)
• Validity is a weakened but more readily measured form of accuracy.
• Attribute values may be valid without being correct, but not vice versa.
• An attribute value is valid if it falls in some externally defined, domain-knowledge-dependent set of values. Validity can range from:
  - Mechanical (Example: 18/19/2002 is not a well-formed, and hence not a valid, date)
  - Logical (Example: -5 is not a valid age)
  - Domain-derived (Example: 1234 pounds is not a valid weight for a person)
  - Task dependent (Example: 16:12 may be a valid time in one database but not in another)

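The mechanical, logical and domain-derived levels of validity can be sketched as three small checks. The date format (`%d/%m/%Y`) and the upper bounds for age and weight are assumptions chosen for this illustration, not values given in the cited sources.

```python
from datetime import datetime


def is_well_formed_date(text: str, fmt: str = "%d/%m/%Y") -> bool:
    """Mechanical validity: the string parses as a calendar date in the assumed format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def is_valid_age(age: int) -> bool:
    """Logical validity: an age cannot be negative (upper bound of 150 is an assumption)."""
    return 0 <= age <= 150


def is_valid_person_weight_pounds(weight: float) -> bool:
    """Domain-derived validity: the bound of 1000 pounds is an assumed domain limit."""
    return 0 < weight <= 1000


print(is_well_formed_date("18/19/2002"))    # False
print(is_valid_age(-5))                     # False
print(is_valid_person_weight_pounds(1234))  # False
```

A value that passes all three checks is valid but may still be inaccurate, which is the sense in which validity is a weakened form of accuracy.
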
5. Currentness
• The degree to which data has attributes that are of the right age in a specific context of use.
• EXAMPLE: The timetable of a railway station must be updated with the frequency required to allow passengers to take a train even if the scheduled time or platform changes.

6. Accessibility
• The degree to which data can be accessed in a specific context of use, particularly by people who need supporting technology or special configuration because of some disability.
• EXAMPLE: Data that should be managed by a screen reader cannot be stored as an image.
• Inherent Data Quality Measure for Sound data accessibility
  - Measurement Function A/B
  - A = number of data stored only as “sound” (e.g. without a textual representation of sound)
  - B = number of data values representing a sound
• System Dependent Data Quality Measure for Multi-channel data accessibility
  - Measurement Function A/B
  - A = number of data that the differently abled user successfully accesses
  - B = number of data available

7. Compliance
• The degree to which data has attributes that adhere to standards, conventions or regulations in force and similar rules relating to data quality in a specific context of use.
• EXAMPLE: Credit risk data of a bank must comply with specific laws and standards.
• Inherent Data Quality Measure for Privacy law non-conformity: values
  - Measurement Function A
  - A = number of items that do not conform to privacy law statements due to data content
• System Dependent Data Quality Measure for Privacy law non-conformity: architecture
  - Measurement Function A
  - A = number of items that do not conform to privacy law statements due to technical architecture failures

8. Confidentiality
• The degree to which data has attributes that ensure that it is only accessible and interpretable by authorized users in a specific context of use.
• Confidentiality is an aspect of information security (together with availability and integrity) as defined in ISO/IEC 13335-1:2004.
• EXAMPLE: Data that refers to personal or confidential information like health or profit must be accessed only by authorized users or should be written in secret code.
• Inherent Data Quality Measure for Encryption usage
  - Measurement Function A/B
  - A = number of database fields encrypted
  - B = number of fields with an encryption requisite
• System Dependent Data Quality Measure for Non-vulnerability
  - Measurement Function 1 - A/B
  - A = number of successful penetrations during formal penetration tests
  - B = number of penetrations attempted

9. Efficiency
• The degree to which data has attributes that can be processed and provide the expected levels of performance by using the appropriate amounts and types of resources in a specific context of use.
• EXAMPLE: Using more space than necessary to store data can cause waste of storage, memory and time.
• Inherent Data Quality Measure for Numbers stored as strings
  - Measurement Function A
  - A = number of data stored as strings
• System Dependent Data Quality Measure for Wasted space
  - Measurement Function Σ(B - A)
  - A = benchmarked average space for efficient data storage of a database
  - B = used space for data in any physical files of the database

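Both efficiency measures can be sketched in a few lines. The sample values and file sizes are hypothetical, and treating "parses with `float()`" as the test for a number stored as a string is an assumption of this sketch.

```python
from typing import Iterable, List, Tuple


def is_numeric_string(value: object) -> bool:
    """True if value is a string that parses as a number.

    Note: float() also accepts strings such as 'nan' and 'inf';
    that is tolerated here because this is only a sketch.
    """
    if not isinstance(value, str):
        return False
    try:
        float(value)
        return True
    except ValueError:
        return False


def numbers_stored_as_strings(values: Iterable[object]) -> int:
    """Measurement function A: count of numeric values stored as strings."""
    return sum(1 for value in values if is_numeric_string(value))


def wasted_space(files: List[Tuple[int, int]]) -> int:
    """Sum of (B - A) over files, given (used_bytes B, benchmark_bytes A) pairs."""
    return sum(used - benchmark for used, benchmark in files)


column = ["42", 42, "3.14", "abc", None]
print(numbers_stored_as_strings(column))          # 2
print(wasted_space([(1200, 1000), (500, 450)]))   # 250
```
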
10. Precision
• The degree to which data has attributes that are exact or that provide discrimination in a specific context of use.
• Look for rounding errors. Example: a precision of 5 decimal places allows different functionality than a precision of 2 decimal places.
• Precision in location latitude and longitude declarations: must contain seconds in the Degree/Minute/Second system.
• Inherent Data Quality Measure for Precision of data values
  - Measurement Function A/B
  - A = number of data values with the requested precision
  - B = total number of data values
• System Dependent Data Quality Measure for Precision of fields of a database
  - Measurement Function A/B
  - A = number of data fields of the database defined with the requested precision
  - B = total number of data fields of the database

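A minimal sketch of the inherent precision measure, assuming values arrive as finite decimal strings (so that trailing zeros are preserved) and that "requested precision" means at least a given number of decimal places. The sample latitudes are hypothetical.

```python
from decimal import Decimal
from typing import List


def decimal_places(text: str) -> int:
    """Number of digits after the decimal point in a finite decimal string."""
    exponent = Decimal(text).as_tuple().exponent
    return max(0, -exponent)


def precision_ratio(values: List[str], required_places: int) -> float:
    """A/B with A = values having at least the requested precision, B = all values."""
    if not values:
        return 0.0
    precise = sum(1 for v in values if decimal_places(v) >= required_places)
    return precise / len(values)


latitudes = ["35.69620", "51.5", "48.85661", "40.41678"]
print(precision_ratio(latitudes, 5))  # 0.75
```

Parsing from strings rather than floats is deliberate: a float cannot tell "51.5" from "51.50000", so the declared precision would be lost.
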
11. Traceability
• The degree to which data has attributes that provide an audit trail of access to the data and of any changes made to the data in a specific context of use.
• EXAMPLE: Public administrations must keep information about the access executed by users for investigating who read/wrote confidential data.
• Inherent Data Quality Measure for Traceability of values
  - Measurement Function A/B
  - A = number of data items for which required traceability of values is available
  - B = number of data items for which traceability is tested
• System Dependent Data Quality Measure for Automatic traceability
  - Measurement Function A
  - A = number of data items traced automatically (using system capabilities)

12. Understandability
• The degree to which data has attributes that enable it to be read and interpreted by users, and are expressed in appropriate languages, symbols and units in a specific context of use.
• Some information about data understandability is provided by metadata.
• EXAMPLE: To represent a State (within a country), the standard acronym is more understandable than a numeric code.
• Inherent Data Quality Measure for Master data understandability due to existing metadata
  - Measurement Function A/B
  - A = number of data of master data files with existing metadata
  - B = number of data of master data files
• System Dependent Data Quality Measure for Master data understandability due to linked metadata
  - Measurement Function A/B
  - A = number of fields having metadata automatically linked to related data
  - B = total number of fields

13. Availability
• The degree to which data has attributes that enable it to be retrieved by authorized users and/or applications in a specific context of use.
• A particular case of availability is concurrent access (to read or to update data) by more than one user and/or application.
• Another case of availability is the capability of data to be available for a specific period of time.
• System Dependent Data Quality Measure for Data items availability
  - Measurement Function A/B
  - A = number of data items available during backup/restore activities
  - B = number of data items of backup/restore procedures

14. Portability
• The degree to which data has attributes that enable it to be installed, replaced or moved from one system to another, preserving the existing quality in a specific context of use.
• System Dependent Data Quality Measure for Data portability
  - Measurement Function A/B
  - A = number of data that preserved the existing quality attribute after the migration to a different computer system
  - B = number of data migrated

15. Recoverability
• The degree to which data has attributes that enable it to maintain and preserve a specified level of operations and quality, even in the event of failure, in a specific context of use.
• Recoverability can be provided by features like commit/synch point, rollback (fault-tolerance capability) or by backup-recovery mechanisms.
• EXAMPLE: When a media device has a failure, data stored in that device should be recoverable.
• System Dependent Data Quality Measure for Recoverability
  - Measurement Function A/B
  - A = number of data items successfully backed up/restored during backup/restore operation
  - B = number of data items of backup/restore procedures

Credibility (or validity)
• [3] Measurement Function A/B
  - A: records for which all entries are valid
  - B: total records in a dataset
• [5] Measurement Function A/B
  - A = number of data certified by internal audit after obtaining credit risk information data
  - B = number of data used to obtain credit risk information
• [6] Measurement Function A/B
  - A: attribute values that are valid
  - B: records × attributes
• [7] Look for artificial keys, identity values and system generated keys, and apply at least one business key to a data grouping, say in a data mart, or to a row occurrence for a registry type data group (an inventory list such as a list of persons or a list of vehicles).

Understandability
• [5] Measurement Function A/B
  - A = number of data of master data files with existing metadata
  - B = number of data of master data files
• [7] Look for lack of referential integrity in the use of the same attributes across various tables
• [7] Look for loss of history data with no record of previous values

Understandability according to Ref#2
• Look for consistency of business types that an organization is licensed for and related types of returns or transactional consistencies
• Look for lack of referential integrity in the use of the same attributes across various tables
• Applicable to uniquely traceable items like serial numbers or particular licensed item identifiers: check whether the same item can be involved with another item at the same time.
• Applies to ownership, involvement, and lineage.
• Look for loss of history data with no record of previous values

Linked Data
• The goal of the Semantic Web, or Web of Data: processing data directly or indirectly by machines
• Linked Data provides the means to reach the goal
• Refers to data published on the Web in such a way that:
  - It is machine-readable
  - Its meaning is explicitly defined
  - It is linked to other datasets
  - It can be linked to/from external datasets

Quality Characteristics of Linked Data
According to the definition of Linked Data:
• Compliance
  - HTTP URIs to identify resources
  - HTTP protocol to retrieve resources
• Understandability
  - It is machine-readable
  - Its meaning is explicitly defined
• Portability
  - RDF data model to represent resources (any application that understands the model can consume any data source published based on the model)
  - It can be linked to/from other datasets

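As an illustration of the compliance characteristic, the sketch below checks whether resource identifiers are HTTP URIs. The slides do not define a measurement function for this; expressing it as an A/B ratio in the style of the earlier measures is an assumption, as are the function names and the sample identifiers.

```python
from typing import List
from urllib.parse import urlparse


def is_http_uri(uri: str) -> bool:
    """True if the identifier uses the http or https scheme and names a host."""
    parts = urlparse(uri)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def http_uri_ratio(uris: List[str]) -> float:
    """Assumed A/B ratio: A = identifiers that are HTTP URIs, B = all identifiers."""
    if not uris:
        return 0.0
    return sum(1 for uri in uris if is_http_uri(uri)) / len(uris)


identifiers = [
    "http://dbpedia.org/resource/Spain",
    "http://sws.geonames.org/2510769",
    "urn:isbn:0451450523",          # not dereferenceable over HTTP
    "mailto:someone@example.org",   # not an HTTP URI
]

print(http_uri_ratio(identifiers))  # 0.5
```

This covers only the first compliance item (identification); checking that a URI can actually be retrieved over HTTP would need a network request and is left out.
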
Classification of Quality Characteristics in Linked Data
• Inherent data quality
  - Accuracy
  - Validity
  - Precision
• Context Related
  - Completeness
  - Currentness
• System Dependent
  - Accessibility
  - Traceability
  - Recoverability
  - Availability
  - Efficiency
  - Confidentiality (Privacy Protection and Licensing in Linked Data)
• Consistency
  - One of the biggest challenges in Linked Data is data fusion

Data Fusion
The process of integrating multiple data items representing the same real-world object into a single, consistent, and clean representation.

Co-reference
• A single URI identifies more than one resource
  - Example: a number of people in DBLP with the same name are incorrectly identified as being the same person.
• Multiple URIs identify the same resource
  - Different datasets use their own URIs to identify the same resource. People and places are entities that suffer from URI multiplicity.
  - Example: Spain has at least four URIs:
    - http://dbpedia.org/resource/Spain
    - http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain
    - http://sws.geonames.org/2510769
    - http://www4.wiwiss.fu-berlin.de/eurostat/resource/countries/Espa%C3%B1a

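When multiple URIs identify the same resource, pairwise equivalence statements can be merged into groups of co-referent URIs. The sketch below uses a union-find structure for this; the assumption that equivalences arrive as pairs (for example from `owl:sameAs` links) and the `example.org` URIs are illustrative, and this is not the method of any cited paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def coreference_classes(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group URIs into equivalence classes from pairwise 'same resource' links."""
    parent: Dict[str, str] = {}

    def find(uri: str) -> str:
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps trees shallow
            uri = parent[uri]
        return uri

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups: Dict[str, Set[str]] = defaultdict(set)
    for uri in list(parent):
        groups[find(uri)].add(uri)
    return list(groups.values())


# Assumed equivalence links; the last pair is purely hypothetical.
same_as_links = [
    ("http://dbpedia.org/resource/Spain", "http://sws.geonames.org/2510769"),
    ("http://dbpedia.org/resource/Spain",
     "http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain"),
    ("http://example.org/a", "http://example.org/b"),
]

classes = coreference_classes(same_as_links)
print(sorted(len(group) for group in classes))  # [2, 3]
```

Grouping handles only the "multiple URIs, one resource" case; a single URI that conflates several resources cannot be repaired by merging and has to be split by other means.
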
Author Disambiguation [7]
• A single author having multiple identities (variation in the spelling):
  - ‘Hugh Glaser’
  - ‘H. Glaser’
  - ‘Glaser, H.’
• Many authors who share the same name

Author Disambiguation …
• Solutions:
  - Citation matching
  - Name matching
  - Name equivalence identification
• All of them involve some form of string matching and word sense disambiguation.
• They help in identifying names with different spellings or written in different formats.
• Disambiguating authors with exactly the same name remains a challenge.

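A very small name-matching sketch shows how differently formatted names can be reduced to one comparison key. The key (lower-cased surname plus first initial) is an assumption chosen for illustration, not the technique of the cited work, and it expects non-empty names in "Given Surname" or "Surname, Given" form.

```python
from typing import Tuple


def name_key(name: str) -> Tuple[str, str]:
    """Reduce an author name to (surname, first initial), both lower-cased."""
    name = name.strip()
    if "," in name:
        # "Surname, Given" form
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        # "Given Surname" form: the last token is taken as the surname
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return (surname.lower(), given[:1].lower())


variants = ["Hugh Glaser", "H. Glaser", "Glaser, H."]
print({name_key(v) for v in variants})  # {('glaser', 'h')}

# A different (hypothetical) person collapses onto the same key.
print(name_key("Helen Glaser") == name_key("Hugh Glaser"))  # True
```

The second call illustrates the remaining challenge: string matching merges spelling variants, but it equally merges distinct authors whose names reduce to the same key.
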
Consistent Reference Services [8]
• The CRS introduces the concept of a bundle to group together resources that have been deemed to refer to the same concept within a given context.
• Different bundles may be used to group together URIs of the same resource in different contexts.
• For example, there may be a bundle containing all of the URIs about a person in the context of institution 1, and another bundle containing all of the URIs about the same person in the context of institution 2.
• Each CRS can use different algorithms to identify equivalent resources.

An Entity Name System for Linking Semantic Web Data [9]
• An Entity Name System (ENS) might play for the Semantic Web the role that the DNS played for interlinking hypertexts on the Web.

Interlinking Distributed Social Graphs [10]
1. Export social data contained within data silos (Facebook, Twitter, MySpace) into the same semantic form.
2. Link person instances from separate social networks referring to the same real-world person.
3. Publish a decentralized linked social graph.

References
[1] Markus Helfert, Institute of Information Management, University of St. Gallen, Managing and Measuring Data Quality in Data Warehousing, 2001.
[2] Leo L. Pipino, Yang W. Lee, and Richard Y. Wang, Data Quality Assessment, 2003.
[3] Alan F. Karr and Ashish P. Sanil, Data Quality: A Statistical Perspective, 2005.
[4] Kyung-Seok Ryu, Joo-Seok Park, and Jae-Hong Park, A Data Quality Management Maturity Model, ETRI Journal, Vol. 28, No. 2, 191-204, 2006.
[5] Ying Su and Zhanming Jin, A Methodology for Information Quality Assessment in Data Warehousing, ICC 2008 proceedings, IEEE Communications Society.
[6] ISO/IEC 25012, Data Quality Model, Final Draft: 2008-11-04.
[7] Afraz Jaffri, Hugh Glaser, and Ian C. Millard, URI Disambiguation in the Context of Linked Data, LDOW2008, China.
[8] Hugh Glaser, Afraz Jaffri, and Ian C. Millard, Managing Co-reference on the Semantic Web, LDOW2009, Spain.
[9] Paolo Bouquet, Heiko Stoermer, and Daniele Cordioli, An Entity Name System for Linking Semantic Web Data, LDOW2008, China.
[10] Matthew Rowe, Interlinking Distributed Social Graphs, LDOW2009, Spain.

Thank You