Using Provenance for Quality Assessment and Repair in Linked Open Data
Giorgos Flouris, Yannis Roussakis, María Poveda-Villalón, Pablo N. Mendes, Irini Fundulaki
Publication at EvoDyn-12
Setting and General Idea • Linked Open Data cloud • Uncontrolled • Vast • Unstructured • Dynamic • Datasets are interrelated, fused, etc. • Quality problems emerge • Assessment (measure quality) • Repair (improve quality)
Motivating Example • User seeks information on Brazilian cities • Fuses Wikipedia dumps from different languages (ES, FR, EN, GE, PT) • Guarantees maximal coverage, but may lead to conflicts • E.g., a city with two different population counts
Main Tasks • Assess the quality of the resulting dataset • Framework for associating data with its quality • Repair the resulting dataset • By removing one of the conflicting values (i.e., one of the conflicting population counts) • How to determine which value to keep? • Solution: use heuristics • Here, we evaluate the use of provenance-related heuristics • Prefer most recent information • Prefer most trustworthy information
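As a minimal sketch, the "prefer most recent information" heuristic could resolve a conflict like this; the record structure and function name are illustrative assumptions, not taken from the paper:

```python
from datetime import date

def prefer_most_recent(candidates):
    """Resolve a conflict by keeping the value whose source record
    was modified most recently (an illustrative heuristic)."""
    return max(candidates, key=lambda c: c["last_modified"])["value"]

# Two conflicting population counts for the same city:
conflict = [
    {"value": 69159, "last_modified": date(2012, 5, 1)},
    {"value": 69616, "last_modified": date(2011, 9, 3)},
]
print(prefer_most_recent(conflict))  # 69159
```

The trustworthiness heuristic would look the same, with a source-reputation ranking in place of the modification date.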
Contributions • Emphasis on provenance • Assessment metrics (done) • Heuristics for repair (done, but does not support metadata information) • Contributions: • Extend repair algorithm to support heuristics on metadata • Define 5 different metrics based on provenance • Used for both assessment and repair • Evaluate them in a real setting
Quality Assessment • Quality = “fitness for use” • Multi-dimensional, multi-faceted, context-dependent • Methodology for quality assessment • Dimensions • Aspects of quality • Accuracy, completeness, timeliness, … • Indicators • Metadata values for measuring dimensions • Last modification date (related to timeliness) • Scoring Functions • Functions to quantify quality indicators • Days since last modification date • Metrics • Measures of dimensions (result of scoring function) • Can be combined • We use this framework to define our metrics
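The indicator-to-metric pipeline can be sketched with the "days since last modification" example above; the linear decay and one-year horizon below are assumptions for illustration, not prescribed by the methodology:

```python
from datetime import date

def days_since_modification(last_modified, today):
    """Scoring function over the 'last modification date' indicator."""
    return (today - last_modified).days

def timeliness_score(last_modified, today, horizon_days=365):
    """Map the indicator to a [0, 1] timeliness metric: 1.0 for data
    modified today, 0.0 at or beyond the horizon. The linear decay
    and one-year horizon are illustrative choices."""
    age = days_since_modification(last_modified, today)
    return max(0.0, 1.0 - age / horizon_days)

print(timeliness_score(date(2012, 1, 1), date(2012, 7, 1)))
```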
Quality Repair (Setting) • Focus on validity (quality dimension) • Encodes context- or application-specific requirements • Applications may be useless over invalid data • Binary concept (valid/invalid) • Generic
Quality Repair (Rules) • Rules determine validity • Expressive • Disjunctive Embedded Dependencies (DEDs) • Cause interdependencies • Resolving one violation may cause or resolve others • Difficult to foresee the ramifications of repairing choices • User cannot make the selection alone
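For the functional-property rules used later in the experiments, violation detection reduces to finding subjects with more than one value for a property, a simple special case of checking DED-style rules. The flat triple representation below is an assumption for illustration:

```python
from collections import defaultdict

def functional_violations(triples, prop):
    """Find subjects that violate a functional-property rule, i.e.
    carry more than one value for `prop`; triples are (s, p, o)."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return {s: vals for s, vals in values.items() if len(vals) > 1}

triples = [
    ("Aracati", "populationTotal", 69159),
    ("Aracati", "populationTotal", 69616),
    ("Oiapoque", "populationTotal", 20226),
]
print(functional_violations(triples, "populationTotal"))
```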
Quality Repair (Preferences) • Selection is done automatically, according to a set of user-defined specifications • Which repairing option is best? • Ontology engineer determines that via preferences • Specified by ontology engineer beforehand • High-level “specifications” for the ideal repair • Serve as “instructions” to determine the preferred solution for repair • Highly expressive
Quality Repair (Extensions) • Existing work on repair is limited • Provenance cannot be considered for preferences • Assessment metrics based on provenance cannot be exploited • Extensions are needed (and provided) • Metadata (including provenance) can be used in preferences • Preferences can apply to both repairs and repairing options • Formal details omitted (see paper)
Experiments (Setting) • Setting taken from the motivating example • Fused 5 Wikipedias: EN, PT, SP, GE, FR • Distilled information about Brazilian cities • Properties considered: • populationTotal • areaTotal • foundingDate • Validity rules: properties must be functional • Repaired invalidities (using our metrics) • Checked quality of result • Dimensions: consistency, validity, conciseness, completeness and accuracy
Metrics for Experiments (1/2) • PREFER_PT: select conflicting information based on its source (PT > EN > SP > GE > FR) • PREFER_RECENT: select conflicting information based on its recency (most recent is preferred) • PLAUSIBLE_PT: ignore "irrational" data (population < 500, area < 300 km², founding date < 1500 AD); otherwise use PREFER_PT
Metrics for Experiments (2/2) • WEIGHTED_RECENT: select based on recency, but when the records are almost equally recent, use source reputation (if less than 3 months apart, use PREFER_PT; else use PREFER_RECENT) • CONDITIONAL_PT: define source trustworthiness depending on data values (prefer PT for small cities with population < 500,000; prefer EN for the rest)
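The five metrics can be sketched as selection functions over conflicting records. The record layout, the 90-day reading of "less than 3 months", and treating foundingDate values as plain years are all assumptions for illustration:

```python
from datetime import date

# Each conflicting record is assumed to look like:
#   {"value": ..., "source": "PT", "last_modified": date(...)}
# Source ranks follow the slides: PT > EN > SP > GE > FR.
SOURCE_RANK = {"PT": 0, "EN": 1, "SP": 2, "GE": 3, "FR": 4}

def prefer_pt(records):
    """PREFER_PT: keep the record from the highest-ranked source."""
    return min(records, key=lambda r: SOURCE_RANK[r["source"]])

def prefer_recent(records):
    """PREFER_RECENT: keep the most recently modified record."""
    return max(records, key=lambda r: r["last_modified"])

def plausible_pt(records, prop):
    """PLAUSIBLE_PT: drop 'irrational' values first, then fall back
    to PREFER_PT (foundingDate values treated as years here)."""
    limits = {"populationTotal": 500, "areaTotal": 300, "foundingDate": 1500}
    sane = [r for r in records if r["value"] >= limits[prop]] or records
    return prefer_pt(sane)

def weighted_recent(records):
    """WEIGHTED_RECENT: recency, unless the records are less than
    ~3 months (90 days) apart, in which case use source reputation."""
    newest = prefer_recent(records)
    oldest = min(records, key=lambda r: r["last_modified"])
    if (newest["last_modified"] - oldest["last_modified"]).days < 90:
        return prefer_pt(records)
    return newest

def conditional_pt(records):
    """CONDITIONAL_PT: trust PT for small cities (population
    < 500,000), EN otherwise; other sources rank last."""
    small = any(r["value"] < 500_000 for r in records)
    order = {"PT": 0, "EN": 1} if small else {"EN": 0, "PT": 1}
    return min(records, key=lambda r: order.get(r["source"], 9))
```

Each function returns the record to keep; the losing records would be removed by the repair algorithm.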
Consistency, Validity • Consistency • Lack of conflicting triples • Guaranteed to be perfect (by the repairing algorithm), regardless of preference • Validity • Lack of rule violations • Coincides with consistency for this example • Guaranteed to be perfect (by the repairing algorithm), regardless of preference
Conciseness, Completeness • Conciseness • No duplicates in the final result • Guaranteed to be perfect (by the fuse process), regardless of preference • Completeness • Coverage of information • Improved by fusion • Unaffected by our algorithm • Input completeness = output completeness, regardless of preference • Measured to be 77.02%
Accuracy • Most important metric for this experiment • Accuracy • Closeness to the “actual state of affairs” • Affected by the repairing choices • Compared repair with the Gold Standard • Taken from an official and independent data source (IBGE)
Accuracy Evaluation • [Diagram] DBpedia language editions (fr.dbpedia, en.dbpedia, pt.dbpedia, …) are fused and repaired into an integrated dataset (properties: dbpedia:areaTotal, dbpedia:populationTotal, dbpedia:foundingDate), which is then compared for accuracy against the Gold Standard from the Instituto Brasileiro de Geografia e Estatística (IBGE)
Accuracy Examples • City of Aracati • Population: 69159/69616 (conflicting) • Record in Gold Standard: 69159 • Good choice: 69159 • Bad choice: 69616 • City of Oiapoque • Population: 20226/20426 (conflicting) • Record in Gold Standard: 20509 • Optimal approximation choice: 20426 • Sub-optimal approximation choice: 20226
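The "good" vs. "bad" and "optimal approximation" judgments above amount to picking the candidate nearest the Gold Standard value. A minimal sketch, used only to evaluate a repair choice after the fact, not by the repair itself:

```python
def closest_to_gold(candidates, gold):
    """Return the candidate value nearest the Gold Standard record
    (illustrative evaluation helper, not part of the repair)."""
    return min(candidates, key=lambda v: abs(v - gold))

# Aracati: the gold value itself is among the candidates.
print(closest_to_gold([69159, 69616], 69159))  # 69159
# Oiapoque: neither candidate matches; 20426 is the better approximation.
print(closest_to_gold([20226, 20426], 20509))  # 20426
```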
Conclusion • Quality assessment and repair of LOD • Evaluated a set of sophisticated, provenance-inspired metrics for: • Assessing quality • Repairing conflicts • Used in a specific experimental setting • Results are necessarily application-specific • THANK YOU!