Quality and Repair Pablo N. Mendes (Freie Universität Berlin) Giorgos Flouris (FORTH) 1st year review Luxembourg, December 2011
Work Plan View WP2 (timeline figure, project months 0–48) • Task 2.1 Data quality assessment and repair (FUB): D2.1 Conceptual model and best practices for high-quality data publishing; D2.2 Methods for quality repair; D2.4 Update of D2.1; D2.6 Methods for assessing the quality of sensor data • Task 2.2 Temporal, spatial and social aspects of data (KIT): D2.3 Modelling and processing contextual aspects of data; D2.5 Proof-of-concept evaluation for modelling space and time • Task 2.3 Recommendations for enhancing best practices for data publishing (KIT): D2.7 Recommendations for contextual data publishing
Upcoming deliverables • Quality Assessment • D2.1 - Conceptual model and best practices for high-quality data publishing • Quality Enhancement • D2.2 - Methods for quality repair
Outline • Overview of Quality • Data Quality Framework • Quality Assessment • Quality Enhancement (Repair)
Quality “Fitness for use.” Joseph Juran. The Quality Control Handbook. McGraw-Hill, New York, 3rd edition, 1974.
Data Quality • Multifaceted • accurate = high quality? what about availability? timeliness? • Subjective • whether weekly updates are OK depends on the consumer • Task-dependent • task: weather forecast • the data is not good if it is not available for online query • but is it good enough for vacation planning, or for aviation? • for me, for vacation planning, it is
Data Quality Dimensions (overview figure; dimensions shown in presentation order)
Data Quality Framework (figure): two components, Quality Assessment and Quality Enhancement
Dereferenceability ACCESSIBILITY • Indicator: Dereferenceable URIs • “Do resources identified by URIs respond with RDF to HTTP requests?” • Metrics, for a dataset d and its resources r (see the sketch below): • deref(d) = count(r ∈ d | deref(r)) • ratio-deref(d) = deref(d) / count(r ∈ d) • Recommendation: • Your URIs should be dereferenceable. • Prefer reusing URIs that are dereferenceable.
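To make the metrics concrete, here is a minimal sketch in Python, assuming the `requests` library; the function names and the set of accepted RDF MIME types are illustrative, not part of the deliverable.

```python
import requests

# Illustrative set of RDF serializations counted as a successful dereference.
RDF_MIME_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}

def deref(uri: str) -> bool:
    """deref(r): does the URI answer an HTTP GET with an RDF representation?"""
    try:
        resp = requests.get(
            uri,
            headers={"Accept": "application/rdf+xml, text/turtle"},
            timeout=10,
        )
    except requests.RequestException:
        return False
    content_type = resp.headers.get("Content-Type", "").split(";")[0].strip()
    return resp.ok and content_type in RDF_MIME_TYPES

def ratio_deref(resources: list) -> float:
    """ratio-deref(d): fraction of the dataset's resources that dereference."""
    return sum(deref(r) for r in resources) / len(resources) if resources else 0.0
```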
Access methods ACCESSIBILITY • Indicator: Access methods • “Data is accessible in varied and recommended ways.” • Metrics: • sample(d): {0,1} “example resource available for d” • endpoint(d): {0,1} “SPARQL endpoint available for d” • dump(d): {0,1} “RDF dumps available for d” • Recommendation: • Provide as many access methods as possible • A sample resource provides a quick view into the type of data you serve. • SPARQL endpoints for clients to obtain part of the data • Dumps are cheaper than alternatives when bulk access is needed
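The three indicators could be probed as sketched below, reusing deref() from the previous sketch; the trivial SPARQL ASK query and the HEAD request on the dump URL are one plausible way to test an endpoint and a dump, and all parameter names are illustrative.

```python
def access_methods(sample_uri: str, endpoint_url: str, dump_url: str) -> dict:
    """sample(d), endpoint(d), dump(d) as boolean {0,1}-style indicators."""

    def sparql_ok(url: str) -> bool:
        # A trivial ASK query checks whether the endpoint answers SPARQL at all.
        try:
            resp = requests.get(
                url,
                params={"query": "ASK { ?s ?p ?o }"},
                headers={"Accept": "application/sparql-results+json"},
                timeout=10,
            )
            return resp.ok
        except requests.RequestException:
            return False

    def head_ok(url: str) -> bool:
        # HEAD confirms the dump exists without downloading it.
        try:
            return requests.head(url, allow_redirects=True, timeout=10).ok
        except requests.RequestException:
            return False

    return {
        "sample": deref(sample_uri),   # deref() from the previous sketch
        "endpoint": sparql_ok(endpoint_url),
        "dump": head_ok(dump_url),
    }
```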
Availability ACCESSIBILITY • Indicator: Availability • “Average availability over a time interval” • Metric (sketched below): • avail(d) = Σ_{h=1..24} deref_h(sample(d)) / 24, i.e., one probe per hour averaged over the day • Alternatively, an HTTP HEAD request instead of deref() • Recommendation: • the higher the better
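A sketch of the metric, reusing deref() from above; in practice the 24 hourly probes would be scheduled externally (e.g., by cron) and their results aggregated, rather than sleeping in-process as this toy loop does.

```python
import time

def avail(sample_uri: str, probes: int = 24, interval_s: int = 3600) -> float:
    """avail(d): average availability over `probes` checks, one per hour."""
    successes = 0
    for _ in range(probes):
        successes += deref(sample_uri)  # or an HTTP HEAD, as noted above
        time.sleep(interval_s)
    return successes / probes
```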
Accessibility Dimensions ACCESSIBILITY (with example measurements) • Dereferenceability: HTTP GET / HEAD • Availability: hourly derefs • Access methods: URI, Bulk, SPARQL • Response time: timed deref • Robustness: requests per minute • Reachability: LOD cloud inlinks
Representational: Interpretability REPRESENTATIONAL • Indicator: Human/Machine interpretability • “URI is dereferenceable to human- and machine-readable formats” • Metric (see the sketch below): • format(deref(r)) ∈ Fh ∪ Fm : {0,1} • Fh = {HTML, XHTML+RDFa, ...} • Fm = {NT, RDF/XML, ...} • Recommendation: • Resources should dereference at least to human-readable HTML and one widely adopted RDF serialization.
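The dual check can be sketched as two content-negotiated requests against the same URI, reusing the `requests` import from the earlier sketch; the concrete format sets Fh and Fm below are illustrative.

```python
F_HUMAN = {"text/html", "application/xhtml+xml"}                             # Fh
F_MACHINE = {"application/rdf+xml", "text/turtle", "application/n-triples"}  # Fm

def interpretability(uri: str) -> dict:
    """Does the URI dereference to both human- and machine-readable formats?"""

    def negotiates(accept: str, formats: set) -> bool:
        try:
            resp = requests.get(uri, headers={"Accept": accept}, timeout=10)
        except requests.RequestException:
            return False
        ctype = resp.headers.get("Content-Type", "").split(";")[0].strip()
        return resp.ok and ctype in formats

    return {
        "human": negotiates("text/html", F_HUMAN),
        "machine": negotiates("text/turtle, application/rdf+xml", F_MACHINE),
    }
```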
Vocabulary understandability REPRESENTATIONAL • Indicator: Schema understandability • “Schema terms are familiar to existing agents.” • Metrics: • vocab-underst(d) = triples(v,d) * triples(v,D) / triples(D), weighting a vocabulary v's use in d by its deployment across all datasets D • Alternative: PageRank (the probability that a random surfer has found v) • Recommendation: • Reuse widely deployed vocabularies.
Representational Dimensions REPRESENTATIONAL (with example measurements) • Human/Machine Interpretability: HTML, RDF • Vocabulary Understandability: vocabulary usage stats • Representational Conciseness: triples / byte
Contextual Dimensions CONTEXTUAL DIMENSIONS • Completeness • Full set of objects and attributes with respect to a task • Conciseness • Amount of duplicate entries, redundant attributes • Coherence • How well instance data conforms to the schema
Contextual Dimensions CONTEXTUAL DIMENSIONS • Verifiability • How easy is it to check the data? • Can use provenance information. • Validity • Encodes context- or application-specific requirements
Intrinsic Dimensions INTRINSIC DIMENSIONS • Accuracy • usually estimated; may be available for sensors • Timeliness • can use last update • Consistency • two or more values do not conflict with each other • Objectivity • Can be traced via provenance
Example: AEMET • Metadata entry:http://thedatahub.org/dataset/aemet • Example item: http://aemet.linkeddata.es/page/resource/WeatherStation/id08001?output=ttl • Access methods: Example URI, SPARQL, Bulk • Availability: • Example URI: available • SPARQL Endpoint: 100% • Format Interpretability: • TTL=OK • RDF/XML=OK • Verifiability: • Published by third party http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=aemet
Data Quality Framework (figure, revisited): from Quality Assessment to Quality Enhancement
Validity as a Quality Indicator • Validity is an important quality indicator • Encodes context- or application-specific requirements • Applications may be useless over invalid data • Binary concept (valid/invalid) • Two steps to guarantee validity (the repair process): • Identifying invalid ontologies (diagnosis) • Detecting invalidities in an automated manner • Subtask of Quality Assessment • Removing invalidities (repair) • Repairing invalidities in an automated manner • Subtask of Quality Enhancement
Diagnosis • Expressing validity using validity rules over an adequate relational schema • Examples: • Properties must have a unique domain • ∀p: Prop(p) → ∃a: Dom(p,a) • ∀p,a,b: Dom(p,a) ∧ Dom(p,b) → (a = b) • Correct classification in property instances • ∀x,y,p,a: P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a) • ∀x,y,p,a: P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a) • Diagnosis is thus reduced to relational queries (see the sketch below)
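To illustrate the reduction, here is a sketch using Python's built-in sqlite3; the relational schema mirrors the predicates above, and each query returns the tuples violating one rule. Table and column names are illustrative, not from the deliverable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Prop(p TEXT);
CREATE TABLE Dom(p TEXT, a TEXT);
CREATE TABLE Rng(p TEXT, a TEXT);
CREATE TABLE P_Inst(x TEXT, y TEXT, p TEXT);
CREATE TABLE C_Inst(x TEXT, a TEXT);
""")

# "Properties must have a unique domain": a property with two distinct
# domains, or a property with no domain at all, violates the rule.
multiple_domains = conn.execute("""
    SELECT DISTINCT d1.p FROM Dom d1
    JOIN Dom d2 ON d1.p = d2.p AND d1.a <> d2.a
""").fetchall()
missing_domain = conn.execute("""
    SELECT p FROM Prop WHERE p NOT IN (SELECT p FROM Dom)
""").fetchall()

# "Correct classification in property instances": the subject of a property
# instance must be an instance of the property's domain.
misclassified = conn.execute("""
    SELECT pi.x, pi.p, d.a
    FROM P_Inst pi JOIN Dom d ON pi.p = d.p
    WHERE NOT EXISTS (SELECT 1 FROM C_Inst c WHERE c.x = pi.x AND c.a = d.a)
""").fetchall()
```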
Example Ontology O0 (figure: schema classes Sensor, SpatialThing, Observation; property geo:location; data items Item1, ST1) • O0 contains: Class(Sensor), Class(SpatialThing), Class(Observation), Prop(geo:location), Dom(geo:location,Sensor), Rng(geo:location,SpatialThing), Inst(Item1), Inst(ST1), P_Inst(Item1,ST1,geo:location), C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing) • Violated rule (correct classification in property instances): ∀x,y,p,a: P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a) • Here P_Inst(Item1,ST1,geo:location) ∈ O0 and Dom(geo:location,Sensor) ∈ O0, but C_Inst(Item1,Sensor) ∉ O0: Sensor is the domain of geo:location, yet Item1 is not a Sensor • Three minimal resolutions: • Remove P_Inst(Item1,ST1,geo:location) • Remove Dom(geo:location,Sensor) • Add C_Inst(Item1,Sensor)
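Continuing the sqlite3 sketch from the diagnosis slide, loading the relevant facts of O0 reproduces the violation and enumerates its three minimal resolutions:

```python
# Load the facts of O0 that matter for the violated rule.
conn.execute("INSERT INTO Dom VALUES ('geo:location', 'Sensor')")
conn.execute("INSERT INTO P_Inst VALUES ('Item1', 'ST1', 'geo:location')")
conn.executemany("INSERT INTO C_Inst VALUES (?, ?)",
                 [("Item1", "Observation"), ("ST1", "SpatialThing")])

for x, p, a in conn.execute("""
        SELECT pi.x, pi.p, d.a FROM P_Inst pi JOIN Dom d ON pi.p = d.p
        WHERE NOT EXISTS (SELECT 1 FROM C_Inst c WHERE c.x = pi.x AND c.a = d.a)
        """):
    # Each violated ground rule admits three minimal resolutions:
    # retract either premise, or add the missing conclusion.
    print(f"violation: P_Inst for {x} via {p}, Dom({p},{a}), but no C_Inst({x},{a})")
    print("  option 1: remove the P_Inst fact")
    print(f"  option 2: remove Dom({p},{a})")
    print(f"  option 3: add C_Inst({x},{a})")
```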
Preferences for Repair • Which repair option is best? • The ontology engineer determines that via preferences, specified beforehand • High-level “specifications” of the ideal repair • They serve as “instructions” for determining the preferred solution
Preferences (On Ontologies) (figure): the candidate repair results O1, O2, O3 of O0 are scored as whole ontologies (Score: 3, Score: 4, Score: 6 respectively)
Preferences (On Deltas) (figure): each candidate repair is scored by its delta from O0: -P_Inst(Item1,ST1,geo:location) has Score: 2, -Dom(geo:location,Sensor) has Score: 4, +C_Inst(Item1,Sensor) has Score: 5
Preferences • Preferences on ontologies are result-oriented • Consider the quality of the repair result • Ignore the impact of repair • Popular options: prefer newest information, prefer trustable information • Preferences on deltas are more impact-oriented • Consider the impact of repair • Ignore the quality of the repair result • Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size • Two sides of the same coin (equivalent options) • Quality metrics can be used for stating preferences • Metadata on the data may be needed • Can be qualitative or quantitative
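As one way to make a quantitative, impact-oriented preference concrete, the sketch below scores candidate deltas, charging schema changes more than instance changes (in the spirit of "minimize schema changes"); the weights and the (op, fact) encoding are illustrative.

```python
SCHEMA_PREDICATES = {"Prop", "Dom", "Rng"}  # schema-level facts cost more to change

def delta_cost(delta: list) -> int:
    """Score a delta given as (op, fact) pairs, e.g. ('-', 'Dom(geo:location,Sensor)')."""
    cost = 0
    for _op, fact in delta:
        predicate = fact.split("(", 1)[0]
        cost += 5 if predicate in SCHEMA_PREDICATES else 1
    return cost

def preferred_delta(candidates: list) -> list:
    """Impact-oriented preference: the delta with the lowest cost wins."""
    return min(candidates, key=delta_cost)

# The three resolutions of the running example, encoded as singleton deltas:
options = [[("-", "P_Inst(Item1,ST1,geo:location)")],
           [("-", "Dom(geo:location,Sensor)")],
           [("+", "C_Inst(Item1,Sensor)")]]
print(preferred_delta(options))  # an instance-level change (cost 1) beats the schema change
```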
Generalizing the Approach • For one violated constraint • Diagnose the invalidity • Determine the minimal ways to resolve it • Determine and return the preferred resolution • For many violated constraints • The problem becomes more complicated • More than one resolution step is required • Issues: • Resolution order • When and how to filter non-preferred solutions? • Constraint (and resolution) interdependencies
Constraint Interdependencies • A given resolution may: • Cause other violations (bad) • Resolve other violations (good) • Cannot pre-determine the best resolution • Difficult to predict the ramifications of each one • Exhaustive search required • Recursive, tree-based search (resolution tree) • Two ways to create the resolution tree • Globally-preferred (GP), locally-preferred (LP) • When and how to filter non-preferred solutions?
Resolution Tree Creation (GP) • Globally-preferred (GP): find all minimal resolutions for all the violated constraints, then find the preferred ones • Find all minimal resolutions for one violation • Explore them all • Repeat recursively until consistent • Return the preferred leaves (figure: resolution tree with the preferred repairs as the returned leaves)
Resolution Tree Creation (LP) • Locally-preferred (LP): find the minimal and preferred resolutions for one violated constraint, then repeat for the next • Find all minimal resolutions for one violation • Explore the preferred one(s) • Repeat recursively until consistent • Return all remaining leaves (figure: resolution tree with the preferred repair as the returned leaf; a generic sketch of both strategies follows)
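A generic sketch of both tree-creation strategies; the callables for diagnosis, resolution enumeration, delta application, and preference scoring are assumed to be supplied by the caller (all names here are illustrative, not from the papers), and resolution is assumed to terminate.

```python
def repair(ontology, violations, resolutions, apply_delta, score, strategy="GP"):
    """violations(o): violated constraints of o; resolutions(o, v): minimal
    deltas resolving violation v; apply_delta(o, d): the changed ontology;
    score(o): preference rank of a candidate (lower is better)."""
    leaves = []

    def expand(state):
        pending = violations(state)
        if not pending:                  # consistent: a leaf of the resolution tree
            leaves.append(state)
            return
        options = resolutions(state, pending[0])   # resolve one violation at a time
        if strategy == "LP":             # LP: explore only the locally preferred one(s)
            best = min(score(apply_delta(state, o)) for o in options)
            options = [o for o in options if score(apply_delta(state, o)) == best]
        for option in options:           # GP: explore them all
            expand(apply_delta(state, option))

    expand(ontology)
    if strategy == "GP":                 # GP: keep only the globally preferred leaves
        best = min(score(leaf) for leaf in leaves)
        return [leaf for leaf in leaves if score(leaf) == best]
    return leaves                        # LP: return all remaining leaves
```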
Comparison (GP versus LP) • Characteristics of GP: exhaustive; less efficient (large resolution trees); always returns the most preferred repairs; insensitive to constraint syntax; does not depend on resolution order • Characteristics of LP: greedy; more efficient (small resolution trees); does not always return the most preferred repairs; sensitive to constraint syntax; depends on resolution order
Algorithm and Complexity • Detailed complexity analysis for GP/LP and various types of constraints and preferences • Inherently difficult problem • Exponential complexity (in general) • Main exception: LP is polynomial (in special cases) • Theoretical complexity is misleading as to the actual performance of the algorithms
Performance in Practice • Linear with respect to ontology size • Linear with respect to tree size, which is determined by: • the types of violated constraints (tree width) • the number of violations (tree height) – this causes the exponential blowup • constraint interdependencies (tree height) • the preference (for LP only): affects pruning (tree width) • Further performance improvements: • use optimizations • use LP with a restrictive preference
Evaluation Parameters • Evaluation • Effect of ontology size (for GP/LP) • Effect of tree size (for GP/LP) • Effect of violations (for GP/LP) • Effect of preference (relevant for LP only) • Quality of LP repairs • Preliminary results support our claims: • Linear with respect to ontology size • Linear with respect to tree size
Publications • Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011 • Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF/S DBs. Tentative title, to be submitted to PVLDB, January 2012
Outlook • Continue refining the model based on experience with the dataset catalog • Derive “best practices checks” from the metrics • Results of quality assessment to be added to the next release of the catalog • Collaboration with the EU-funded LOD2 project (FP7) towards Data Fusion based on the PlanetData Quality Framework • Finalize experiments for Data Repair