860 likes | 995 Views
Linked Data Integration (using reasoning). Aidan Hogan. Day 3 Session 2. What is reasoning?. Reasoning: Conceptual Overview. (Loosely) Deriving novel conclusions from existing knowledge Deductive reasoning : inferring new facts from existing rules and facts
E N D
Linked Data Integration (using reasoning) Aidan Hogan Day 3 Session 2
Reasoning: Conceptual Overview (Loosely) Deriving novel conclusions from existing knowledge Deductive reasoning: inferring new facts from existing rules and facts Given rule: All Kia cars are made in Korea; Given premise (fact): Fred’s car is a Kia Entails fact:Fred’s car was made in Korea Inductive reasoning: learning new rules from existing facts and entailments (typically what us humans do: build imprecise rules from models) Given model of existing facts: All Kia cars I’ve seen have four wheels ~Entails rule: All Kias have four wheels Given fact:Fred’s car is a Kia Entails (probabilistic fact): Fred’s car likely has four wheels Abductive reasoning: guess a premise from a conclusion (similar in principle to a form of inductive reasoning) Given entailment: Fred’s car is Korean and has four wheels? Rule: Kias and Hyundais are Korean and typically have four wheels Guessed premise:Fred’s car is a Kia or a Hyundai?
Reasoning: Clearing up terms • Semantics: formally defined meanings of terms • KiaCar ⊑ KoreanCar ⊓ FourWheels • Entailments: the conclusions which follow from formal semantics • Inference: a procedure to compute entailments • (or the entailments that can be computed therefrom)
Reasoning: Example Conceptual Tasks • Conjunctive Query Answering:generate new answers for questions • Include Fred as an answer to “give me friends of mine who own Korean cars” • Subsumption checking: identify subclass relationships • Is the class KiaCar a subset of KoreanCar if all Kias are manufactured in Seoul? • Class-Satisfiability checking: identify if a class can have a membership • Can there be something which is a KoreanCar and a EuropeanCar? • Consistency checking: identify formally conflicting information • Fred tells me his Kia is European; is this correct? • Instance checking: identify if an individual is a member of a given class • Is Freds Kia an Asian Car?
Reasoning: RDFS and OWL (deductive) • Formal semantics of RDFS and OWL can be leveraged for reasoning. • :KiaCar rdfs:subClassOf :KoreanCar , • [ owl:hasValue :Seoul ; owl:onProperty :manufacturedIn ] • :FredsCar a :KiaCar . • Implies • :FredsCar a KoreanCar , :manufacturedIn :Seoul .
Reasoning: OWL (2) • Eight sub-languages of OWL! • Why eight?(^^) • Direct Semantics (based on Description Logics): • OWL DL (NExpTime for non-QA tasks), OWL Lite (ExpTime for non-QA) • OWL 2 EL, OWL 2 QL, OWL 2 RL (All PTime for most tasks for non-QA) • OWL 2 RL (2NExpTime for non-QA) • Emphasis on soundness and completeness • Tableaux-based algorithms • Based on KB-satisfiability checking • Syntactic restrictions to preserve complexity • e.g., no datatype inverse-functional properties • RDF-Based Semantics (layered directly on top of RDFS) • OWL Full, OWL 2 Full • All tasks are undecidable! • No complete, correct inference procedure can exist for the reasoning tasks • Incomplete reasoning possible through rules… Opinion: OWL/OWL 2 useful stuff, but an extremely complex standard!!!
RDFS and OWL 2 RL: Entailment rules • RDFS entailment rules provide sound, complete RDFS reasoning • OWL 2 RL/RDF provide partial support for OWL 2 RDF-based semantics • Monotonic rules which are guarded • Positive subset of datalog with a fixed ternary predicate • Rules have cubic complexity (with trivial exceptions aside) • Due to the arity of triples (3)
Rules IF⇒THEN Body/Antecedent/Condition Head/Consequent ?c1 rdfs:subClassOf ?c2 . ?x rdf:type ?c1 . ⇒?x rdf:type ?c2 . • foaf:Person rdfs:subClassOf foaf:Agent . • timbl:me rdf:type foaf:Person . • ⇒timbl:me rdf:type foaf:Agent . Schema/Terminology/ Ontological Instance/Assertional
Rules (Inconsistencies [a.k.a. Contradictions]) IF⇒THEN Body/Antecedent/Condition Head/Consequent ?c1 owl:disjointWith ?c2 . ?x rdf:type ?c1 . ?x rdf:type ?c2 . ⇒false • foaf:Person owl:disjointWith foaf:Organization . • w3c rdf:type foaf:Organization . • w3c rdf:type foaf:Person . • ⇒false
Linked Data reasoning? …integration use-case!
Linked Data Reasoning explicit data implicit data How can consumers query the implicit data
…so what’s The Problem?… …heterogeneity …need to integrate data from different sources
Take Query Answering… foaf:page Gimmewebpages relating to Tim Berners-Lee timbl:i timbl:ifoaf:page?pages .
Hetereogenity inschema… webpage: properties = rdfs:subPropertyOf mo:musicBrainz = owl:inverseOf doap:homepage mo:myspace … foaf:homepage foaf:weblog foaf:primaryTopic foaf:isPrimaryTopicOf foaf:page foaf:topic
Linked Data, RDFS and OWL: Linked Vocabularies … SKOS … Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg:; Giasson, Bergman
Hetereogenity in naming… Tim Berners-Lee: URIs dblp:100007 timbl:i db:Tim-Berners_Lee identica:45563 = owl:sameAs … fb:en.tim_berners-lee adv:timbl
Returning to our simple query… mo:myspace foaf:primaryTopic foaf:page SKOS foaf:topic doap:homepage foaf:homepage Gimmewebpages relating to Tim Berners-Lee foaf:isPrimaryTopicOf identica:45563 adv:timbl db:Tim-Berners_Lee dblp:100007 fb:en.tim_berners-lee timbl:i timbl:ifoaf:page?pages . ...7 x 6 = 42 possible patterns
Challenges… …what (OWL) reasoning is feasible for Linked Data?
Linked Data Reasoning: Challenges Scalable Expressive Domain-Agnostic Robust
Linked Data Reasoning: Challenges • Scalability • At least tens of billions of statements (for the moment) • Near linear scale!!! • Noisy data • Inconsistencies galore • Publishing errors
What about noise? … …need to consider the provenance of Web data
Noisy Data: Omnipotent Being • Web data is noisy. • Proof: • 08445a31a78661b5c746feff39a9db6e4e2cc5cf • sha1-sum of ‘mailto:’ • common value for foaf:mbox_sha1sum • An inverse-functional (uniquely identifying) property!!! • Any person who shares the same value will be considered the same • Q.E.D.
Noisy Data: Redefining everything • More proof (courtesy ofhttp://www.eiao.net/rdf/1.0) • rdf:type rdf:type owl:Property . • rdf:type rdfs:label “type”@en . • rdf:type rdfs:comment “Type of resource” . • rdf:type rdfs:domain eiao:testRun . • rdf:type rdfs:domain eiao:pageSurvey . • rdf:type rdfs:domain eiao:siteSurvey . • rdf:type rdfs:domain eiao:scenario . • rdf:type rdfs:domain eiao:rangeLocation . • rdf:type rdfs:domain eiao:startPointer . • rdf:type rdfs:domain eiao:endPointer . • rdf:type rdfs:domain eiao:header . • rdf:type rdfs:domain eiao:runs .
Noisy Data: Inconsistency w3c rdf:type foaf:Organization . w3c rdf:type foaf:Person . foaf:Person owl:disjointWith foaf:Organization .
Consider source of schema data Class/property URIs dereference to their authoritative document FOAF spec authoritative for foaf:Person✓ MY spec not authoritative for foaf:Person✘ Allow “extension” in third-party documents my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓ BUT: Reduce obscure memberships foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘ ALSO: Protect specifications foaf:knows a owl:SymmetricProperty . (MY spec) ✘ AuthoritativeReasoning
Noisy Data: Redefining everything • More proof (courtesy ofhttp://www.eiao.net/rdf/1.0) • rdf:type rdf:type owl:Property . • rdf:type rdfs:label “type”@en . • rdf:type rdfs:comment “Type of resource” . • rdf:type rdfs:domain eiao:testRun . • rdf:type rdfs:domain eiao:pageSurvey . • rdf:type rdfs:domain eiao:siteSurvey . • rdf:type rdfs:domain eiao:scenario . • rdf:type rdfs:domain eiao:rangeLocation . • rdf:type rdfs:domain eiao:startPointer . • rdf:type rdfs:domain eiao:endPointer . • rdf:type rdfs:domain eiao:header . • rdf:type rdfs:domain eiao:runs . Not Authoritative 29
Authoritative Reasoning: read more …w/ essential plugs Gong Cheng, Yuzhong Qu. "Integrating Lightweight Reasoning into Class-Based Query Refinement for Object Search." ASWC 2008. Aidan Hogan, Andreas Harth, Axel Polleres. "Scalable Authoritative OWL Reasoning for the Web." IJSWIS 2009. Aidan Hogan, Jeff Z. Pan, Axel Polleres and Stefan Decker. "SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples." ISWC 2010. My thesis: http://aidanhogan.com/docs/thesis/
Alternative to Authoritative Reasoning? • Quarantined reasoning! • Separate and cache hierarchy of schema documents/dependencies… 31
Quarantined Reasoning [Delbru et al.; 2008] A-Box / Instance Data (e.g, a FOAF file) T-Box / Ontology Data (e.g., the FOAF ontology and its indirect imports) 35
Noisy Data: Redefining everything • More proof (courtesy ofhttp://www.eiao.net/rdf/1.0) • rdf:type rdf:type owl:Property . • rdf:type rdfs:label “type”@en . • rdf:type rdfs:comment “Type of resource” . • rdf:type rdfs:domain eiao:testRun . • rdf:type rdfs:domain eiao:pageSurvey . • rdf:type rdfs:domain eiao:siteSurvey . • rdf:type rdfs:domain eiao:scenario . • rdf:type rdfs:domain eiao:rangeLocation . • rdf:type rdfs:domain eiao:startPointer . • rdf:type rdfs:domain eiao:endPointer . • rdf:type rdfs:domain eiao:header . • rdf:type rdfs:domain eiao:runs . Not In Here
Quarantined Reasoning: read more R. Delbru, A. Polleres, G. Tummarello and S. Decker. "Context Dependent Reasoning for Semantic Documents in Sindice. “ 4th International Workshop on Scalable Semantic Web Knowledge Base Systems, 2008.
Resolving Inconsistency? • Use links-analysis (PageRank) to rank documents and triples • Use annotated reasoning to rank inferences • Repair each consistency by removing the weakest triple • Read more: • Piero A. Bonatti, Aidan Hogan, Axel Polleres and Luigi Sauro. "Robust and Scalable Linked Data Reasoning Incorporating Provenance and Trust Annotations". In the Journal of Web Semantics (in press).
What about scale? … …using positive (monotonic) rules. Expressive reasoning (also) possible through tableaux, but yet to demonstrate desired scale
Materialisation • Forward-chaining Materialisation • Avoid runtime expense • Users taught impatience by Google • Pre-compute for quick retrieval • Web-scale systems should scale well • More data = more disk-space/machines Don't materialise too much! One size does not fit all!
OUTPUT: • Flat file of (partial) inferred triples (quads) • INPUT: • Flat file of triples (quads)
What rules? • Let’s look at a recent corpus of Linked Data and see what schema’s inside • (and what the rulesets support) • Open-domain crawl May 2010 • 1.1 billion quadruples • 3.985 million sources (docs) • 780 pay-level domains (e.g., dbpedia.org) • Ran “special” PageRank over documents • 86 thousand docs contained some RDFS/OWL schema data (2.2% of docs... but <0.2% of triples) • Summated ranks of docs using each primitive
Survey of Linked Data schema: Top 15 ranks # Axiom Rank(Σ) RDFS Horst O2R • rdfs:subClassOf 0.295 ✓✓✓ • rdfs:range 0.294 ✓✓✓ • rdfs:domain 0.292 ✓✓✓ • rdfs:subPropertyOf 0.090 ✓✓✓ • owl:FunctionalProperty 0.063 ✘✓✓ • owl:disjointWith 0.049 ✘✘✓ • owl:inverseOf 0.047 ✘✓✓ • owl:unionOf 0.035 ✘✘✓ • owl:SymmetricProperty 0.033 ✘✓✓ • owl:TransitiveProperty 0.030 ✘✓✓ • owl:equivalentClass 0.021 ✘✓ ✓ • owl:InverseFunctionalProperty 0.030 ✘✓✓ • owl:equivalentProperty 0.030 ✘✓✓ • owl:someValuesFrom 0.030 ✘✓✓ • owl:hasValue 0.028 ✘✓✓
ScalableReasoning: In-mem T-Box Main optimisation: Store T-Box in memory T-Box: (loosely) data describing classes and properties. Aka. schemata/vocabularies/ontologies/terminologies. E.g., foaf:topic owl:inverseOf foaf:page . sioc:UserAccount rdfs:subClassOf foaf:OnlineAccount . Most commonly accessed datafor reasoning Quite small (~0.1% for our Linked Data corpus) High selectivity (if you prefer) A-Box:Lots?s foaf:page ?o . vs. T-Box:Fewfoaf:page ?p ?o .+?s ?p foaf:page . 44
Scan 1: Scan input data separate T-Box statements, load T-Box statements into memory Do T-Box level reasoning if required (semi-naïve) Scan 2: Scan all on-disk data, join with in-memory T-Box. ScalableReasoning: Two Scans 45
Scalable Reasoning: No A-Box Joins ON-DISKA-BOX • Execution of three rules: OWL 2 RL ruleprp-inv1 ?p1 owl:inverseOf ?p2 . ?x ?p1 ?y . ⇒ ?y ?p2 ?x . OWL 2 RL ruleprp-rng ?p rdfs:range ?c . ?x ?p ?y. ⇒ ?y a ?c . OWL 2 RL ruleprp-spo1 ?p1 rdfs:subPropertyOf ?p2 . ?x ?p1 ?y. ⇒ ?x ?p2 ?y . ... ex:me foaf:homepage ex:hp . ... IN-MEMT-BOX ON-DISK OUTPUT ... ex:hp rdf:type foaf:Document . ex:me foaf:page ex:hp . ex:hp foaf:topic ex:me . ... 46
Scalable Reasoning: A-Box joins? 47 • However: some rules do require A-Box joins • ?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p z . ⇒ ?x ?p ?z . • Difficult to engineer a scalable solution (which reaches a fixpoint) for Linked Data(?) • Can lead to quadratic inferences • A lot of useful reasoning still possible without A-Box joins…
Features not requiring A-Box joins 48 • rdfs:subClassOf 0.295 ✓ • rdfs:range 0.294 ✓ • rdfs:domain 0.292 ✓ • rdfs:subPropertyOf 0.090 ✓ • owl:FunctionalProperty 0.063 ✘ • owl:disjointWith 0.049 ✘ • owl:inverseOf 0.047 ✓ • owl:unionOf 0.035 ✓ • owl:SymmetricProperty 0.033 ✓ • owl:equivalentClass 0.021 ✓ • owl:InverseFunctionalProperty 0.030 ✘ • owl:equivalentProperty 0.030 ✓ • owl:someValuesFrom 0.030 ✓/✘
Scalable Distributed Reasoning ... ... ex:me ex:presented ex:ThisTalk ... ... ... ex:me ex:presented ex:ThisTalk ... ... ... ex:me ex:presented ex:ThisTalk ... ... ... ex:me ex:presented ex:ThisTalk ... ... ... ex:me ex:presented ex:ThisTalk ... ... EXTRACTT-BOX EXTRACT T-BOX EXTRACTT-BOX EXTRACTT-BOX EXTRACTT-BOX COLLECTT-BOX COLLECTT-BOX COLLECTT-BOX COLLECTT-BOX COLLECTT-BOX SAMET-BOX SAMET-BOX SAMET-BOX SAMET-BOX SAMET-BOX ... ... ... ... ... DIFF.A-BOX DIFF.A-BOX DIFF.A-BOX DIFF.A-BOX DIFF.A-BOX ... ... ex:me ex:presented ex:ThisTalk ... ... ... ex:me ex:presented ex:ThisTalk ... ... ... ex:me ex:presented ex:ThisTalk ... ... ... ex:me ex:presented ex:ThisTalk ... ... ... ex:me ex:presented ex:ThisTalk ... ... LOCAL OUTPUT LOCAL OUTPUT LOCAL OUTPUT LOCAL OUTPUT LOCAL OUTPUT ... ... ex:me ex:presented ex:ThisTalk ... ... ... ex:me rdf:type ex:Awesome . ... ... ex:me ex:presented ex:ThisTalk ... ... ex:me ex:presented ex:ThisTalk ... ... ex:me ex:presented ex:ThisTal 50