Scalable Integration and Processing of Linked Data
Andreas Harth, Aidan Hogan, Spyros Kotoulas, Jacopo Urbani
Outline
• Session 1: Introduction to Linked Data
  • Foundations and Architectures
  • Crawling and Indexing
  • Querying
• Session 2: Integrating Web Data with Reasoning
  • Introduction to RDFS/OWL on the Web
  • Introduction and Motivation for Reasoning
• Session 3: Distributed Reasoning: Because Size Matters
  • Problems and Challenges
  • MapReduce and WebPIE
• Session 4: Putting Things Together (Demo)
  • The LarKC Platform
  • Implementing a LarKC Workflow
PART 1: How can we query Linked Data?
PART 2: How can we reason over Linked Data? (start of Session 2)
Answer: SPARQL (W3C Rec. 2008) …SPARQL 1.1 upcoming (W3C Rec. 201?)
Introducing SPARQL SPARQL Protocol and RDF Query Language (SPARQL) • Standardised query language (and supporting recommendations) for querying RDF • ~SQL-like language • …but only if you squint • …and without the vendor-specific headaches
The anatomy of a typical SPARQL query
Give me a list of names of professors in Southampton and their expertise (if available), in order of their surname

# PREFIX DECLARATIONS
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oo: <http://purl.org/openorg/>
# RESULT CLAUSE
SELECT ?name ?expertise
# DATASET CLAUSE
FROM NAMED <http://data.southampton.ac.uk/>
# QUERY CLAUSE
WHERE {
  ?person foaf:name ?name ; foaf:familyName ?surname .
  ?person rdf:type foaf:Person .
  ?person foaf:title ?title . FILTER regex(?title, "^Prof")
  OPTIONAL {
    ?person oo:availableToCommentOn ?expertiseURI .
    ?expertiseURI rdfs:label ?expertise
  }
}
# SOLUTION MODIFIERS
ORDER BY ?surname
Prefix Declarations
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oo: <http://purl.org/openorg/>
foaf:Person ⇔ <http://xmlns.com/foaf/0.1/Person>
Use http://prefix.cc/ …
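A minimal Python sketch of what a prefix declaration does: it expands a prefixed name such as foaf:Person into a full URI. The `PREFIXES` table and the `expand` helper are illustrative names chosen here, not part of any SPARQL library.

```python
# Prefix table copied from the PREFIX declarations on the slide.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "oo": "http://purl.org/openorg/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name into a full URI (hypothetical helper)."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return prefixes[prefix] + local


print(expand("foaf:Person"))  # http://xmlns.com/foaf/0.1/Person
```

A real engine performs this expansion while parsing the query, so the prefixed and the full form are interchangeable.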
Result Clause
SELECT ?name ?expertise
1. SELECT
2. CONSTRUCT (RDF)
3. ASK
4. DESCRIBE (RDF)
Result Clause 1. SELECT…
SELECT ?name ?expertise
Return all tuples for the bindings of the variables ?name and ?expertise
-----------------------------------------------------------
| “Professor Robert Allen”  | “Control engineering”    |
| “Professor Robert Allen”  | “Biomedical engineering” |
| “Prof Carl Leonetto Amos” |                          |
| “Professor Peter Ashburn” | “Silicon technology”     |
| “Professor Robert Allen”  | “Control engineering”    |
-----------------------------------------------------------
Result Clause 1. SELECT DISTINCT…
SELECT DISTINCT ?name ?expertise
Return all unique tuples for the bindings of the variables ?name and ?expertise
-----------------------------------------------------------
| “Professor Robert Allen”  | “Control engineering”    |
| “Professor Robert Allen”  | “Biomedical engineering” |
| “Prof Carl Leonetto Amos” |                          |
| “Professor Peter Ashburn” | “Silicon technology”     |
-----------------------------------------------------------
The repeated row (“Professor Robert Allen”, “Control engineering”) is returned only once.
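As a rough illustration of what DISTINCT does to a solution sequence, the following Python sketch drops repeated rows while keeping the first occurrence. The `distinct` function is a hypothetical helper; the rows are the ones from the slide, with None standing for an unbound ?expertise.

```python
# Solution rows from the slide: (?name, ?expertise).
rows = [
    ("Professor Robert Allen", "Control engineering"),
    ("Professor Robert Allen", "Biomedical engineering"),
    ("Prof Carl Leonetto Amos", None),  # ?expertise unbound (OPTIONAL)
    ("Professor Peter Ashburn", "Silicon technology"),
    ("Professor Robert Allen", "Control engineering"),  # duplicate row
]


def distinct(solutions):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in solutions:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


unique_rows = distinct(rows)
print(len(rows), len(unique_rows))  # 5 4
```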
Result Clause 2. CONSTRUCT…
CONSTRUCT { ?person foaf:name ?name ; ex:expertise ?expertise . }
Return RDF using bindings for the variables:
ex:RAllen foaf:name “Professor Robert Allen” ;
  ex:expertise “Biomedical engineering” , “Control engineering” .
ex:PAshburn foaf:name “Professor Peter Ashburn” ;
  ex:expertise “Silicon technology” .
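The template instantiation behind CONSTRUCT can be sketched in a few lines of Python. This is only an illustration: `construct` is a hypothetical helper, and ex:CAmos is an assumed URI for the professor without expertise. In SPARQL, a template triple that would contain an unbound variable is left out of the output graph, which the sketch mimics.

```python
# Solutions keyed by variable name (without the leading "?").
solutions = [
    {"person": "ex:RAllen", "name": "Professor Robert Allen",
     "expertise": "Control engineering"},
    {"person": "ex:RAllen", "name": "Professor Robert Allen",
     "expertise": "Biomedical engineering"},
    # ex:CAmos is a hypothetical URI; ?expertise is unbound here.
    {"person": "ex:CAmos", "name": "Prof Carl Leonetto Amos"},
    {"person": "ex:PAshburn", "name": "Professor Peter Ashburn",
     "expertise": "Silicon technology"},
]

# CONSTRUCT { ?person foaf:name ?name ; ex:expertise ?expertise . }
TEMPLATE = [
    ("?person", "foaf:name", "?name"),
    ("?person", "ex:expertise", "?expertise"),
]


def construct(template, solutions):
    """Instantiate the template once per solution; the result is a set of triples."""
    graph = set()
    for sol in solutions:
        for triple in template:
            inst = tuple(
                sol.get(term[1:]) if term.startswith("?") else term
                for term in triple
            )
            if None not in inst:  # skip triples with unbound variables
                graph.add(inst)
    return graph


graph = construct(TEMPLATE, solutions)
for triple in sorted(graph):
    print(triple)
```

Because the output is a set of triples, the repeated foaf:name triple for ex:RAllen appears only once.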
Result Clause 3. ASK…
ASK … WHERE { … }
Are there any results? Returns: true or false
Result Clause 4. DESCRIBE…
DESCRIBE ?person … WHERE { ?person … }
Returns some RDF which “describes” the given resource…
No standard for what to return! Typically returns:
• all triples where the given resource appears as subject and/or object
• OR
• Concise Bounded Descriptions…
Result Clause 4. DESCRIBE(DIRECT)… DESCRIBE ex:RAllen (…can give URIs directly without need for a WHERE clause.) RESULT CLAUSE
Dataset clause (FROM/FROM NAMED)
FROM NAMED <http://data.southampton.ac.uk/>
(Briefly)
• Restrict the dataset against which you wish to query
• SPARQL stores contain named graphs: sets of triples which are associated with (URI) names
• Can match across graphs!
• Named graphs typically correspond with data provenance (i.e., documents)!
• Default graph typically corresponds to the merge of all graphs
• Many engines will typically dereference a graph if not available locally!
Query clause (WHERE)
WHERE {
  ?person foaf:name ?name ; foaf:familyName ?surname .
  ?person rdf:type foaf:Person .
  ?person foaf:title ?title . FILTER regex(?title, "^Prof")
  OPTIONAL {
    ?person oo:availableToCommentOn ?expertiseURI .
    ?expertiseURI rdfs:label ?expertise
  }
}
[Diagram: the pattern matched against the data, with ex:PAshburn (✓), “Professor Peter Ashburn”, “Ashburn”, “Professor” (✓), ex:Silicon and “Silicon technology” as the matched values]
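To make the evaluation of the WHERE clause more concrete, here is a small Python sketch of triple-pattern matching with a FILTER and an OPTIONAL block over a toy graph. It is an assumption-laden illustration, not how any particular engine works: `match` and `run_query` are hypothetical helpers, the data follows the slide's example, and ex:CAmos is an invented URI for a professor without expertise.

```python
import re

# Toy data; ex:CAmos is a hypothetical URI added for illustration.
GRAPH = {
    ("ex:PAshburn", "rdf:type", "foaf:Person"),
    ("ex:PAshburn", "foaf:name", "Professor Peter Ashburn"),
    ("ex:PAshburn", "foaf:familyName", "Ashburn"),
    ("ex:PAshburn", "foaf:title", "Professor"),
    ("ex:PAshburn", "oo:availableToCommentOn", "ex:Silicon"),
    ("ex:Silicon", "rdfs:label", "Silicon technology"),
    ("ex:CAmos", "rdf:type", "foaf:Person"),
    ("ex:CAmos", "foaf:name", "Prof Carl Leonetto Amos"),
    ("ex:CAmos", "foaf:familyName", "Amos"),
    ("ex:CAmos", "foaf:title", "Prof"),
}

REQUIRED = [
    ("?person", "foaf:name", "?name"),
    ("?person", "foaf:familyName", "?surname"),
    ("?person", "rdf:type", "foaf:Person"),
    ("?person", "foaf:title", "?title"),
]
OPTIONAL = [
    ("?person", "oo:availableToCommentOn", "?expertiseURI"),
    ("?expertiseURI", "rdfs:label", "?expertise"),
]


def match(patterns, graph, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        new = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it agrees with an earlier binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, graph, new)


def run_query(graph):
    rows = []
    for sol in match(REQUIRED, graph):
        if not re.search(r"^Prof", sol["?title"]):  # FILTER regex(?title, "^Prof")
            continue
        extended = list(match(OPTIONAL, graph, sol))
        rows.extend(extended or [sol])  # OPTIONAL keeps sol if nothing matches
    rows.sort(key=lambda r: r["?surname"])  # ORDER BY ?surname
    return [(r["?name"], r.get("?expertise")) for r in rows]


results = run_query(GRAPH)
for name, expertise in results:
    print(name, "|", expertise)
```

The professor without an oo:availableToCommentOn triple still appears in the output, with ?expertise left unbound (None).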
Quick mention for UNION
WHERE {
  …
  { ?person oo:availableToCommentOn ?expertiseURI . }
  UNION
  { ?person foaf:interest ?expertiseURI . }
  …
}
• Represents disjunction (OR)
• Useful when there’s more than one property/class that represents the same information you’re interested in (heterogeneity)
• Reasoning can also help, assuming terms are mapped (more later)
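A hedged Python sketch of the idea: each branch of the UNION is matched separately and the solutions are simply concatenated. The helper names and the URI ex:ControlEngineering are assumptions made for this example.

```python
# Toy data using two different properties for the same kind of information.
GRAPH = [
    ("ex:RAllen", "oo:availableToCommentOn", "ex:ControlEngineering"),  # hypothetical URI
    ("ex:PAshburn", "foaf:interest", "ex:Silicon"),
    ("ex:PAshburn", "foaf:name", "Professor Peter Ashburn"),
]


def match_property(graph, prop):
    """Solutions for the pattern: ?person <prop> ?expertiseURI ."""
    return [{"?person": s, "?expertiseURI": o} for s, p, o in graph if p == prop]


def union(*solution_lists):
    """UNION concatenates the solutions of its branches."""
    combined = []
    for sols in solution_lists:
        combined.extend(sols)
    return combined


solutions = union(
    match_property(GRAPH, "oo:availableToCommentOn"),
    match_property(GRAPH, "foaf:interest"),
)
for sol in solutions:
    print(sol["?person"], sol["?expertiseURI"])
```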
Solution Modifiers
• ORDER BY ?surname
  Order output results by surname (as you probably guessed)
…also…
• LIMIT: ORDER BY ?surname LIMIT 10
  Only return 10 results
• OFFSET: ORDER BY ?surname LIMIT 10 OFFSET 20
  Skip the first 20 results and return the next 10 (results 21–30)
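The interplay of ORDER BY, LIMIT and OFFSET can be sketched as sorting followed by slicing. In this minimal Python example, `apply_modifiers` is a hypothetical helper and the surnames are made-up placeholders.

```python
def apply_modifiers(solutions, order_key, limit=None, offset=0):
    """ORDER BY first, then skip `offset` rows, then keep at most `limit` rows."""
    ordered = sorted(solutions, key=order_key)
    end = None if limit is None else offset + limit
    return ordered[offset:end]


# Forty placeholder surnames, deliberately supplied in reverse order.
surnames = [f"Surname{i:02d}" for i in range(1, 41)][::-1]

# ORDER BY ?surname LIMIT 10 OFFSET 20
page = apply_modifiers(surnames, order_key=lambda s: s, limit=10, offset=20)
print(page[0], page[-1], len(page))  # Surname21 Surname30 10
```

Paging of this kind is only stable when an ORDER BY is present; without it, an engine is free to return rows in any order.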
The summary of a typical SPARQL query
• PREFIX DECLARATIONS: Shortcuts for URIs
• RESULT CLAUSE: Which results do you want?
• DATASET CLAUSE: Where should we look?
• QUERY CLAUSE: What are you looking for?
• SOLUTION MODIFIERS: How should results be ordered/split?
Trying out a typical SPARQL query
Give me a list of names of professors in Southampton and their expertise (if available), in order of their surname

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oo: <http://purl.org/openorg/>
SELECT ?name ?expertise
FROM NAMED <http://data.southampton.ac.uk/>
WHERE {
  ?person foaf:name ?name ; foaf:familyName ?surname .
  ?person rdf:type foaf:Person .
  ?person foaf:title ?title . FILTER regex(?title, "^Prof")
  OPTIONAL {
    ?person oo:availableToCommentOn ?expertiseURI .
    ?expertiseURI rdfs:label ?expertise
  }
}
ORDER BY ?surname
List of Public SPARQL Endpoints: SparqlEndpoints (W3C Wiki) http://www.w3.org/wiki/SparqlEndpoints (or just use Google)
Coming Soon: SPARQL 1.1 Currently a W3C Working Draft http://www.w3.org/TR/sparql11-query/ (or just use Google)
Highly recommend checking out: “SPARQL by example” By Cambridge Semantics Lee Feigenbaum & Eric Prud'hommeaux http://www.cambridgesemantics.com/2008/09/sparql-by-example/ (or just use Google)
After the break…
• Session 2: Integrating Web Data with Reasoning
  • Introduction to RDFS/OWL on the Web
  • Introduction and Motivation for Reasoning
During the break…
Question: Find the people who have won both an academy award for best director and a raspberry award for worst director
Endpoint: http://dbpedia.org/sparql/ (that is, if you want to use SPARQL… feel free to use whatever) or http://google.com/ (to make it fair)
Hint: Look at http://dbpedia.org/page/Michael_Bay and http://dbpedia.org/page/Woody_Allen for examples
(The same prefixes therein are understood by the endpoint, …so no need to declare them in the query)
And the answer is…
The Winning (?) Query:
SELECT DISTINCT ?name WHERE {
  ?director dcterms:subject category:Worst_Director_Golden_Raspberry_Award_winners ,
      category:Best_Director_Academy_Award_winners ;
    foaf:name ?name .
}
The Answer: …
PART 1: How can we query Linked Data?
PART 2: How can we reason over Linked Data? …and why?!
… A Web of Data
[Figure: snapshots of the Linked Open Data cloud from August 2007, November 2007, February 2008, March 2008, September 2008, March 2009, July 2009 and September 2010]
Images from: http://richard.cyganiak.de/2007/10/lod/; Cyganiak, Jentzsch
Reasoning: explicit data, implicit data
How can consumers query the implicit data?
…so what’s The Problem?… …heterogeneity …need to integrate data from different sources
Take Query Answering…
Gimme webpages relating to Tim Berners-Lee
timbl:i foaf:page ?pages .
Heterogeneity in schema…
webpage: properties
[Diagram: properties relating resources to webpages, connected by rdfs:subPropertyOf and owl:inverseOf links: mo:musicBrainz, doap:homepage, mo:myspace, …, foaf:homepage, foaf:weblog, foaf:primaryTopic, foaf:isPrimaryTopicOf, foaf:page, foaf:topic]
Linked Data, RDFS and OWL: Linked Vocabularies
[Figure: constellation of linked vocabularies, including … SKOS …]
Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg; Giasson, Bergman
Heterogeneity in naming…
Tim Berners-Lee: URIs
[Diagram: URIs for Tim Berners-Lee connected by owl:sameAs links: dblp:100007, timbl:i, db:Tim-Berners_Lee, identica:45563, …, fb:en.tim_berners-lee, adv:timbl]
Returning to our simple query…
Gimme webpages relating to Tim Berners-Lee
timbl:i foaf:page ?pages .
• Properties (7): mo:myspace, foaf:primaryTopic, foaf:page, foaf:topic, doap:homepage, foaf:homepage, foaf:isPrimaryTopicOf
• URIs (6): identica:45563, adv:timbl, db:Tim-Berners_Lee, dblp:100007, fb:en.tim_berners-lee, timbl:i
...7 x 6 = 42 possible patterns
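The 7 x 6 blow-up can be reproduced with a short Python sketch that enumerates every combination of property and URI. The `expand_patterns` helper is hypothetical, the prefixes are assumed to be declared elsewhere, and the two properties marked as inverse (foaf:topic and foaf:primaryTopic, which FOAF defines as inverses of foaf:page and foaf:isPrimaryTopicOf) are written with the webpage in subject position.

```python
from itertools import product

# The six URIs for Tim Berners-Lee shown on the slide.
URIS = [
    "timbl:i", "dblp:100007", "db:Tim-Berners_Lee",
    "identica:45563", "fb:en.tim_berners-lee", "adv:timbl",
]

# The seven properties on the slide; True marks a property that points
# from the webpage to the person (the inverse direction).
PROPERTIES = {
    "foaf:page": False,
    "foaf:homepage": False,
    "foaf:isPrimaryTopicOf": False,
    "doap:homepage": False,
    "mo:myspace": False,
    "foaf:topic": True,
    "foaf:primaryTopic": True,
}


def expand_patterns(uris, properties):
    """Build one triple pattern per (URI, property) combination."""
    patterns = []
    for uri, (prop, inverse) in product(uris, properties.items()):
        if inverse:
            patterns.append(f"?pages {prop} {uri} .")
        else:
            patterns.append(f"{uri} {prop} ?pages .")
    return patterns


patterns = expand_patterns(URIS, PROPERTIES)
branches = "\n  UNION\n  ".join("{ " + p + " }" for p in patterns)
query = "SELECT DISTINCT ?pages WHERE {\n  " + branches + "\n}"
print(len(patterns))  # 42
```

Writing such a query by hand for every entity is clearly impractical, which is the motivation for letting a reasoner handle the equivalences.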
Challenges… …what (OWL) reasoning is feasible for Linked Data?
Linked Data Reasoning: Challenges Scalable Expressive Domain-Agnostic Robust
Linked Data Reasoning: Challenges • Scalability • At least tens of billions of statements (for the moment) • Near linear scale!!! • Noisy data • Inconsistencies galore • Publishing errors
Linked Data Reasoning: Challenges
Challenges (Semantic Web Wikipedia Article)
Some of the challenges for the Semantic Web include vastness, vagueness, uncertainty, inconsistency and deceit. Automated reasoning systems will have to deal with all of these issues in order to deliver on the promise of the Semantic Web.
• Vastness: The World Wide Web contains at least 48 billion pages as of this writing (August 2, 2009). The SNOMED CT medical terminology ontology contains 370,000 class names, and existing technology has not yet been able to eliminate all semantically duplicated terms. Any automated reasoning system will have to deal with truly huge inputs.
• Vagueness: These are imprecise concepts like "young" or "tall". This arises from the vagueness of user queries, of concepts represented by content providers, of matching query terms to provider terms and of trying to combine different knowledge bases with overlapping but subtly different concepts. Fuzzy logic is the most common technique for dealing with vagueness.
• Uncertainty: These are precise concepts with uncertain values. For example, a patient might present a set of symptoms which correspond to a number of different distinct diagnoses each with a different probability. Probabilistic reasoning techniques are generally employed to address uncertainty.
• Inconsistency: These are logical contradictions which will inevitably arise during the development of large ontologies, and when ontologies from separate sources are combined. Deductive reasoning fails catastrophically when faced with inconsistency, because "anything follows from a contradiction". Defeasible reasoning and paraconsistent reasoning are two techniques which can be employed to deal with inconsistency.
• Deceit: This is when the producer of the information is intentionally misleading the consumer of the information. Cryptography techniques are currently utilized to ameliorate this threat.
Noisy Data: Omnipotent Being
Proposition 1: Web data is noisy.
Proof:
• 08445a31a78661b5c746feff39a9db6e4e2cc5cf
  • sha1-sum of ‘mailto:’
  • common value for foaf:mbox_sha1sum
• An inverse-functional (uniquely identifying) property!!!
• Any person who shares the same value will be considered the same
Q.E.D.
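The effect can be illustrated with Python's standard hashlib module. The sketch below hashes a bare ‘mailto:’ (what a publishing tool would hash when the e-mail field is empty), compares the digest with the value quoted on the slide, and then shows how grouping by foaf:mbox_sha1sum merges unrelated people. The profiles and the `mbox_sha1sum` helper are hypothetical.

```python
import hashlib

# Value quoted on the slide for the sha1-sum of 'mailto:'.
REPORTED = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"


def mbox_sha1sum(mailbox_uri):
    """SHA-1 hex digest of a mailbox URI, as used for foaf:mbox_sha1sum."""
    return hashlib.sha1(mailbox_uri.encode("utf-8")).hexdigest()


empty_digest = mbox_sha1sum("mailto:")
print(empty_digest, empty_digest == REPORTED)

# Hypothetical profiles; two of them left the e-mail field blank.
profiles = {"ex:alice": "", "ex:bob": "", "ex:carol": "carol@example.org"}

# Treating the digest as inverse-functional: same value means same person.
by_digest = {}
for person, email in profiles.items():
    by_digest.setdefault(mbox_sha1sum("mailto:" + email), []).append(person)

merged = [people for people in by_digest.values() if len(people) > 1]
print(merged)  # [['ex:alice', 'ex:bob']]
```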
Noisy Data: Redefining everything …and home in time for tea
Alternate proof (courtesy of http://www.eiao.net/rdf/1.0)
• rdf:type rdf:type owl:Property .
• rdf:type rdfs:label “type”@en .
• rdf:type rdfs:comment “Type of resource” .
• rdf:type rdfs:domain eiao:testRun .
• rdf:type rdfs:domain eiao:pageSurvey .
• rdf:type rdfs:domain eiao:siteSurvey .
• rdf:type rdfs:domain eiao:scenario .
• rdf:type rdfs:domain eiao:rangeLocation .
• rdf:type rdfs:domain eiao:startPointer .
• rdf:type rdfs:domain eiao:endPointer .
• rdf:type rdfs:domain eiao:header .
• rdf:type rdfs:domain eiao:runs .
Inconsistent Data: Cannot compute… foaf:Person owl:disjointWith foaf:Document .
…herein, we look at (monotonic) rules. Expressive reasoning (also) possible through tableaux, but yet to demonstrate desired scale
Rules
IF ⇒ THEN
Body/Antecedent/Condition ⇒ Head/Consequent
?c1 rdfs:subClassOf ?c2 . ?x rdf:type ?c1 . ⇒ ?x rdf:type ?c2 .
• foaf:Person rdfs:subClassOf foaf:Agent . (Schema/Terminology/Ontological)
• timbl:me rdf:type foaf:Person . (Instance/Assertional)
• ⇒ timbl:me rdf:type foaf:Agent .
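The subclass rule can be sketched as a naive forward-chaining loop in Python: apply the rule to all matching pairs of triples and repeat until nothing new is inferred. This is only a toy illustration under the assumption that triples are plain tuples of strings; `apply_subclass_rule` is a hypothetical helper, and scalable reasoners use far more refined strategies.

```python
def apply_subclass_rule(triples):
    """Apply: ?c1 rdfs:subClassOf ?c2 . ?x rdf:type ?c1 . => ?x rdf:type ?c2 .

    The rule is re-applied until a fixpoint is reached.
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub_class = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        types = [(s, o) for s, p, o in inferred if p == "rdf:type"]
        for c1, c2 in sub_class:
            for x, c in types:
                new = (x, "rdf:type", c2)
                if c == c1 and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


facts = {
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),  # schema triple
    ("timbl:me", "rdf:type", "foaf:Person"),           # instance triple
}
closure = apply_subclass_rule(facts)
print(("timbl:me", "rdf:type", "foaf:Agent") in closure)  # True
```

Because the rule only ever adds triples, re-running it on the closure changes nothing, which is what makes such rules monotonic.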