Claudio Gutierrez, Carlos Hurtado, Alberto O. Mendelzon Foundations of Semantic Web Databases
Recall: Semantic Web • The Web is a huge collection of varied, interconnected data that lacks semantics and is therefore understandable only by humans. • Goal: to allow anyone to say anything about anything. • The Semantic Web is based on the idea of adding machine-understandable semantics to web information via annotations, so that machines can perform more of the tedious work involved in finding, sharing and combining information on the web.
Recall: The Relational Model • The rows represent the things you are storing information about. • The columns represent the properties of those things. • The intersection gives the value of that property for that thing.
Recall: RDF [Figure: a statement drawn as a graph: the subject "book" is linked by the property "title" to the value "JavaScript".]
Recall: RDF • Resource Description Framework (RDF). • The RDF model was designed with the following goals: a simple data model, formal semantics and provable inference, an extensible URI-based vocabulary, and allowing anyone to make statements about any resource. • An RDF statement describes any resource that can have a URI through its properties, using a binary predicate that relates it to another resource.
Recall: RDF • RDF statement: (Subject, Predicate, Object), e.g. (http://en.wikipedia.org/wiki/Dan_Brown, http://purl.org/dc/elements/1.1/publisher, "Wikipedia") • Or in XML format: <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/"> <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Dan_Brown"> <dc:publisher>Wikipedia</dc:publisher> … </rdf:Description> </rdf:RDF>
Recall: Ontology and RDFS • RDF by itself cannot express general relations between kinds of objects (e.g. Cat is an Animal, Book has an Author). • RDF Schema (also called the RDFS vocabulary) adds information about the classes and properties of resources and the relations between them.
Recall: RDF Schema • RDFS main constructs: Class, subClassOf, Property, subPropertyOf, Object, Predicate, Subject, Range, Domain, Type, etc. • Examples: A: (John, type, Man); B: (Man, subClassOf, Person); C: (A, Subject, John), a reification, i.e. a statement about statement A. • Enables “duck typing”.
Recall: RDF Query Languages • Given data represented in RDF, a query language (e.g. SPARQL) makes it possible to retrieve and manipulate the data. • As in other query languages, we would like to filter and reorganize the data. Although the data can reside in different databases and be represented in different formats, its semantics is expressed with RDFS and ontologies that are common to all of the data.
The Problem [Figure, built up over three slides: a query (?) is posed to an RDF DB, which returns an answer (!); the same query is then posed to several RDF DBs, each returning its own answer; finally the answers are compared (! = !, ! ≠ !) and combined (∪).]
The Problems • Different representations of the same data (no normal form) and redundancy elimination. • Equivalence (of DBs, queries and answers). • Entailment and containment of queries. • The impact of predefined semantics (RDFS vocabulary), blank nodes, reification and premises on queries. • Complexity issues.
Blank Nodes • A blank node is a resource in an RDF DB (or graph) that is not identified by a URI (Uniform Resource Identifier). • Example: (John, knows, _:p1), (_:p1, birthDate, 04-21) reads “there exists some _:p1 who is known by John and whose date of birth is the 21st of April”. • Enables partial understanding when information is missing. • We will use the letters N, X, Y, … to denote blank nodes.
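RDF Graphs • A triple (Subj, Pred, Obj) is drawn as an edge labeled Pred from the node Subj to the node Obj. • The elements of triples come from UBL (resources), made up of U (URIs), B (blank nodes) and L (literals). • An RDF graph G is a set of triples.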
RDF Graphs • The universe of a graph G, universe(G), is the set of elements of UBL that occur in the triples of G. • The vocabulary of a graph G is the set of elements of UL that occur in the triples of G. • A graph is ground if it has no blank nodes. • The union of G1 and G2 is the union of their sets of triples, denoted G1 ∪ G2. • The merge of G1 and G2 is the union of their sets of triples after their sets of blank nodes have been made disjoint, denoted G1 + G2 (merge is safe).
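The definitions on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a graph is a set of 3-tuples of strings, blank nodes are written with a leading `_:` (as in `_:p1`), and all function names are hypothetical.

```python
def is_blank(term):
    # Assumption: blank nodes are strings that start with "_:".
    return term.startswith("_:")

def universe(graph):
    # All elements of UBL that occur in the triples of the graph.
    return {term for triple in graph for term in triple}

def vocabulary(graph):
    # The elements of UL: the universe without the blank nodes.
    return {term for term in universe(graph) if not is_blank(term)}

def is_ground(graph):
    return not any(is_blank(term) for term in universe(graph))

def union(g1, g2):
    # Plain set union: blank nodes with the same name are identified.
    return set(g1) | set(g2)

def merge(g1, g2):
    # Rename the blank nodes of g2 that clash with g1, then take the union.
    used = universe(g1) | universe(g2)
    rename = {}
    for b in sorted(t for t in universe(g2) if is_blank(t) and t in universe(g1)):
        n = 0
        while f"{b}_{n}" in used:
            n += 1
        rename[b] = f"{b}_{n}"
        used.add(rename[b])
    return set(g1) | {tuple(rename.get(t, t) for t in triple) for triple in g2}

g1 = {("a", "sc", "_:X"), ("_:X", "sc", "c")}
g2 = {("a", "sc", "_:X"), ("_:X", "sc", "d")}
```

With these two sample graphs, `union(g1, g2)` shares the blank node `_:X` between both inputs, while `merge(g1, g2)` keeps the two occurrences apart by renaming.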
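RDF Graphs [Figure: two graphs of sc edges, G1 over the nodes a, X, c and G2 over the nodes a, Y, c, shown together with their union G1 ∪ G2 and their merge G1 + G2, in which the blank nodes X and Y are kept apart.]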
RDFS Vocabulary • Describes properties, such as attributes of resources, and the relationships between them. • Also makes it possible to make statements about statements (reification). • For a given triple N: (a, b, c) that occurs in http://..., the reification consists of (N, type, stat), (N, occurs, http://...), (N, subj, a), (N, pred, b), (N, obj, c).
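A minimal sketch of reification as drawn on this slide, assuming triples are Python tuples. The property names `type`, `stat`, `subj`, `pred`, `obj` and `occurs` are the ones from the slide; the function `reify` and its parameters are hypothetical.

```python
def reify(triple, node, source=None):
    """Return the triples that describe `triple` as a statement named `node`."""
    subj, pred, obj = triple
    described = {
        (node, "type", "stat"),
        (node, "subj", subj),
        (node, "pred", pred),
        (node, "obj", obj),
    }
    if source is not None:
        # Optional provenance: where the statement occurs.
        described.add((node, "occurs", source))
    return described

statement = reify(("a", "b", "c"), "_:N")
```

Because `_:N` is an ordinary node, it can itself appear as the subject or object of further triples, which is what makes statements about statements possible.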
Maps • A map is a function μ: UBL → UBL that preserves URIs and literals. • μ is consistent with a graph G if μ(G) is an RDF graph; μ(G) is then called an instance of G. • An instance is proper if it has fewer blank nodes than G. • Overloading the notation, we write μ: G1 → G2 if there is a map μ such that μ(G1) is a subgraph of G2.
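As a rough illustration, a map can be represented as a Python dict that only mentions the blank nodes it changes. The sketch assumes that URIs and literals stay fixed, that blank nodes are strings starting with `_:`, and it does not check that the result is a well-formed RDF graph; all names are hypothetical.

```python
def is_blank(term):
    return term.startswith("_:")

def apply_map(mu, graph):
    # Only blank nodes may be remapped; every other term is kept as is.
    return {tuple(mu.get(t, t) if is_blank(t) else t for t in triple)
            for triple in graph}

def blank_nodes(graph):
    return {t for triple in graph for t in triple if is_blank(t)}

def is_proper_instance(mu, graph):
    # Proper: the instance has fewer blank nodes than the original graph.
    return len(blank_nodes(apply_map(mu, graph))) < len(blank_nodes(graph))

g = {("_:X", "p", "a"), ("_:Y", "p", "a")}
mu = {"_:Y": "_:X"}
instance = apply_map(mu, g)
```

Here `mu` collapses `_:Y` onto `_:X`, so `instance` is a proper instance of `g` and also a subgraph of it.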
Graph Isomorphism • Two graphs G1 and G2 are isomorphic, denoted G1 ≃ G2, if there are maps μ1 and μ2 such that μ1(G1) = G2 and μ2(G2) = G1.
Graph Isomorphism [Figure: two isomorphic graphs, one over the nodes a, b, c, d, g, h, i, j and one over the nodes 1 to 8, related by ƒ(a) = 1, ƒ(b) = 6, ƒ(c) = 8, ƒ(d) = 3, ƒ(g) = 5, ƒ(h) = 2, ƒ(i) = 4, ƒ(j) = 7.]
Lean Graphs • A graph G is lean if there is no map μ such that μ(G) is a proper subgraph of G. [Figure: two example graphs G1 and G2 over the nodes a, b and the blank nodes X, Y, with edges labeled p, q, r.]
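The definition of leanness can be checked directly by brute force. The following sketch tries every assignment of the blank nodes of G to terms of G. It is exponential in the number of blank nodes and only meant to make the definition concrete, not to be an efficient algorithm; the representation (tuples, `_:` prefix) and the function names are assumptions.

```python
from itertools import product

def is_blank(term):
    return term.startswith("_:")

def terms(graph):
    return {t for triple in graph for t in triple}

def apply_map(mu, graph):
    # mu only has blank nodes as keys, so URIs and literals stay fixed.
    return {tuple(mu.get(t, t) for t in triple) for triple in graph}

def is_lean(graph):
    graph = set(graph)
    blanks = sorted(t for t in terms(graph) if is_blank(t))
    targets = sorted(terms(graph))
    for image in product(targets, repeat=len(blanks)):
        mu = dict(zip(blanks, image))
        if apply_map(mu, graph) < graph:  # proper subgraph found
            return False
    return True

redundant = {("a", "p", "_:X"), ("a", "p", "b")}
compact = {("a", "p", "_:X"), ("_:X", "q", "b")}
```

In `redundant`, mapping `_:X` to `b` yields the proper subgraph `{("a", "p", "b")}`, so the graph is not lean; no such map exists for `compact`.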
Core • Theorem: Each RDF graph G contains a unique (up to isomorphism) lean subgraph that is an instance of G. We denote this subgraph by core(G). • Theorem: Deciding whether G is lean is coNP-complete (reduction to tautology). Deciding whether G’ ≃ core(G) is DP-complete.
Graph Interpretation • An interpretation I of an RDF graph G consists of: • A non-empty set of resources Res. • The literals, a subset Lit ⊆ Res. • A set of binary properties Prop ⊆ Res × Res. • A mapping from the vocabulary of G: U → Res ∪ Prop and L → Lit.
Entailment & Equivalence • An RDF graph G1 entails G2, denoted G1 |= G2, iff every interpretation over the vocabulary of G1 ∪ G2 that satisfies G1 also satisfies G2. • Two graphs are equivalent, denoted G1 ≡ G2, if G1 |= G2 and G2 |= G1.
Semantics of Simple RDF Graphs • A simple RDF graph is a graph that does not use vocabulary with predefined semantics. • Theorem: A simple RDF graph G1 entails G2, denoted G1 |= G2, if and only if there is a map G2 → G1. • A graph entails any of its subgraphs. [Figure: the path b, a, c with edges labeled p and q entails (|=) the same path with a replaced by the blank node X.]
Semantics of Simple RDF Graphs • Theorem: Deciding entailment of simple RDF graphs is NP-complete. Deciding equivalence of simple RDF graphs is isomorphism-complete. • Both depend heavily on the number of blank nodes: the test can be done in O(v^n), where v is the number of nodes and n the number of blank nodes. • Theorem: If G is simple, then core(G) is the unique minimal graph equivalent to G.
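The map characterization of simple entailment suggests a direct, if naive, test: search for an assignment of the blank nodes of G2 to terms of G1 that sends every triple of G2 into G1. The sketch below does this exhaustively, so its cost grows exponentially with the number of blank nodes in G2. The graph encoding, the edge directions in the sample graphs and the function names are assumptions made for illustration.

```python
from itertools import product

def simple_entails(g1, g2):
    """True if some map sends g2 into g1 (blank nodes of g2 may move)."""
    g1 = set(g1)
    blanks = sorted({t for triple in g2 for t in triple if t.startswith("_:")})
    targets = sorted({t for triple in g1 for t in triple})
    for image in product(targets, repeat=len(blanks)):
        mu = dict(zip(blanks, image))
        mapped = {tuple(mu.get(t, t) for t in triple) for triple in g2}
        if mapped <= g1:
            return True
    return False

def simple_equivalent(g1, g2):
    return simple_entails(g1, g2) and simple_entails(g2, g1)

g1 = {("b", "p", "a"), ("a", "q", "c")}
g2 = {("b", "p", "_:X"), ("_:X", "q", "c")}
```

Here `g1` entails `g2` through the map that sends `_:X` to `a`, but not the other way round, since the ground triples of `g1` do not occur in `g2`.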
Semantics of RDF Graphs with RDFS Vocabulary • The following deductive system is sound & complete:
Semantics of RDF Graphs with RDFS Vocabulary • Theorem: G1 |= G2 if and only if there is a sequence of operations of the deductive system that starts from G1 and ends with G2. Deciding this is NP-complete. • With RDFS vocabulary there may be no map G2 → G1 even though G1 |= G2. • The idea is to “close” the graph with all possible triples. [Figure: two graphs G1 and G2 of sc edges over the nodes a, b, c, d, one of them also containing a blank node X.]
Closure • A closure of a graph G is a maximal set of triples G’ over universe(G) plus the RDFS vocabulary such that G’ contains G and is equivalent to it. • There can be more than one closure for a graph. • The closure may contain redundancies. • The problem of deciding whether G’ is the closure of G is DP-complete. [Figure: example graph over the nodes a, b, c, d and the blank node X, with edges labeled p, q, r.]
Normal Form • A normal form of a graph G, denoted nf(G), is core(G’) for a closure G’ of G. • Theorem: Let G be an RDF graph: • The normal form nf(G) is unique. • G1 |= G2 if and only if there is a map nf(G2) → nf(G1). • G1 ≡ G2 if and only if nf(G1) ≃ nf(G2). • The problem of deciding whether G’ is the normal form of G is DP-complete.
Normal Form • nf is not the most compact representation. [Figure: graphs G1 and G2 of sc edges over the nodes a, b, c, d (one of them with a blank node X) and their normal form nf(Gi).]
Query Language • The RDF database is an RDF graph. • Let V be the set of variables, written ?X, ?Y, … • The query form is Datalog-like, H ← B, where H and B contain variables, e.g. (?X, ancestor, ?Y) ← (?X, ancestor, ?Z), (?Z, ancestor, ?Y) • The condition var(H) ⊆ var(B) avoids the presence of free variables in the head of the query. • A blank node in the body plays the same role as a variable and is therefore unnecessary.
Query Language • A query can have a set of premises P and a set of constraints C, so a query is a tuple (H, B, P, C). • The set of constraints C lets the user discriminate between blank and ground nodes in the answer. • The premise P represents information the user supplies to the database in order to answer the query, e.g. to query incomplete information by supplying information not in the DB, or to add semantic information such as (son, sp, relative).
Answer to a Query • Let q = (H, B, P, C) be a query, D a database and V the set of variables. • A valuation is a function v: V → UBL defined on all variables x in B. It satisfies the constraints, v |= C, if v(x) is not a blank node for every variable x in C. • The pre-answer to q over D is the set of single answers v(H): preans(q,D) = {v(H) : v(B) ⊆ nf(D+P) and v |= C}
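The pre-answer definition can be turned into a small backtracking matcher. The sketch below makes one loud simplification: instead of the normal form nf(D+P) it uses the plain union of D and P, which is only reasonable for simple graphs whose blank nodes do not clash. Variables are strings starting with `?`, blank nodes start with `_:`, the constraint set is a set of variable names, and every function name and sample resource (`ann`, `bob`, `dan`) is hypothetical.

```python
def is_var(term):
    return term.startswith("?")

def is_blank(term):
    return term.startswith("_:")

def unify(pattern, triple, valuation):
    """Extend the valuation so that pattern matches triple, else None."""
    v = dict(valuation)
    for pat, val in zip(pattern, triple):
        if is_var(pat):
            if v.setdefault(pat, val) != val:
                return None
        elif pat != val:
            return None
    return v

def valuations(body, data, valuation=None):
    """Yield every valuation v with v(body) contained in data."""
    valuation = valuation or {}
    if not body:
        yield valuation
        return
    for triple in data:
        v = unify(body[0], triple, valuation)
        if v is not None:
            yield from valuations(body[1:], data, v)

def preans(head, body, premise, constraints, database):
    data = set(database) | set(premise)  # stand-in for nf(D + P)
    answers = []
    for v in valuations(list(body), data):
        if any(x in v and is_blank(v[x]) for x in constraints):
            continue  # v does not satisfy C
        single = frozenset(tuple(v.get(t, t) for t in triple) for triple in head)
        if single not in answers:
            answers.append(single)
    return answers

head = [("?X", "ancestor", "?Y")]
body = [("?X", "ancestor", "?Z"), ("?Z", "ancestor", "?Y")]
db = {("ann", "ancestor", "bob"), ("bob", "ancestor", "_:c")}
```

Calling `preans(head, body, set(), set(), db)` yields the single answer `(ann, ancestor, _:c)`; adding `"?Y"` to the constraints removes it because `?Y` would be bound to a blank node, and supplying a premise such as `{("_:c", "ancestor", "dan")}` lets the query use information that is not stored in the database.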
Answer to a Query • Single answers are combined into one answer graph, which makes it possible to compose a complex query from simpler ones. • ans∪(q,D) is the union of all single answers (blank nodes play the role of bridges between two single answers). • ans+(q,D) is the merge of all single answers (blank nodes are renamed to avoid name clashes). Useful when querying several sources. • Let q be a query: • If D’ |= D then ans(q,D’) |= ans(q,D). • For all D, ans∪(q,D) |= ans+(q,D) (the converse is not true).
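The difference between the two ways of combining single answers can be illustrated as follows. The sketch assumes each single answer is a set of triples and that blank nodes start with `_:`; the renaming scheme (appending the index of the single answer) and the function names are illustrative choices, not taken from the source.

```python
def ans_union(single_answers):
    # Blank nodes with the same name are shared across single answers.
    combined = set()
    for answer in single_answers:
        combined |= set(answer)
    return combined

def ans_merge(single_answers):
    # Blank nodes are renamed per single answer, so none are shared.
    combined = set()
    for i, answer in enumerate(single_answers):
        combined |= {
            tuple(f"{t}_{i}" if t.startswith("_:") else t for t in triple)
            for triple in answer
        }
    return combined

def blank_nodes(graph):
    return {t for triple in graph for t in triple if t.startswith("_:")}

singles = [{("a", "p", "_:X")}, {("_:X", "q", "b")}]
```

In `ans_union(singles)` the node `_:X` bridges the two triples, whereas `ans_merge(singles)` contains two unrelated blank nodes; the union therefore carries at least as much information as the merge.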
Reification • The ability to identify RDF statements. • By having a blank node in the head of the query, one can identify a statement: (N, value, true), (N, type, stat), (N, subj, ?X), (N, pred, ?Y), (N, obj, ?Z) ← (?X, ?Y, ?Z) • Can cause an infinite DB: if statement i1: (a, b, c) is valid, then so is statement i2: (i1, subj, a), and so is the statement (i2, subj, i1), and so on.
Query Containment • Exploring different notions of query containment. • In relational databases, set-theoretical inclusion of tuples captures this requirement. • Let q and q’ be queries. For all databases D: • q ⊆p q’ iff preans(q,D) ⊆ preans(q’,D) up to isomorphism. • q ⊆m q’ iff ans(q’,D) |= ans(q,D). • For queries q and q’, q ⊆p q’ implies q ⊆m q’. The converse is not true. • Theorem: Deciding each of them is NP-complete.
Query Containment • For example: H = B = (X, sc, Y), (Y, sc, Z) and H’ = B’ = (X, sc, Y), (Y, sc, Z), (X, sc, Z). Then q’ ⊆m q and q ⊆m q’ both hold, but neither q’ ⊆p q nor q ⊆p q’ holds.
Query Containment • Consider the queries q = (H, B, P, C) and q’ = (H’, B’, P’, C’), and assume H, H’, B, B’, P, P’ are simple graphs. • Theorem: q ⊆p q’ if and only if for each map μ on the variables of B there is a substitution Θμ (of variables and blank nodes) such that: • Θμ(B’) ⊆ P’ + (B − μ(B,P)), where μ(B,P) is the set of triples t of B such that μ(t) ∈ P. • Θμ(H’) = H. • Θμ(C’) ⊆ C.
Query Containment • Consider the queries q = (H, B, P, C) and q’ = (H’, B’, P’, C’), and assume H, H’, B, B’, P, P’ are simple graphs. • Theorem: q ⊆m q’ if and only if there are substitutions (of variables) Θ1, …, Θn such that: • Θj(B’) ⊆ nf(B). • ∪j Θj(H’) |= H. • Θj(C’) ⊆ C.
Complexity of Query Answering • We consider the evaluation problem of testing emptiness of the query answer set, in two versions: • Query complexity version: for a fixed database D, given a query q, is q(D) non-empty? NP-complete. • Data complexity version: for a fixed query q, given a database D, is q(D) non-empty? Polynomial. • The size of the answer set is bounded by |D|^|q|.
Redundancy Elimination – In Graphs • A reduction of a graph G is a minimal graph Gr equivalent to G and contained in G. • Algorithm computing the reduction of a graph G: • G ← nf(G). • Apply reverse rules 7), 8), 9), 4), and 3) and 6) in this order until no longer applicable. • Apply any reverse rule in any order until no longer applicable. • Theorem: The problem of deciding whether G’ is the reduction of G is DP-complete.
Redundancy Elimination – In Queries • Redundancy in query answers can be reduced by using lean query heads. • Requiring a lean query body is not always possible, and may cause answers to be missed. • Even lean databases and queries with lean heads and bodies do not avoid redundancies. For example, G1 is the answer to the query (?Z, p, ?U) ← (?Z, p, ?U) on G2. [Figure: graphs G1 and G2 over the nodes a, b and the blank nodes X, Y, with edges labeled p, q, r.]
Redundancy Elimination – In Queries • The naive approach to eliminate redundancy in answers is to compute: • ans(q,D), and • a lean equivalent to ans(q,D). • Theorem: Given a lean database D and a query q, deciding whether ans∪(q,D) is lean is coNP-complete (in the size of D). • Theorem: Given a lean database D and a query q, deciding whether ans+(q,D) is lean can be done in polynomial time in the size of D.
Contributions • Normal form. • A formal definition of query language for RDF and its main features. • Query containment and processing. • Redundancy elimination. • From entailment to mapping between graphs. • Complexity issues.
References • Claudio Gutierrez, Carlos Hurtado, Alberto O. Mendelzon: Foundations of Semantic Web Databases (2004). • W3C: RDF Semantics, Working Draft (2003). • Vadim Eisenberg: Composing Web Services on the Semantic Web. • Special thanks to Google and Wikipedia.