Foundations of Semantic Web Databases: Formal RDF Query Language Study

Foundations of Semantic Web Databases Gutierrez, Hurtado and Mendelzon Presented by: Nir Zepkowitz

Background • The web is a huge collection of interconnected data. • The web lacks semantic information, so managing and processing the data is hard. • Semantic web – a proposal to build an infrastructure of machine-readable semantics for the data on the web.

Semantic web • "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila • "If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database." Tim Berners-Lee, Weaving the Web, 1999

Background (cont) • In 1998 the W3C offered the language that will be the basis for that infrastructure – the Resource Description Framework (RDF). • RDF is being implemented in many world-wide initiatives and gets a lot of attention.

Why are we here? • Query languages for RDF were developed side by side with RDF. • Little research about the foundations of RDF and its query languages. • This research is necessary because of the new features that arise in querying RDF graphs (as opposed to standard DB).

Some problems • The RDF data model allows several representations for the same information. • Is there a normal form? Is there a way to check equivalence?

What are we going to do? • Study formal aspects of querying DBs containing RDF data. • Introduce a new notion of normal form for RDF graphs. • Give a formal definition of a query language for RDF. • Investigate theoretical and complexity aspects related to query processing and redundancy.

The RDF model • U – RDF URI references. • B – blank nodes. • L – RDF literals. • UBL denotes the union U ∪ B ∪ L. • RDF triple: (v1, v2, v3) ∈ (U ∪ B) × U × (U ∪ B ∪ L). • v1 – subject, v2 – predicate, v3 – object.

Definitions • A graph is a set of triples. • Universe(G) – the set of UBL elements that appear in a triple of G. • Vocabulary of G – voc(G) = universe(G) ∩ (U ∪ L). • A graph is ground if it has no blank nodes.
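
The definitions above can be sketched in a few lines of Python. This is only an illustration: the string encoding of terms (blank nodes written with a "_:" prefix) and the example triples are assumptions, and the vocabulary is taken to be the non-blank part of the universe.

```python
# Assumption for this sketch: a term is a plain string, and blank nodes are
# the strings that start with "_:" (everything else is a URI or a literal).
def is_blank(term):
    return term.startswith("_:")

def universe(graph):
    """All terms that appear in some triple of the graph."""
    return {term for triple in graph for term in triple}

def vocabulary(graph):
    """The non-blank part of the universe (URIs and literals)."""
    return {term for term in universe(graph) if not is_blank(term)}

def is_ground(graph):
    """A graph is ground if it contains no blank nodes."""
    return not any(is_blank(term) for term in universe(graph))

# Hypothetical example data: "_:X" stands for an unnamed painter.
G = {("Picasso", "paints", "Guernica"), ("_:X", "paints", "Guernica")}
print(sorted(universe(G)))
print(sorted(vocabulary(G)))
print(is_ground(G))
```

A graph is just a Python set of 3-tuples here, which matches "a graph is a set of triples" directly and makes union and sub-graph tests one-liners.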

Definitions • Map: a function μ: UBL -> UBL preserving URIs and literals (μ(u) = u for every u in U ∪ L). • μ(G) – the set of all triples (μ(s), μ(p), μ(o)) s.t. (s,p,o) is in G. • μ is consistent with G if μ(G) is an RDF graph. • In this case we call μ(G) an instance of G. • An instance is proper if μ(G) has fewer blank nodes than G.
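
A minimal sketch of applying a map to a graph, assuming blank nodes are strings prefixed with "_:" and a map is given as a dict on blank nodes only. The consistency check is left out because this encoding does not distinguish literals from URIs; the names and data are hypothetical.

```python
def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")

def blanks(graph):
    return {t for triple in graph for t in triple if is_blank(t)}

def apply_map(mu, graph):
    """Return mu(G). mu is a dict on blank nodes; unlisted blanks and all
    non-blank terms map to themselves, so URIs and literals are preserved."""
    return {tuple(mu.get(t, t) if is_blank(t) else t for t in triple)
            for triple in graph}

def is_proper_instance(mu, graph):
    """True if mu(G) has fewer blank nodes than G."""
    return len(blanks(apply_map(mu, graph))) < len(blanks(graph))

# Hypothetical data: two unnamed painters of Guernica.
G = {("_:X", "paints", "Guernica"), ("_:Y", "paints", "Guernica")}
mu = {"_:X": "Picasso"}          # bind one blank node to a URI
print(apply_map(mu, G))
print(is_proper_instance(mu, G))
```

Mapping "_:X" to "_:Y" instead would collapse the two triples into one, which is also a proper instance.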

Definitions • G1, G2 are isomorphic (G1 ≅ G2) if there are maps μ1, μ2 s.t. μ1(G1)=G2 and μ2(G2)=G1. • Union of graphs (G1UG2) is the union of their triples. • Merge of graphs (G1+G2) is G1UG2’, where G2’ is isomorphic to G2 and its blank nodes are disjoint from those of G1 (used when there is no relation between the graphs).
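
The difference between union and merge shows up only when the two graphs share blank-node names. The sketch below assumes blank nodes are strings prefixed with "_:" and picks fresh names of the form "_:b0", "_:b1", ... (a hypothetical naming scheme) for the clashing ones.

```python
import itertools

def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")

def blanks(graph):
    return {t for triple in graph for t in triple if is_blank(t)}

def merge(g1, g2):
    """G1 + G2: rename the blank nodes of g2 that clash with g1, then union."""
    taken = blanks(g1) | blanks(g2)
    fresh = (f"_:b{i}" for i in itertools.count())
    renaming = {}
    for b in sorted(blanks(g1) & blanks(g2)):
        new = next(name for name in fresh if name not in taken)
        taken.add(new)
        renaming[b] = new
    g2_renamed = {tuple(renaming.get(t, t) for t in triple) for triple in g2}
    return g1 | g2_renamed

# Hypothetical data: both graphs happen to use the blank-node name "_:X".
G1 = {("_:X", "paints", "Guernica")}
G2 = {("_:X", "paints", "Zapata")}
print(G1 | G2)        # union: one unknown painter painted both works
print(merge(G1, G2))  # merge: two unrelated unknown painters
```

The renaming is injective and leaves non-clashing blank nodes alone, so the renamed copy is isomorphic to G2.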

RDFS • Extended version of RDF. • Defines classes and properties that may be used for describing groups of resources and relationships between resources. • Supports: reification (making statements about statements), typing and inheritance. The triple (a,b,c) occurs in page http:

Lean graphs • G is lean if there is no map μ s.t. μ(G) is a proper sub-graph of G. [Figure: two example graphs; G1 is not lean, G2 is lean.]

Core(G) • Theorem: each RDF graph G contains a unique (up to isomorphism) lean sub-graph which is an instance of G. • We will denote this unique sub-graph: core(G).
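
Leanness and the core can be checked by brute force on small graphs. The sketch below is not an algorithm from the paper: it simply tries every assignment of blank nodes to terms of the graph, which is exponential in the number of blank nodes. Blank nodes are assumed to be strings prefixed with "_:" and the example graphs are hypothetical.

```python
from itertools import product

def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")

def apply_map(mu, graph):
    return {tuple(mu.get(t, t) for t in triple) for triple in graph}

def find_smaller_image(graph):
    """Return some mu(G) that is a proper sub-graph of G, or None."""
    terms = sorted({t for triple in graph for t in triple})
    bs = [t for t in terms if is_blank(t)]
    for images in product(terms, repeat=len(bs)):
        image = apply_map(dict(zip(bs, images)), graph)
        if image < graph:            # proper subset
            return image
    return None

def is_lean(graph):
    return find_smaller_image(graph) is None

def core(graph):
    """Shrink the graph until no map sends it to a proper sub-graph."""
    while (smaller := find_smaller_image(graph)) is not None:
        graph = smaller
    return graph

G1 = {("a", "p", "_:X"), ("a", "p", "_:Y")}   # _:Y can be folded onto _:X
G2 = {("a", "p", "_:X"), ("_:X", "q", "b")}   # nothing can be folded
print(is_lean(G1), is_lean(G2))
print(core(G1))
```

Because each step replaces the graph by the image of a map, the final result is still an instance of the original graph, and it is lean by construction.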

Semantics of RDF graphs • Theorem 3: Let G1, G2 be simple graphs (graphs that do not use vocabulary with predefined semantics). G1 entails G2 (G1╞ G2) iff there is a map G2->G1 (that is, a map μ s.t. μ(G2) is a sub-graph of G1).

[Figure: example graph.] It does not follow that Rivera painted Zapata. It does follow that there is a cubist who painted Guernica.
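
For simple graphs, the map characterization of entailment can be tested directly by searching for a map from G2 into G1. The sketch below is a brute-force search, not an efficient procedure; the "_:" prefix for blank nodes and the example triples are assumptions made for illustration (the slide's original figure is not reproduced here).

```python
from itertools import product

def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")

def simple_entails(g1, g2):
    """True if some map mu sends g2 into g1, i.e. mu(g2) is a sub-graph of g1."""
    bs = sorted({t for triple in g2 for t in triple if is_blank(t)})
    targets = sorted({t for triple in g1 for t in triple})
    for images in product(targets, repeat=len(bs)):
        mu = dict(zip(bs, images))
        image = {tuple(mu.get(t, t) for t in triple) for triple in g2}
        if image <= g1:
            return True
    return False

# Hypothetical database in the spirit of the slide's example.
D = {
    ("Picasso", "type", "Cubist"),
    ("Picasso", "paints", "Guernica"),
    ("Rivera", "paints", "_:W"),
}
some_cubist = {("_:X", "type", "Cubist"), ("_:X", "paints", "Guernica")}
rivera_zapata = {("Rivera", "paints", "Zapata")}
print(simple_entails(D, some_cubist))    # True: map _:X to Picasso
print(simple_entails(D, rivera_zapata))  # False: _:W is not known to be Zapata
```

Only the blank nodes of G2 are mapped; the blank node "_:W" in D stays an unknown, which is why the second entailment fails.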

Equivalence • G1 and G2 are equivalent (G1≡G2) if G1╞ G2 and G2╞ G1. • Theorem: if G is simple, then core(G) is the unique (up to isomorphism) minimal (w.r.t. number of triples) graph equivalent to G.

There is a sound and complete set of rules for ╞ in graphs with RDFS-vocabulary. • For example (sc = sub-class): (a,sc,b), (b,sc,c) -> (a,sc,c). • In non-simple graphs we cannot use Theorem 3 because of issues like transitivity.

The two are equivalent, but there is no mapping G1->G2

Closure • To avoid the problem we will “close” the graph with all possible triples that are entailed by the existing ones. • A closure of G is a maximal set of triples G’ over universe(G) plus the rdfs-vocabulary s.t. G’ contains G and is equivalent to G.

RDFS-closure • Closure of G under the set of RDFS rules. • By using this definition we can prove that: G1╞ G2 iff there is a map from G2 to the RDFS closure of G1.
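
Closing a graph under a rule set is a fixed-point computation. The sketch below applies only the sub-class transitivity rule mentioned earlier, (a,sc,b), (b,sc,c) -> (a,sc,c); the full set of RDFS rules is larger and is not reproduced here. The class names are hypothetical.

```python
def close_under_sc_transitivity(graph):
    """Add (a, sc, c) whenever (a, sc, b) and (b, sc, c) are present,
    repeating until no new triple can be derived."""
    closed = set(graph)
    changed = True
    while changed:
        changed = False
        sc_pairs = [(s, o) for (s, p, o) in closed if p == "sc"]
        for a, b in sc_pairs:
            for b2, c in sc_pairs:
                if b == b2 and (a, "sc", c) not in closed:
                    closed.add((a, "sc", c))
                    changed = True
    return closed

# Hypothetical class hierarchy.
G = {
    ("Cubist", "sc", "Painter"),
    ("Painter", "sc", "Artist"),
    ("Artist", "sc", "Person"),
}
for triple in sorted(close_under_sc_transitivity(G) - G):
    print(triple)   # the three derived sub-class triples
```

After closing G1 this way, a map search from G2 into the closed graph finds entailments that a search into G1 alone would miss.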

Redundancy • From the data representation point of view, “closure” and “RDFS-closure” may have redundancies. They are not the best choice to work with. • The operator core does not eliminate all redundant triples.

Normal form • G’s normal form (nf(G)) is core(G’), where G’ is the closure of G. • If G is an RDF graph then: • nf(G) is unique. • G1╞ G2 iff nf(G2)->nf(G1). • G1≡G2 iff nf(G1) ≅ nf(G2).

Redundancy • Normal forms are not the most compact representation. • A reduction of a graph G is a minimal graph Gr equivalent to G and contained in G. • The authors of the article present an algorithm to compute the reduction of a graph. • The basic idea is to delete triples that can be deduced by RDFS rules, e.g. (a,sc,b), (b,sc,c) -> (a,sc,c).

Querying RDF Databases • An RDF graph can be viewed as a standard relational database. • Each tuple in the table is a triple with the attributes: subject, predicate and object.

Query language • Variables (disjoint from UBL) will be denoted ?X, ?Y, ?person. • The query language will be similar to Datalog: • (?A,creates,?Y) <- (?A,type,Flemish), (?A,paints,?Y), (?Y,exhibited,Gordon) • “Find the artifacts created by Flemish artists that are exhibited in the Gordon gallery”.

Tableau (H<-B) • A tableau is a pair (H,B). • H and B are RDF graphs in which some elements of UBL are replaced by variables in V. • All variables in H occur in B.

Query • A query is a tableau (H,B) plus a set of premises P and a set of constraints C. • P is a graph over UBL. • C is a subset of the variables occurring in H. • We can think of a query as the tuple (H,B,P,C). • When P or C is omitted, it is assumed to be empty (∅).

Constraints • Constraints let us discriminate between blank and ground nodes in an answer (similar to IS NOT NULL). • Adding the constraint {?A} means that the variable ?A must be bound to a non-blank element in each answer to the query.

Premises • The premise represents information that the user supplies to the database in order to answer the query. • (?X,relative,Peter) <- (?X,relative,Peter) • P={(son,sp,relative)} • All relatives of Peter knowing that “son” is a sub-property of “relative”.
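
The effect of the premise in this example can be sketched by adding P to the data and applying the standard RDFS sub-property rule, (a,sp,b), (x,a,y) -> (x,b,y), before looking for matches. The database contents below are hypothetical, and only this one rule is applied rather than the full RDFS closure.

```python
def apply_sp_rule(graph):
    """Close the graph under (a, sp, b), (x, a, y) -> (x, b, y)."""
    closed = set(graph)
    changed = True
    while changed:
        changed = False
        for (a, p, b) in list(closed):
            if p != "sp":
                continue
            for (x, q, y) in list(closed):
                if q == a and (x, b, y) not in closed:
                    closed.add((x, b, y))
                    changed = True
    return closed

def relatives_of_peter(graph):
    """Answers to (?X, relative, Peter) <- (?X, relative, Peter)."""
    return {s for (s, p, o) in graph if p == "relative" and o == "Peter"}

# Hypothetical database: the DB itself does not know that sons are relatives.
D = {("John", "son", "Peter"), ("Mary", "relative", "Peter")}
P = {("son", "sp", "relative")}

print(relatives_of_peter(apply_sp_rule(D)))      # without the premise
print(relatives_of_peter(apply_sp_rule(D | P)))  # with the premise
```

Without the premise only Mary is returned; with it, John is derived as a relative as well, without modifying the stored database.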

Premises (cont) • Allow hypothetical analysis. • Premises are fixed throughout the query. • Premises may contain blank nodes but no variables.

Answering a query • Valuation is a function: V->UBL. • For a set C of variables, the valuation v satisfies the constraint C, if for all x in C v(x) is not blank. • v(B) is the graph obtained after replacing every occurrence of a variable x in B with v(x).

Matching • A matching of a graph B in a DB D is a valuation v s.t. v(B) ⊆ nf(D). • The matchings that interest us are the ones that satisfy C.

Single answer • Let q = (H,B,P,C) be a query and D a DB. • The pre-answer of q over D is: • preans(q,D)={v(H) : v is a matching of B in D+P and v satisfies C}. • A graph v(H) in preans(q,D) is called a single answer of query q over D.
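
A minimal sketch of computing pre-answers by backtracking over the body triples. It makes two simplifying assumptions that are not in the slides: matchings are taken directly in the given data set instead of its normal form, and premises are assumed to have been added to the data already. Variables are strings starting with "?", blank nodes strings starting with "_:", and the data is hypothetical.

```python
def is_var(term):
    return term.startswith("?")

def is_blank(term):
    return term.startswith("_:")

def matchings(body, data):
    """Yield every valuation v (a dict) such that v(body) is contained in data."""
    def extend(patterns, v):
        if not patterns:
            yield dict(v)
            return
        first, rest = patterns[0], patterns[1:]
        for triple in data:
            v2 = dict(v)
            # A variable must bind consistently; a constant must match exactly.
            if all((v2.setdefault(p, t) == t) if is_var(p) else (p == t)
                   for p, t in zip(first, triple)):
                yield from extend(rest, v2)
    yield from extend(list(body), {})

def preans(head, body, data, constraints=()):
    """Set of single answers v(H), keeping only valuations that satisfy C."""
    return {
        frozenset(tuple(v.get(t, t) for t in triple) for triple in head)
        for v in matchings(body, data)
        if all(not is_blank(v[x]) for x in constraints)
    }

# Hypothetical data for the Flemish-artists query from the earlier slide.
D = {
    ("Rubens", "type", "Flemish"), ("Rubens", "paints", "Venus"),
    ("Venus", "exhibited", "Gordon"),
    ("_:A", "type", "Flemish"), ("_:A", "paints", "Portrait"),
    ("Portrait", "exhibited", "Gordon"),
}
H = {("?A", "creates", "?Y")}
B = {("?A", "type", "Flemish"), ("?A", "paints", "?Y"), ("?Y", "exhibited", "Gordon")}

print(preans(H, B, D))                      # two single answers
print(preans(H, B, D, constraints={"?A"}))  # the anonymous artist is filtered out
```

Each single answer is a small graph (here a frozenset with one triple), so the result is a set of graphs rather than a flat set of triples.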

Complex queries • We would like complex queries to be composed from simple ones, so the single answers are combined into one graph: • ans∪(q,D) – (union) good when we want blank nodes to play the role of bridges between two queries. • ans+(q,D) – (merge) renaming blank nodes to avoid name clashes. Good when querying several unrelated DBs.

Reification • We allow blank nodes in the head of queries. • The main motivation is the reification vocabulary. • In the RDF semantics a statement does not have an identifier. • To refer to a statement we must give it a name (a blank node) – the reification process.

Reification • It allows us to say something about statements. • (N,value,true), (N,type,stat), (N,subj,?X), (N,pred,?Y), (N,obj,?Z) <- (?X,?Y,?Z). • If the DB is Britannica then: “all statements made by Encyclopedia Britannica are true”.
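
The blank node N in the head behaves differently under the two ways of combining single answers (union and merge). The sketch below evaluates the reification query over a tiny hypothetical database, ignoring normal forms; the "_:" prefix for blank nodes and the "_0", "_1", ... renaming suffixes are assumptions of the sketch.

```python
def is_blank(term):
    return term.startswith("_:")

# Head of the reification query; "_:N" is the blank node naming the statement.
HEAD = [("_:N", "value", "true"), ("_:N", "type", "stat"),
        ("_:N", "subj", "?X"), ("_:N", "pred", "?Y"), ("_:N", "obj", "?Z")]

def single_answers(database):
    """One single answer per triple, since the body (?X,?Y,?Z) matches every triple."""
    answers = []
    for (s, p, o) in sorted(database):
        v = {"?X": s, "?Y": p, "?Z": o}
        answers.append({tuple(v.get(t, t) for t in triple) for triple in HEAD})
    return answers

def ans_union(answers):
    """Union: blank nodes keep their names, so all answers share _:N."""
    return set().union(*answers)

def ans_merge(answers):
    """Merge: blank nodes of the i-th answer get the suffix _i, so they stay apart."""
    merged = set()
    for i, answer in enumerate(answers):
        merged |= {tuple(f"{t}_{i}" if is_blank(t) else t for t in triple)
                   for triple in answer}
    return merged

def blanks(graph):
    return {t for triple in graph for t in triple if is_blank(t)}

D = {("Picasso", "paints", "Guernica"), ("Rivera", "paints", "Zapata")}
answers = single_answers(D)
print(len(blanks(ans_union(answers))))  # 1: both statements collapse onto _:N
print(len(blanks(ans_merge(answers))))  # 2: one statement node per triple
```

Under union the single node N ends up with two subjects and two objects, so for this query the merge semantics is the one that keeps each reified statement separate.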

Reification - Problem • By the RDF specification, an RDF graph (DB) is a finite set of objects. • Answers to queries are also finite sets. • If a triple itself were an object i1, then having (a,b,c) in the DB would imply (i1,subj,a)…… • We would get infinite sets.

Query complexity • We consider a simpler version: • Query complexity version: fixed DB D, given a query q, is q(D) non-empty? • Data complexity version: fixed query q, given a DB D, is q(D) non-empty? • Theorem: the evaluation problem is NP-complete for the query complexity version and polynomial for the data complexity version.

Query complexity • The size of the set of answers of a query q over a DB D is bounded by |D|^|q| (|D| raised to the power |q|). • |D| - size of the normal form of D. • |q| - the number of symbols in the query.

What we saw • RDF model. • RDF semantics. • Normal forms of RDF graphs. • Querying RDF databases. • Query complexity.
