SPARQLing Constraints for RDF Michael Schmidt, 20.03.2008 joint work with Prof. Georg Lausen, Michael Meier
About… Michael Schmidt • 2001-2006: Studies of Applied Computer Science in Saarbrücken • 2006: Started my PhD in Saarbrücken with Prof. Christoph Koch • Focus on XML, XQuery, Streams • Since 2007: at Freiburg University with Prof. Georg Lausen • Focus on SPARQL, RDF
Table of Contents • SPARQLing Constraints for RDF • Constraints for RDF • Types of constraints • Encoding of constraints in RDF • Satisfiability • SPARQL in the context of constraints • Extracting constraints with SPARQL • Checking constraints with SPARQL • Exploiting constraints: Semantic Query Optimization • SP2Bench: A SPARQL Performance Benchmark
SPARQLing Constraints for RDF • RDF Data Format • Machine-readable information • Established in the Semantic Web • Constraints • Primary and Foreign Keys • Cardinality Constraints, … • based on • SPARQL Query Language • W3C Recommendation since January 2008
Why Constraints? • Restricting the state space of the database • Maintenance of data consistency (e.g. when data is updated) • Semantic Query Optimization • Better understanding of the data • In our scenario: Translation of Relational Schemata to RDF without loss of information
Our Contribution • Extension of RDF by constraints • Key constraints, cardinality constraints, … • Seamless integration into the RDF Framework • Study of the role of SPARQL in this context • Checking constraints with SPARQL • Specification of user-defined constraints • Optimization of SPARQL queries under constraints (Semantic Query Optimization)
The RDF Data Format • Three Types of Elements • URIs (U): represent physical or logical resources • Blank nodes (B): resources without fixed URI • Literals (L): represent values • RDF Triples: (subject, predicate, object) • subject ∈ U ∪ B • predicate ∈ U • object ∈ U ∪ B ∪ L
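The position rules for triples can be sketched in a few lines of Python. This is only an illustration of the definition on the slide; the class names `URI`, `BlankNode`, `Literal` and the helper `is_valid_triple` are hypothetical and not part of any RDF library.

```python
from dataclasses import dataclass


# Hypothetical term classes, one per element type (U, B, L).
@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


def is_valid_triple(subject, predicate, obj) -> bool:
    """Check subject in U ∪ B, predicate in U, object in U ∪ B ∪ L."""
    return (
        isinstance(subject, (URI, BlankNode))
        and isinstance(predicate, URI)
        and isinstance(obj, (URI, BlankNode, Literal))
    )


# The example triple (Person1, name, "Joe") is well-formed.
print(is_valid_triple(URI("Person1"), URI("name"), Literal("Joe")))
# A literal in subject position is not.
print(is_valid_triple(Literal("Joe"), URI("name"), URI("Person1")))
```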
Example RDF Triple • Triple: (Person1, name, "Joe") with subject Person1 (URI), predicate name, object "Joe" (literal) • Graph Representation: Person1 --name--> "Joe"
RDF Databases • RDF Databases are Collections of Triples • [Figure: example RDF graph with Person1 (ssn "1234", name "Joe") and Person2 (ssn "2345", name "Pete"), the classes Student and Person, and edges labeled rdf:type, rdfs:subClassOf and knows] • Currently no support for specification of primary/foreign key constraints
Mapping Relational Data to RDF • [Tables: Teachers, Students, Courses, Participants] • + NOT NULL constraint
A Naive Translation Approach • [Figure: RDF graph with classes Teachers (t1, t2: name, faculty), Students (s1, s2: name, matric), Courses (c1, c2: name, taught_by) and Participants (p1, p2: s_id, c_id), connected to their instances via rdf:type; all attribute values are literals, and key values such as "Fred" and "11111" appear in several places]
Improving the Translation • [Figure: the same graph with classes Teachers, Students, Courses and Participants; the key values Joe, Fred, 11111 and 22222 now appear only once each and unquoted, while "CS", "Ed", "John", "Web" and "DB" remain literals]
Encoding Primary Key Constraints • Encoding of constraints in the schema layer • New namespace "rdfc" • RDF Bags • [Figure: class Teachers linked via rdfc:Key to the key node T_Key, whose member rdf:_1 is the property name; node labels include rdfc:Key and rdf:Bag; instances t1, t2 with name (Joe, Fred) and faculty "CS"]
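Encoding Foreign Key Constraints • [Figure: the primary key encoding for Teachers (T_Key with member rdf:_1 = name) extended by a foreign key: class Courses is linked via rdfc:FKey to the node C_FKey, whose member rdf:_1 is the property taught_by and which points to T_Key via rdfc:ref; instances c1, c2 with name ("Web", "DB") and taught_by]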
Other Types of Constraints • Let C, C1, C2 be classes and Qi, Ri properties • Primary Keys: Key(C,[Q1,…,Qn]) • Foreign Keys: FKey(C1,[Q1,…,Qn],C2,[R1,…,Rn]) • Cardinality Constraints: Min(C,n,R), Max(C,n,R) for n ∈ ℕ • Functionality/Totality Constraints: Func(C,Q), Total(C,Q) • Singleton Constraints: Single(C)
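To make the cardinality-style constraints concrete, here is a minimal Python sketch over a graph given as a set of (subject, predicate, object) string triples. The slide does not spell out the semantics, so the reading used here is an assumption: Min/Max bound the number of distinct R-values per instance of C, Func(C,Q) is read as "at most one value", and Total(C,Q) as "at least one value". All function names are hypothetical.

```python
RDF_TYPE = "rdf:type"


def instances(graph, cls):
    """All subjects x with (x, rdf:type, cls) in the graph."""
    return {s for (s, p, o) in graph if p == RDF_TYPE and o == cls}


def value_count(graph, subject, prop):
    """Number of distinct values of `prop` for `subject`."""
    return len({o for (s, p, o) in graph if s == subject and p == prop})


def holds_min(graph, cls, n, prop):
    # Assumed reading of Min(C,n,R): every instance has at least n values.
    return all(value_count(graph, x, prop) >= n for x in instances(graph, cls))


def holds_max(graph, cls, n, prop):
    # Assumed reading of Max(C,n,R): every instance has at most n values.
    return all(value_count(graph, x, prop) <= n for x in instances(graph, cls))


def holds_func(graph, cls, prop):
    return holds_max(graph, cls, 1, prop)


def holds_total(graph, cls, prop):
    return holds_min(graph, cls, 1, prop)


teachers = {
    ("t1", RDF_TYPE, "Teachers"), ("t1", "name", "Joe"), ("t1", "faculty", "CS"),
    ("t2", RDF_TYPE, "Teachers"), ("t2", "name", "Fred"), ("t2", "faculty", "CS"),
    ("t1", "title", "Professor"),
}
print(holds_func(teachers, "Teachers", "name"))    # every teacher has at most one name
print(holds_total(teachers, "Teachers", "title"))  # t2 has no title
```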
RDFS Constraints • Let Ci denote classes, Qi denote properties • Subclass Constraint SubC(C1,C2) • Subproperty Constraint SubP(Q1,Q2) • Property Domain/Range PropD(Q,C), PropR(Q,C) • Restrict the state space of the database • No „axioms“ that are used for inferencing
Satisfiability • Given an RDF vocabulary and a set of constraints: is there a non-empty RDF graph that satisfies the constraints? • In general: undecidable • Always satisfiable for the fragment: Primary Keys + Foreign Keys • Singleton • Max-Cardinality • Subclass + Subproperty • Property Domain + Property Range
Satisfiability • Given an RDF vocabulary and a set of constraints: is there a non-empty RDF graph that satisfies the constraints? • In general: undecidable • Undecidable for the fragment: Primary Keys + Foreign Keys • Singleton • Max-Cardinality • Subclass + Subproperty • Property Domain + Property Range • Min-Cardinality
Satisfiability • Given an RDF vocabulary and a set of constraints: is there a non-empty RDF graph that satisfies the constraints? • In general: undecidable • Decidable in ExpTime for the fragment: Unary primary keys • Unary foreign keys • Min-Cardinality + Max-Cardinality • Subclass + Subproperty • Property Domain + Property Range
The SPARQL Query Language • Operator AND (".") • SELECT ?name ?faculty WHERE { ?teacher rdf:type Teachers. ?teacher name ?name. ?teacher faculty ?faculty. } • [Figure: class Teachers with instances t1 and t2, each with a name (Joe, Fred) and faculty "CS"]
The SPARQL Query Language • Operator FILTER • SELECT ?name ?faculty WHERE { ?teacher rdf:type Teachers. ?teacher name ?name. ?teacher faculty ?faculty. FILTER (?name="Joe") } • [Figure: class Teachers with instances t1 and t2, each with a name (Joe, Fred) and faculty "CS"]
The SPARQL Query Language • Operator OPTIONAL • SELECT ?name ?faculty ?title WHERE { ?teacher rdf:type Teachers. ?teacher name ?name. ?teacher faculty ?faculty. OPTIONAL { ?teacher title ?title. } } • [Figure: class Teachers with instances t1 and t2, each with a name (Joe, Fred) and faculty "CS"; one teacher additionally has the title "Professor"]
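The behaviour of OPTIONAL can be illustrated with a toy pattern matcher in Python: solutions of the required part are extended by the optional part where possible and kept unchanged otherwise (a left outer join). This is a simplified sketch, not a SPARQL engine. The functions `match`, `match_all` and `optional` are hypothetical, and the assumption that t1 is the teacher carrying the title is made only for the example.

```python
def match(graph, pattern, binding):
    """Yield extensions of `binding` that match one triple pattern."""
    for triple in graph:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its existing value.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def match_all(graph, patterns, binding=None):
    """Conjunction (AND) of triple patterns."""
    bindings = [binding or {}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(graph, pattern, b)]
    return bindings


def optional(graph, bindings, patterns):
    """Left outer join: keep a solution even if the optional part fails."""
    out = []
    for b in bindings:
        extended = match_all(graph, patterns, b)
        out.extend(extended if extended else [b])
    return out


graph = {
    ("t1", "rdf:type", "Teachers"), ("t1", "name", "Joe"), ("t1", "faculty", "CS"),
    ("t2", "rdf:type", "Teachers"), ("t2", "name", "Fred"), ("t2", "faculty", "CS"),
    ("t1", "title", "Professor"),  # assumption: t1 holds the title
}
required = [("?teacher", "rdf:type", "Teachers"),
            ("?teacher", "name", "?name"),
            ("?teacher", "faculty", "?faculty")]
solutions = optional(graph, match_all(graph, required),
                     [("?teacher", "title", "?title")])
by_name = {s["?name"]: s.get("?title") for s in solutions}
print(by_name)
```

Fred still appears in the result, only without a binding for `?title`; with a plain AND instead of OPTIONAL he would be dropped.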
Extracting Primary Key Constraints • SELECT ?keyname ?class ?keyatt WHERE { ?class rdfc:Key ?keyname. ?keyname rdf:type rdfc:Key. ?keyname ?bagrel ?keyatt. FILTER (?bagrel!=rdf:type) } • [Figure: class Teachers linked via rdfc:Key to the key node T_Key, whose member rdf:_1 is the property name; node labels include rdfc:Key and rdf:Bag]
Extracting Foreign Key Constraints • SELECT ?keyname ?class ?keyatt ?ref WHERE { ?class rdfc:FKey ?keyname. ?keyname rdf:type rdfc:FKey. ?keyname ?bagrel ?keyatt. ?keyname rdfc:ref ?ref. FILTER (?bagrel!=rdf:type && ?bagrel!=rdfc:ref) } ORDER BY ?keyname • [Figure: class Teachers with key node T_Key (member rdf:_1 = name); class Courses linked via rdfc:FKey to the node C_FKey, whose member rdf:_1 is taught_by and which points to T_Key via rdfc:ref; node labels include rdfc:Key, rdfc:FKey and rdf:Bag]
Checking Constraints with SPARQL • Use the SPARQL ASK query form (returns "yes" exactly if the query has at least one result, "no" otherwise) • Constraint checks are possible for many natural constraints • Primary Keys + Foreign Keys • Cardinality Constraints • … • Definition: a SPARQL query checks a constraint C if it returns "yes" for each graph that violates C, and "no" otherwise.
Checking Constraints with SPARQL • Checking primary key constraints Key(C,[p1,…,pn]) • ASK { ?x rdf:type C. ?y rdf:type C. ?x p1 ?p1; [...]; pn ?pn. ?y p1 ?p1; [...]; pn ?pn. FILTER (?x!=?y) } • Returns "yes" exactly if the constraint is violated.
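The logic of the ASK query for Key(C,[p1,…,pn]) can be mirrored in plain Python over a set of string triples: the constraint is violated exactly if two distinct instances of C share a value for every key property. The function name `key_violated` is hypothetical; this is a sketch of the check, not an engine-backed implementation.

```python
def key_violated(graph, cls, props):
    """True iff two distinct instances of `cls` agree on all key properties."""
    members = {s for (s, p, o) in graph if p == "rdf:type" and o == cls}

    def values(x, prop):
        return {o for (s, p, o) in graph if s == x and p == prop}

    for x in members:
        for y in members:
            # Corresponds to FILTER (?x != ?y) plus the shared ?p1 ... ?pn.
            if x != y and all(values(x, q) & values(y, q) for q in props):
                return True
    return False


teachers = {
    ("t1", "rdf:type", "Teachers"), ("t1", "name", "Joe"), ("t1", "faculty", "CS"),
    ("t2", "rdf:type", "Teachers"), ("t2", "name", "Fred"), ("t2", "faculty", "CS"),
}
print(key_violated(teachers, "Teachers", ["name"]))     # names differ
print(key_violated(teachers, "Teachers", ["faculty"]))  # both teach in "CS"
```

The quadratic loop over pairs is fine for illustration; a real checker would group instances by their key values instead.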
Checking Constraints with SPARQL • Checking primary key constraints (example): Key(Teachers,[name]) • ASK { ?x rdf:type Teachers. ?y rdf:type Teachers. ?x name ?name. ?y name ?name. FILTER (?x!=?y) } • [Figure: class Teachers with instances t1 and t2, each with a name (Joe, Fred) and faculty "CS"] • Returns "no" (i.e., the constraint holds)
Checking Constraints with SPARQL • Checking foreign key constraints FKey(C,[p1,…,pn],D,[q1,…,qn]) • ASK { ?x rdf:type C; p1 ?p1; [...]; pn ?pn. OPTIONAL { ?y rdf:type D; q1 ?p1; [...]; qn ?pn. } FILTER (!bound(?y)) } • Returns "yes" exactly if the constraint is violated.
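The OPTIONAL plus !bound pattern used for the foreign key check amounts to "there is a referencing instance for which no referenced instance exists". A minimal Python sketch of the same test, again over string triples, could look as follows; `fkey_violated` is a hypothetical name, and the course c3 taught by "Ann" is invented solely to show a violation.

```python
from itertools import product


def fkey_violated(graph, c, ps, d, qs):
    """True iff some instance of `c` has values for `ps` that no
    instance of `d` carries under the corresponding properties `qs`."""

    def members(cls):
        return {s for (s, p, o) in graph if p == "rdf:type" and o == cls}

    def values(x, prop):
        return {o for (s, p, o) in graph if s == x and p == prop}

    targets = members(d)
    for x in members(c):
        # Every combination of values of p1 ... pn is one match of ?x.
        for combo in product(*(values(x, p) for p in ps)):
            referenced = any(
                all(v in values(y, q) for v, q in zip(combo, qs))
                for y in targets
            )
            if not referenced:  # corresponds to FILTER (!bound(?y))
                return True
    return False


graph = {
    ("t1", "rdf:type", "Teachers"), ("t1", "name", "Joe"),
    ("t2", "rdf:type", "Teachers"), ("t2", "name", "Fred"),
    ("c1", "rdf:type", "Courses"), ("c1", "taught_by", "Fred"),
    ("c2", "rdf:type", "Courses"), ("c2", "taught_by", "Joe"),
}
print(fkey_violated(graph, "Courses", ["taught_by"], "Teachers", ["name"]))

# Invented course whose teacher does not exist: the check now reports a violation.
broken = graph | {("c3", "rdf:type", "Courses"), ("c3", "taught_by", "Ann")}
print(fkey_violated(broken, "Courses", ["taught_by"], "Teachers", ["name"]))
```

As in the ASK query, an instance of C that lacks one of the properties p1 … pn does not count as a violation.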
Semantic Query Optimization • Idea: use constraint knowledge to find a more efficient query execution plan • Has been studied in the context of relational and datalog databases… • … and might now be applicable in the context of RDF and SPARQL
Semantic Query Optimization SELECT ?teachername ?coursename ?studentname WHERE { ?course rdf:type Courses; taught_by ?teachername; name ?coursename. ?participant rdf:type Participants; c_id ?teachername; s_id ?studentmatric. ?teacher rdf:type Teachers; name ?teachername. OPTIONAL { ?student rdf:type Students; matric ?studentmatric; name ?studentname. } }
A Solution Candidate Subgraph • [Figure: the example graph with Teachers t1, t2 (name, faculty), Students s1, s2 (name, matric), Courses c1, c2 (name, taught_by) and Participants p1, p2 (s_id, c_id), showing a candidate solution subgraph for the query]
Semantic Query Optimization • SELECT ?teachername ?coursename ?studentname WHERE { ?course rdf:type Courses; taught_by ?teachername; name ?coursename. ?participant rdf:type Participants; c_id ?teachername; s_id ?studentmatric. ?teacher rdf:type Teachers; name ?teachername. OPTIONAL { ?student rdf:type Students; matric ?studentmatric; name ?studentname. } } • Constraints: FKey(Participants,[s_id],Students,[matric]), Key(Students,[matric]), Total(Students,name) • Under these constraints every s_id value matches a student that has a name, so the OPTIONAL block always matches and can be replaced by a plain join.
Semantic Query Optimization • SELECT ?teachername ?coursename ?studentname WHERE { ?course rdf:type Courses; taught_by ?teachername; name ?coursename. ?participant rdf:type Participants; c_id ?teachername; s_id ?studentmatric. ?teacher rdf:type Teachers; name ?teachername. ?student rdf:type Students; matric ?studentmatric; name ?studentname. } • Constraints: FKey(Courses,[taught_by],Teachers,[name]), Key(Teachers,[name]) • Under these constraints every taught_by value matches exactly one teacher, so the Teachers pattern is redundant and can be dropped.
Semantic Query Optimization • SELECT ?teachername ?coursename ?studentname WHERE { ?course rdf:type Courses; taught_by ?teachername; name ?coursename. ?participant rdf:type Participants; c_id ?teachername; s_id ?studentmatric. ?student rdf:type Students; matric ?studentmatric; name ?studentname. } • Other optimizations possible: • Rewriting of filter expressions • Elimination of redundant rdf:type specifications • …
Future Work • Study of other types of constraints and the interaction between constraints • Development of a schematic approach to Semantic Query Optimization • Mapping to SQL/Datalog? • SPARQL-specific semantic optimizations? • Efficient constraint checking algorithms
PART II: SP2Bench • To date, no benchmark for SPARQL has been proposed • LUBM: focus on OWL and reasoning • Loose collection of benchmark queries for LUBM • SP2B fills this gap • Set in the DBLP scenario • Data generator for creating arbitrarily large datasets + 16 benchmark queries • Currently submitted for publication, will be made available online soon
The SP2Bench Data Generator • Creates bibliography documents similar to DBLP • Mirrors vital key characteristics found in original DBLP data • Structure of entities (Articles, Journals, Books, …) • Relations between authors • Quantity of entities (development over time) • Citation system Combines the benefits of both a real-world scenario and the possibility to generate arbitrarily large documents.
The DBLP RDF Schema • [Figure: class hierarchy of the DBLP RDF schema, with edges labeled "sc"]
The SP2Bench Queries • Operate on top of the characteristics that are mirrored by the data generator • Designed to test… • … typical SPARQL operators and combinations • … SPARQL solution modifiers • … existing (but also obvious future) optimizations • … RDF data access patterns • … the impact of indices on data • … and many other characteristics such as result size, different graph patterns, etc.
Benchmark Queries: Q1 • SELECT ?yr WHERE { ?proc rdf:type bench:Journal. ?proc dc:title "Journal 1 (1940)"^^xsd:string. ?proc dcterms:issued ?yr. } • Simple • Constant result size (exactly 1 result) • Might be answered very fast with an index
Benchmark Queries: Q5 • Q5a: SELECT DISTINCT ?person ?name WHERE { ?article rdf:type bench:Article. ?article dc:creator ?person. ?inproc rdf:type bench:Inproceedings. ?inproc dc:creator ?person2. ?person foaf:name ?name. ?person2 foaf:name ?name2. FILTER(?name=?name2). } • Q5b: SELECT DISTINCT ?person ?name WHERE { ?article rdf:type bench:Article. ?article dc:creator ?person. ?inproc rdf:type bench:Inproceedings. ?inproc dc:creator ?person. ?person foaf:name ?name. } • Q5a and Q5b are equivalent in our scenario • Tests implicit vs. explicit joins • We found that Q5a is much more challenging for current engines
Benchmark Queries: Q7 • SELECT DISTINCT ?title WHERE { ?class rdfs:subClassOf foaf:Document. ?doc rdf:type ?class. ?doc dc:title ?title. ?bag2 ?member2 ?doc. ?doc2 dcterms:references ?bag2. OPTIONAL { ?class3 rdfs:subClassOf foaf:Document. ?doc3 rdf:type ?class3. ?doc3 dcterms:references ?bag3. ?bag3 ?member3 ?doc. OPTIONAL { ?class4 rdfs:subClassOf foaf:Document. ?doc4 rdf:type ?class4. ?doc4 dcterms:references ?bag4. ?bag4 ?member4 ?doc3. } FILTER (!bound(?doc4)). } FILTER (!bound(?doc3)). } • Double Closed-World-Negation • Returns all publications that are cited at least once, but only cited by cited publications
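The double negation in Q7 is easier to follow when written over a plain citation relation. The Python sketch below deliberately simplifies the query: it ignores the rdfs:subClassOf typing, the reference bags and the titles, and works on (citing, cited) pairs only. The function name `q7` and the sample data are hypothetical.

```python
def q7(cites):
    """Documents cited at least once whose citing documents are all cited
    themselves. `cites` is a set of (citing, cited) pairs."""
    cited = {target for (_, target) in cites}
    return {
        doc
        for doc in cited
        # Inner negation: no citing document may itself be uncited.
        if all(source in cited for (source, target) in cites if target == doc)
    }


# A cites B, B cites C, C cites B.
cites = {("A", "B"), ("B", "C"), ("C", "B")}
print(q7(cites))
```

Here B is excluded because it is cited by A, which nobody cites, while C qualifies because its only citing document B is itself cited.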
Benchmark Results • We tested several SPARQL engines • ARQ • Sesame • Virtuoso • … • Results demonstrate that … • … there are differences between engines • … there is still room for improvement in current implementations • … there is poor support for several SPARQL specifics
Thank you for your attention!
The SPARQL Query Language • Operator UNION • SELECT ?name ?faculty WHERE { { ?teacher rdf:type Teachers. ?teacher name ?name. ?teacher faculty ?faculty. FILTER (?name="Joe"). } UNION { ?teacher rdf:type Teachers. ?teacher name ?name. ?teacher faculty ?faculty. FILTER (?name="Fred"). } } • [Figure: class Teachers with instances t1 and t2, each with a name (Joe, Fred) and faculty "CS"]