1 / 140

Scalable RDF Data Management & SPARQL Query Processing

Scalable RDF Data Management & SPARQL Query Processing. Martin Theobald 1 , Katja Hose 2 , Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of Aalborg, Denmark 3 University of Passau, Germany. Outline of this Tutorial. Part I RDF in Centralized Relational Databases

maida
Download Presentation

Scalable RDF Data Management & SPARQL Query Processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Scalable RDF Data Management & SPARQL Query Processing Martin Theobald1, KatjaHose2, Ralf Schenkel3 1University of Antwerp, Belgium 2University of Aalborg, Denmark 3University of Passau, Germany

  2. Outline of this Tutorial • Part I • RDF in CentralizedRelational Databases • Part II • RDF in Distributed Settings • Part III • Managing Uncertain RDF Data

  3. Outline for Part I Part I.1: Foundations IntroductiontoRDF andLinked Open Data A shortoverviewof SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

  4. Information Extraction YAGO/DBpedia et al. bornOn(Jeff, 09/22/42) gradFrom(Jeff, Columbia) hasAdvisor(Jeff, Arthur) hasAdvisor(Surajit, Jeff) knownFor(Jeff, Theory) >120 M facts for YAGO2 (mostly from Wikipedia infoboxes & categories)

  5. YAGO2 Knowledge Base 3 M entities, 120 M facts 100 relations, 200k classes Entity accuracy  95% subclass subclass subclass Organization Person Location subclass subclass subclass subclass subclass Country Scientist Politician subclass subclass State instanceOf instanceOf Biologist instanceOf Physicist City instanceOf Germany instanceOf instanceOf locatedIn Erwin_Planck Oct 23, 1944 diedOn locatedIn Kiel Schleswig-Holstein FatherOf bornIn Nobel Prize hasWon instanceOf citizenOf diedOn Oct 4, 1947 Max_Planck Society Max_Planck Angela Merkel Apr 23, 1858 bornOn means means means means means “Max Planck” “Max Karl Ernst Ludwig Planck” “Angela Merkel” “Angela Dorothea Merkel” http://www.mpi-inf.mpg.de/yago-naga/

  6. Why care about scalability? Sources: linkeddata.org wikipedia.org Rapidgrowthofavailablesemanticdata

  7. Why care about scalability? Sources: linkeddata.org wikipedia.org As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase Rapid growth of available semantic data More than 30 billion triples in more than 200 sources across the LOD cloud DBPedia: 3.4 million entities, 1 billion triples

  8. … andstill growing • Billion triple challenge 2008: 1B triples • Billion triple challenge 2010: 3B tripleshttp://km.aifb.kit.edu/projects/btc-2010/ • Billion triple challenge 2011: 2B tripleshttp://km.aifb.kit.edu/projects/btc-2011/ • | • War stories from http://www.w3.org/wiki/LargeTripleStores • BigOWLIM: 12B triples in Jun 2009 • Garlik 4Store: 15B triples in Oct 2009 • OpenLink Virtuoso: 15.4B+ triples • AllegroGraph: 1+ Trilliontriples

  9. Queries can be complex, too SELECT DISTINCT ?a ?b ?lat ?long WHERE { ?a dbpedia:spouse ?b. ?a dbpedia:wikilinkdbpediares:actor. ?b dbpedia:wikilinkdbpediares:actor. ?a dbpedia:placeOfBirth ?c. ?b dbpedia:placeOfBirth ?c. ?c owl:sameAs ?c2. ?c2 pos:lat ?lat. ?c2 pos:long ?long. } Q7 on BTC2008 in [Neumann & Weikum, 2009]

  10. What effects does the financial crisis have on migration rates in the US?

  11. Is there a significant increase of serious weather conditions in Europe over the past 20 years?

  12. Which glutamic-acid proteases are inhibitors of HIV?

  13. Question Answering (QA) Systems • KB from Wikipedia and user edits • 600 million facts, 25 million entities • KB of curated, structured data • 10 trillion (!) facts, 50k algorithms

  14. IBM Watson: Deep Question Answering William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel This town is known as "Sin City" & its downtown is "Glitter Gulch" As of 2010, this is the only former Yugoslav republic in the EU 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain question classification & decomposition knowledge back-ends D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project.AI Magazine, Fall 2010. YAGO www.ibm.com/innovation/us/watson/index.htm

  15. SPARQL 1.0 / 1.1 Query language for RDF suggested by the W3C. 3 ways to interpret RDF data: Instances of logical predicates (“facts”) Graphs(subjects/objects as nodes, predicates as directed and labeled edges) Relations(either multiple binary relations or a single, large ternary relation) SPARQL main building block: select-project-join combination of relational triple patterns equivalent to graph isomorphism queries over a potentially very large RDF graph

  16. SPARQL – Example scientist isA isA actor vegetarian physicist chemist isA isA isA isA isA isA Mike_Myers Jim_Carrey Albert_Einstein Otto_Hahn bornIn bornIn bornIn bornIn Scarborough Newmarket Ulm Frankfurt locatedIn locatedIn locatedIn locatedIn Ontario Germany locatedIn locatedIn Europe Canada Examplequery:Find all actorsfrom Ontario (thatare in theknowledgebase)

  17. SPARQL – Example actor constants isA ?person variables bornIn ?loc locatedIn Ontario Examplequery:Find all actorsfrom Ontario (thatare in theknowledgebase) SELECT?personWHERE{ ?personisAactor. ?personbornIn?loc.?loclocatedInOntario . } scientist Find subgraphsofthis form: isA isA actor vegetarian physicist chemist isA isA isA isA isA isA Mike_Myers Jim_Carrey Albert_Einstein Otto_Hahn bornIn bornIn bornIn bornIn Scarborough Newmarket Ulm Frankfurt locatedIn locatedIn locatedIn locatedIn Ontario Germany locatedIn locatedIn Europe Canada

  18. SPARQL 1.0 – More Features • Eliminate duplicates in results • Return results in some order with optional LIMIT n clause • Optional matches and filters on bounded var’s • More operators: ASK, DESCRIBE, CONSTRUCT • See:http://www.w3.org/TR/rdf-sparql-query/ SELECTDISTINCT?cWHERE {?personisAactor. ?personbornIn?loc.?loclocatedIn?c} SELECT?personWHERE {?personisAactor. ?personbornIn?loc.?loclocatedInOntario} ORDER BY DESC(?person) SELECT?personWHERE {?personisAactor.OPTIONAL{?personbornIn?loc}.FILTER (!BOUND(?loc))}

  19. SPARQL 1.1 Extensionsofthe W3C W3C SPARQL 1.1: Aggregations (COUNT, AVG, …) and grouping Subqueries in the WHEREclause Safe negation: FILTER NOT EXISTS {?x …} Syntactic sugar forOPTIONAL {?x … }FILTER(!BOUND(?x)) Expressions in the SELECT clause:SELECT(?a+?b) AS ?sum Label constraints on paths:?xfoaf:knows/foaf:knows/foaf:name?name More functions and operators …

  20. RDF+SPARQL: Centralized Engines BigOWLIM(now ontotext.com) OpenLink Virtuoso OntoBroker(now semafora-systems.com) Apache Jena(different main-memory/relational backends) Sesame (now openRDF.org) SW-Store, Hexastore, 3Store, RDF-3X (no reasoning) System deployments with >1011 triples ( see http://esw.w3.org/LargeTripleStores)

  21. SPARQL: Extensions from Research (1) More complex graph patterns: Transitive paths [Anyanwu et al., WWW’07]SELECT ?p, ?c WHERE {?p isA scientist . ?p ??r ?c. ?c isA Country . ?c locatedIn Europe . PathFilter(cost(??r) < 5) . PathFilter(containsAny(??r,?t ). ?t isA City . } Regular expressions [Kasneci et al., ICDE’08]SELECT ?p, ?c WHERE { ?p isA ?s. ?s isA scientist. ?p(bornIn | livesIn | citizenOf) locatedIn*Europe.} Meanwhile mostly covered by the SPARQL 1.1 query proposal.

  22. SPARQL: Extensions from Research (2) Queries over federated RDF sources: Determine distribution of triple patterns as part of query (for example in Jena ARQ) Automatically route triple predicates to useful sources

  23. SPARQL: Extensions from Research (2) Queries over federated RDF sources: Determine distribution of triple patterns as part of query (for example in Jena ARQ) Automatically route triple predicates to useful sources Potentially requires mapping of identifiers from different sources SPARQL 1.1 explicitly supportsfederation of sources http://www.w3.org/TR/sparql11-federated-query/

  24. Ranking is Essential! Queries often have a huge number of results: “scientists from Canada” “publications in databases” “actors from the U.S.” Queries may have no matches at all: “Laboratoired'informatique de Paris 6” “most beautiful railway stations” Ranking is an integral part of search Huge number of app-specific ranking methods:paper/citation count, impact, salary, … Need for generic ranking of 1) entities and 2) facts

  25. Extending Entities with Keywords Remember: entities occur in facts &in documents Associate entities with terms in those documents, keywords in URIs, literals, …(context of entity) chancellor Germany scientist election Stuttgart21 Guido Westerwelle France Nicolas Sarkozy

  26. Extensions: Keywords Problem: not everything is triplified! • Consider witnesses/sources • (provenance meta-facts) • Allow text predicates with • each triple pattern (à la XQ-FT) • Semantics: • triples match struct. pred. • witnesses match text pred. European composerswhohavewonthe Oscar, whosemusicappeared in dramatic western scenes, andwho also wroteclassicalpieces ? Select ?p Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . ?p contributedTo ?movie[western, gunfight, duel, sunset] . ?p composed ?music[classical, orchestra, cantata, opera] . }

  27. Extensions: Keywords Problem: not everything is triplified! • Consider witnesses/sources • (provenance meta-facts) • Allow text predicates with • each triple pattern (à la XQ-FT) Proximity of keywords or phrases boosts expressiveness French politiciansmarriedtoItaliansingers? Select ?p1, ?p2 Where { ?p1 instanceOf ?c1 [France, politics] . ?p2 instanceOf ?c2 [Italy, singer] . ?p1 marriedTo ?p2 . } CS researcherswhoseadvisorsworked on the Manhattan project? Select ?r, ?a Where { ?r instOfresearcher[“computerscience“] . ?a workedOn ?x [“Manhattan project“] . ?r hasAdvisor ?a . } Select ?r, ?a Where { ?r ?p1 ?o1 [“computerscience“] . ?a ?p2 ?o2 [“Manhattan project“] . ?r ?p3 ?a . }

  28. Extensions: Keywords Problem: not everything is triplified! CLEF/INEX 2012-13 Linked Data Track https://inex.mmci.uni-saarland.de/tracks/lod/ <topic id="2012374" category="Politics"> <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue> <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title> <sparql_ft> SELECT ?s ?s1 WHERE { ?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> . ?s1 <http://dbpedia.org/property/successor> ?s . FILTER FTContains (?s, "stepped down early") . } </sparql_ft> </topic>

  29. Extensions: Keywords / Multiple Languages Problem: not everything is triplified! Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/ <question id="4" answertype="resource" aggregation="false" onlydbo="true"> <string lang="en">Which river does the Brooklyn Bridge cross?</string> <string lang="de">WelchenFlussüberspannt die Brooklyn Bridge?</string> <string lang="es">¿Porquéríocruza la Brooklyn Bridge?</string> <string lang="it">Quale fiumeattraversailponte di Brooklyn?</string> <string lang="fr">Quellecoursd'eauesttraversé par le pont de Brooklyn?</string> <string lang="nl">Welkerivieroverspant de Brooklyn Bridge?</string> <keywords lang="en">river, cross, Brooklyn Bridge</keywords> <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords> <keywords lang="es">río, cruza, Brooklyn Bridge</keywords> <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords> <keywords lang="fr">coursd'eau, pont de Brooklyn</keywords> <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords> <query> PREFIX dbo: <http://dbpedia.org/ontology/> PREFIX res: <http://dbpedia.org/resource/> SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridgedbo:crosses ?uri . } </query> </question>

  30. What Makes a Fact “Good”? • Confidence: • Prefer results that are likely correct • accuracy of info extraction • trust in sources • (authenticity, authority) bornIn (Jim Gray, San Francisco) from “Jim Gray was born in San Francisco” (en.wikipedia.org) livesIn (Michael Jackson, Tibet) from “Fans believe Jacko hides in Tibet” (www.michaeljacksonsightings.com) • Informativeness: • Preferresultswithsalientfacts • Statistical estimationfrom: • frequency in answer • frequency on Web • frequency in query log q: Einstein isa ? Einstein isascientist Einstein isa vegetarian q: ?x isa vegetarian Einstein isa vegetarian Whocaresisavegetarian Diversity: Prefervarietyoffacts E won … E discovered … E played … E won … E won … E won … E won … • Conciseness: • Prefer results that are tightly connected • size of answer graph • weight of Steiner tree Einstein won NobelPrize Bohr won NobelPrize Einstein isa vegetarian Cruise isa vegetarian Cruise born 1962 Bohr died 1962

  31. How Can WeImplement This? PageRank-style entity/fact ranking [V. Hristidis et al., S.Chakrabarti, …] or IR models: tf*idf … [K.Chang et al., …] Statistical Language Models [de Rijke et al.] • Confidence: • Prefer results that are likely correct • accuracy of info extraction • trust in sources • (authenticity, authority) Empirical accuracy of Information Extraction PageRank-style estimate of trust combine into: max { accuracy (f,s) * trust(s) | s  witnesses(f) } • Informativeness: • Prefer results with salient facts • Statistical estimation from: • frequency in answer • frequency on Web • frequency in query log Diversity: Prefer variety of facts Statistical Language Models [Zhai et al., Elbassuoni et al.] • Conciseness: • Prefer results that are tightly connected • size of answer graph • weight of Steiner tree Graph algorithms (BANKS, STAR, …) [S.Chakrabarti et al., G.Kasneci et al., …]

  32. Outline for Part I Part I.1: Foundations Introductionto RDF A shortoverviewof SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

  33. RDF in Rowstores • Rowstore: general relational database, storing relations (incl. facts) as complete rows (MySQL, PostgreSQL, Oracle, DB2, SQLServer, …) • General principles: • store triples in one giant three-attribute table(subject, predicate, object) • convert SPARQL to equivalent SQL • The database will do the rest • Strings often mapped to unique integer IDs • Used by many TripleStores, including 3Store, Jena, HexaStore, RDF-3X, … Simple extension to quadruples (with graphid): (graph,subject,predicate,object) We consider only triples for simplicity!

  34. Example: Single Triple Table subjectpredicateobject ex:Katjaex:teachesex:Databases ex:Katjaex:works_forex:MPI_Informatics ex:Katjaex:PhD_fromex:TU_Ilmenau ex:Martinex:teachesex:Databases ex:Martinex:works_forex:MPI_Informatics ex:Martinex:PhD_fromex:Saarland_University ex:Ralfex:teachesex:Information_Retrieval ex:Ralfex:PhD_fromex:Saarland_University ex:Ralfex:works_forex:Saarland_University ex:Ralfex:works_forex:MPI_Informatics ex:Katjaex:teachesex:Databases; ex:works_forex:MPI_Informatics; ex:PhD_fromex:TU_Ilmenau. ex:Martinex:teachesex:Databases; ex:works_forex:MPI_Informatics; ex:PhD_fromex:Saarland_University. ex:Ralfex:teachesex:Information_Retrieval; ex:PhD_fromex:Saarland_University; ex:works_forex:Saarland_University, ex:MPI_Informatics.

  35. Conversion of SPARQL to SQL General approach to translate SPARQL into SQL: (1) Each triple patternis translated into a (self-) JOIN over the triple table (2) Shared variablescreate JOIN conditions (3) Constants create WHERE conditions (4) FILTER conditions create WHERE conditions (5) OPTIONAL clauses create OUTER JOINS (6) UNION clauses create UNION expressions

  36. Example: Conversion to SQL Query Projection P4    P3 Filter regex(?u,“Saar“) P2 P1 SELECT ?a ?b ?t WHERE {?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. } OPTIONAL {?a teaches ?t} FILTER (regex(?u, “Saar”)) ?a ?a,?u SELECT P1.subject as A, P2.subject as B FROM Triples P1, Triples P2, Triples P3 WHERE P1.predicate=“works_for” AND P2.predicate=“works_for” AND P3.predicate=“phd_from” AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object AND REGEXP_LIKE(P1.object, “Saar”) SELECT FROM Triples P1, Triples P2, Triples P3 WHERE P1.predicate=“works_for” AND P2.predicate=“works_for” AND P3.predicate=“phd_from” AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object SELECT FROM Triples P1, Triples P2, Triples P3 WHERE P1.predicate=“works_for” AND P2.predicate=“works_for” AND P3.predicate=“phd_from” SELECT R1.A, R1.B, R2.T FROM ( SELECT P1.subject as A, P2.subject as B FROM Triples P1, Triples P2, Triples P3 WHERE P1.predicate=“works_for” AND P2.predicate=“works_for” AND P3.predicate=“phd_from” AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object AND REGEXP_LIKE(P1.object, “Saar”) ) R1 LEFT OUTER JOIN ( SELECT P4.subject as A, P4.object as T FROM Triples P4 WHERE P4.predicate=“teaches”) AS R2 ) ON (R1.A=R2.A) SELECT FROM Triples P1, Triples P2, Triples P3 WHERE P1.predicate=“works_for” AND P2.predicate=“works_for” AND P3.predicate=“phd_from” AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object AND REGEXP_LIKE(P1.object, “Saar”) SELECT FROM Triples P1, Triples P2, Triples P3 ?u

  37. Is that all? Well, no. • Whichindexesshouldbebuilt?(tosupportefficientevaluationoftriplepatterns) • Howcanwereducestoragespace? • Howcanwe find thebestexecution plan? • Existing databases need modifications: • flexible, extensible, generic storage not needed here • cannot deal with multiple self-joins of a single table • often generate bad execution plans

  38. Dictionary for Strings Map all strings to unique integers (e.g., via hashing) • Regular size (4-8 bytes), much easier to handle • Dictionary usually small, can be kept in main memory <http://example.de/Katja>  194760 <http://example.de/Martin>  679375 <http://example.de/Ralf> 4634 This may break original lexicographic sorting order  RANGE conditions (not in SPARQL) are difficult!  FILTER conditions may be more expensive!

  39. Indexes forCommonlyUsed Triple Patterns Patterns with a single variable are frequent Example: Albert_Einstein invented ?x • Build clustered index over (s,p,o) Can also be used for pattern like Albert_Einstein ?p ?x Build similar clustered indexes for all six permutations (3 x 2 x 1 = 6) • SPO, POS, OSP to cover all possible triplet patterns • SOP, OPS, PSO to have all sort orders for patterns with two var’s (16,19,5356)(16,24,567)(16,24,876)(27,19,643)(27,48,10486)(50,10,10456) … • Lookup ids for constants:Albert_Einstein 16, invented 24 • Lookup known prefix in index:(16,24,0) • Read results while prefix matches:(16,24,567), (16,24,876)come already sorted! All triples in(s,p,o) order B+ treeforeasy access Triple table no longer needed, all triples in each index

  40. WhySort Order MattersforJoins   • When inputs sorted by joinattribute, use Merge Join: • sequentially scan both inputs • immediately join matching triples • skip over parts without matches • allows pipelining • When inputs are unsorted/sortedby wrong attribute, use Hash Join: • build hash table from one input • scan other input, probe hash table • needs to touch every input triple • breaks pipelining HJ MJ (16,19,5356)(16,24,567)(16,24,876)(27,19,643)(27,48,10486)(50,10,10456) (27,18,133)(50,134,1056)(16,56,1345)(24,16,1353) (47,37,20495)(16,33,46578) (16,19,5356)(16,24,567)(16,24,876)(27,19,643)(27,48,10486)(50,10,10456) (16,33,46578)(16,56,1345)(24,16,1353)(27,18,133)(47,37,20495)(50,134,1056) In general, Merge Joins are more preferable: small memory footprint, pipelining

  41. RDF-3x: Even More Indexes! SPARQL 1.0 considers duplicates (unless removed with DISTINCT) but does not (yet) support aggregates/counting • often queries with many duplicates like SELECT ?x WHERE ?x ?y Germany.to retrieve entities related to Germany (but counts may be important in the application!) • this materializes many identical intermediate results Solution: even more redundancy! • Pre-compute aggregated indexes SP,SO,PO,PS,OP,OS,S,P,OExample: SO contains, for each pair (s,o), the number of triples with subject s and object o • Do not materialize identical bindings, but keep counts Example:?x=Albert_Einstein:4; ?x=Angela_Merkel:10 • 15 indexes overall (all SPO permutations + their unique subsets)

  42. RDF-3x: Compression Scheme for Triplets • Compress sequences of triples in lexicographic order (v1;v2;v3); for SPO: v1=S, v2=P, v3=O • Step 1: compute per-attribute deltas • Step 2:variable-byte encoding for each delta triple 1-13 bytes (16,19,5356)(16,24,567)(16,24,676)(27,19,643)(27,48,10486)(50,10,10456) (16,19,5356)(0,5,-4798)(0,0,109)(11,-5,-34)(0,29,9843)(23,-38,-30)  gapbit header(7 bits) Delta ofvalue1(0-4 bytes) Delta of value 2(0-4 bytes) Delta of value 3(0-4 bytes) When gap=1, the delta of value3 isincluded in header, all others are 0 Otherwise, headercontainslengthofencodingforeachofthethreedeltas (5*5*5=125 combinationsstored in 7 bits) Many variants exist; this one is designed for triplets…

  43. CompressionEffectivenessvs. Efficiency • Byte-level encoding almost as effective as bit-level encoding techniques (Gamma, Golomb, Rice, etc.) • Much faster (10x) for decompressing • Example for Barton dataset [Neumann & Weikum: VLDB’10]: • Raw data 51 million triples, 7GB uncompressed (as N-Triples) • All 6 main indexes: • 1.1GB size, 3.2s decompression with byte-level encoding • Optionally: additional compression with LZ77 2x more compact, but much slower to decompress • Compression always on page level

  44. Back to the Example Query Projection       ?a MJ ?u MJ PSO(works_for,?u,?b) ?u,?a POS(pdh_from,?u,?a) POS(works_for,?u,?a) SELECT ?a ?b ?t WHERE {?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. } OPTIONAL {?a teaches ?t} FILTER (regex(?u, “Saar”)) Projection 250 250 250 ?a POS(teaches,?a,?t) HJ 2500 ?a,?u 250 100 MJ PSO(phd_from,?a,?u) ?u 50 POS(teaches,?a,?t) 5 Filter regex(?u,“Saar“) 1000 1000 POS(works_for,?u,?b) 50 1000 Filter regex(?u,“Saar“) POS(works_for,?u,?a) 100 1000 Core ingredients of a good query optimizer areselectivity estimators for triple patterns (index scans) and joins Which of the two plans is better? How many intermediate results?

  45. RDF-3x: Selectivity Estimation How many results will a triple pattern have? Standard databases: • Per-attribute histograms • Assume independence of attributes  Use aggregated indexes for exact count Additional join statistics for triple blocks (pages): too simplisticand inexact … (16,19,5356)(16,24,567)(16,24,876)(27,19,643)(27,48,10486)(50,10,10456) … Assume independence between triple patterns; additionally precompute exact statistics for frequentpaths in the data

  46. Handling Updates What should we do when our data changes? (SPARQL 1.1 has updates!) Assumptions: • Queries far more frequent than updates • Updates mostly insertions, hardly any deletions • Different applications may update concurrently Solution: Differential Indexing

  47. RDF-3x: Differential Updates Staging architecture for updates in RDF-3X Workspace A: Triplesinserted byapplication A completion of A Workspace B: Triplesinserted byapplication B completion of B on-demand indexes at query time kept in mainmemory • Deletions: • Insert the same tuple again with “deleted” flag • Modify scan/join operators: merge differential indexes with main index

  48. Outline for Part I Part I.1: Foundations Introductionto RDF A shortoverviewof SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

  49. Principles Observations and assumptions: • Not too many different predicates • Triple patterns usually have fixed predicate • Need to access all triples with one predicate • Design consequence: • Useonetwo-attribute tableforeachpredicate

  50. Example: Columnstores works_for subjectobject ex:Katjaex:MPI_Informatics ex:Martinex:MPI_Informtatics ex:Ralfex:Saarland_University ex:Ralfex:MPI_Informatics teaches subject object ex:Katja ex:Databases ex:Martin ex:Databases ex:Ralf ex:Information_Retrieval PhD_from subjectobject ex:Katjaex:TU_Ilmenau ex:Martinex:Saarland_University ex:Ralfex:Saarland_University ex:Katjaex:teachesex:Databases; ex:works_forex:MPI_Informatics; ex:PhD_fromex:TU_Ilmenau. ex:Martinex:teachesex:Databases; ex:works_forex:MPI_Informatics; ex:PhD_fromex:Saarland_University. ex:Ralfex:teachesex:Information_Retrieval; ex:PhD_fromex:Saarland_University; ex:works_forex:Saarland_University, ex:MPI_Informatics.

More Related