
Agenda

gStore: Answering SPARQL Queries via Subgraph Matching. Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao. {zoulei,mojinghui,zdy}@icst.pku.edu.cn, leichen@cse.ust.hk, tamer.ozsu@uwaterloo.ca. Agenda: Introduction, Preliminaries, Overview of gStore





Presentation Transcript


  1. gStore: Answering SPARQL Queries via Subgraph Matching. Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao. {zoulei,mojinghui,zdy}@icst.pku.edu.cn, leichen@cse.ust.hk, tamer.ozsu@uwaterloo.ca

  2. Agenda • Introduction • Preliminaries • Overview of gStore • Storage Scheme and Encoding Technique • Indexing Structure and Query Algorithm • Optimized methods • Experiments and their results • Conclusions

  3. Introduction - 1/4 • What is RDF? • Building block of the Semantic Web • Represented as a collection of triples: (Subject, Property, Object) • Prefix: y=http://en.wikipedia.org/wiki/

  4. Introduction - 2/4: RDF Graph
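
To make the triple model concrete, here is a minimal sketch of how a handful of (Subject, Property, Object) triples can be viewed as a labeled graph. The sample triples are assumptions modeled on the Abraham Lincoln example used later in the deck, and `build_graph` is a hypothetical helper, not part of gStore.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, property, object). The identifiers
# are illustrative; "y:" stands for the prefix given on slide 3.
triples = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Abraham_Lincoln", "DiedIn", "y:Washington_DC"),
]


def build_graph(triples):
    """Map each subject vertex to its outgoing (property, object) edges."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))
    return dict(graph)


rdf_graph = build_graph(triples)
for prop, obj in rdf_graph["y:Abraham_Lincoln"]:
    print(f"y:Abraham_Lincoln --{prop}--> {obj}")
```

Each subject and object becomes a vertex and each property becomes a directed, labeled edge, which is the view the rest of the deck relies on.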

  5. Introduction - 3/4 • What is SPARQL? • Sample query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". } • Query with wildcards: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> ?bd. ?m <DiedOnDate> ?dd. FILTER (regex(str(?bd), "02-12") && regex(str(?dd), "04-15")) }

  6. Introduction - 4/4 • Problems with existing solutions: • they cannot answer SPARQL queries with wildcards in a scalable manner • they cannot handle frequent updates in RDF repositories • Answering with subgraph matching • Modeling RDF data and Query as two graphs • Cannot use regular graph pattern matching • Answering SPARQL query ≈ subgraph matching

  7. Preliminaries • An RDF graph G is denoted as G = (V, L_V, E, L_E) • A query graph Q is denoted as Q = (V, L_V, E, L_E)

  8. Preliminaries Cont’d • G(u1, u2, …, un) is a match of Q(v1, v2, …, vn) if, for all i and j: • if vi is a literal vertex, vi and ui have the same literal value • if vi is a class/entity vertex, vi and ui have the same URI • if vi is a parameter vertex, there is no constraint on ui • if vi is a wildcard vertex, vi is a substring of ui and ui is a literal value • if there is an edge from vi to vj in Q with property p, there is also an edge from ui to uj in G with the same property p
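
The match conditions above can be sketched as a direct check over a proposed assignment of data vertices to query vertices. This is an illustrative brute-force verifier under assumed data structures (vertices as `(kind, value)` pairs, edges as `(source, property, target)` tuples); the names are hypothetical and it is not the authors' implementation.

```python
def vertex_matches(q_vertex, d_vertex):
    """Check one query vertex against one data vertex (slide 8 rules)."""
    q_kind, q_val = q_vertex
    d_kind, d_val = d_vertex
    if q_kind == "literal":
        return d_kind == "literal" and d_val == q_val
    if q_kind == "entity":
        return d_kind == "entity" and d_val == q_val
    if q_kind == "parameter":
        return True  # no constraint on the data vertex
    if q_kind == "wildcard":
        return d_kind == "literal" and q_val in d_val  # substring test
    raise ValueError(f"unknown query vertex kind: {q_kind}")


def is_match(q_vertices, q_edges, d_vertices, d_edges, assignment):
    """assignment maps each query vertex id to a data vertex id."""
    for q_id, q_vertex in q_vertices.items():
        if not vertex_matches(q_vertex, d_vertices[assignment[q_id]]):
            return False
    # Every query edge must exist in the data graph with the same property.
    return all((assignment[a], p, assignment[b]) in d_edges
               for a, p, b in q_edges)


# Hypothetical data graph fragment.
d_vertices = {
    "u1": ("entity", "y:Abraham_Lincoln"),
    "u2": ("literal", "Abraham Lincoln"),
    "u3": ("literal", "1809-02-12"),
}
d_edges = {("u1", "hasName", "u2"), ("u1", "BornOnDate", "u3")}

# Query: ?m <hasName> ?name . ?m <BornOnDate> (wildcard "02-12")
q_vertices = {
    "v1": ("parameter", "?m"),
    "v2": ("parameter", "?name"),
    "v3": ("wildcard", "02-12"),
}
q_edges = [("v1", "hasName", "v2"), ("v1", "BornOnDate", "v3")]

good = {"v1": "u1", "v2": "u2", "v3": "u3"}
bad = {"v1": "u1", "v2": "u2", "v3": "u2"}
print(is_match(q_vertices, q_edges, d_vertices, d_edges, good))
print(is_match(q_vertices, q_edges, d_vertices, d_edges, bad))
```

Checking every possible assignment this way is exponential, which is why the deck goes on to describe signatures and an index for pruning before verification.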

  9. Overview of gStore • Work directly on the RDF graph and the SPARQL query graph • Use a signature-based encoding of each entity and class vertex to speed up matching • Filter and evaluate • Use a filter that may admit false positives to prune nodes and obtain a set of candidates; then verify each candidate • Use an index (VS*-tree) over the data signature graph (with a light maintenance load) for efficient pruning

  10. Storage Scheme & Encoding Technique • Storage Scheme

  11. Storage Scheme & Encoding Technique • Encoding technique (hasName, “Abraham Lincoln”)

  12. Storage Scheme & Encoding Technique • Encoding technique “Abr” (hasName, “Abraham Lincoln”) “bra” “rah”

  13. Storage Scheme & Encoding Technique • Encoding technique “Abr” (hasName, “Abraham Lincoln”) “bra” “rah”

  14. Storage Scheme & Encoding Technique • Encoding technique “Abr” (hasName, “Abraham Lincoln”) “bra” “rah” OR

  15. Storage Scheme & Encoding Technique • Encoding technique (hasName, “Abraham Lincoln”)

  16. Storage Scheme & Encoding Technique • Encoding technique (hasName, “Abraham Lincoln”) OR (BornOnDate, "1809-02-12") (DiedOnDate, "1865-04-15") (DiedIn, y:Washington DC)
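
The encoding slides suggest that each (property, value) edge is hashed into a bit string, with the literal split into 3-grams ("Abr", "bra", "rah", ...), and that the edge bit strings are OR-ed into one vertex signature. The sketch below follows that outline; the bit widths, the use of `zlib.crc32` as the hash, and all function names are assumptions chosen for illustration rather than details taken from the slides.

```python
import zlib

PROP_BITS = 16  # assumed width of the property part of a signature
TEXT_BITS = 32  # assumed width of the n-gram part of a signature


def _bit(token, width):
    """Set one bit chosen by a deterministic hash of the token."""
    return 1 << (zlib.crc32(token.encode("utf-8")) % width)


def ngrams(text, n=3):
    """All contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def edge_signature(prop, value):
    """Property bits in the high part, OR of n-gram bits in the low part."""
    text_part = 0
    for gram in ngrams(value):
        text_part |= _bit(gram, TEXT_BITS)
    return (_bit(prop, PROP_BITS) << TEXT_BITS) | text_part


def vertex_signature(edges):
    """OR together the signatures of all adjacent edges."""
    sig = 0
    for prop, value in edges:
        sig |= edge_signature(prop, value)
    return sig


# Literal-valued edges of the example vertex.
lincoln_edges = [
    ("hasName", "Abraham Lincoln"),
    ("BornOnDate", "1809-02-12"),
    ("DiedOnDate", "1865-04-15"),
]
lincoln_sig = vertex_signature(lincoln_edges)

# A wildcard query edge: every 3-gram of "Lincoln" also occurs in the full name,
# so its signature is covered by the vertex signature.
query_sig = edge_signature("hasName", "Lincoln")
print((query_sig & lincoln_sig) == query_sig)
```

Because the query's bits are a subset of the data vertex's bits whenever the substring really occurs, the test can produce false positives (hash collisions) but never misses a true match, which is what makes it usable as a filter.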

  17. Indexing Structure and Query Algorithm

  18. Data Signature Graph G*

  19. Converting Q to Q*

  20. Filter and Evaluate

  21. Generating Candidate List (CL) • Two-step process: • for each vertex vi ∈ V(Q*), find a list Ri = {ui1, ui2, ..., uin}, where each uij ∈ V(G*) and vi & uij = vi • do a multi-way join over these lists to get the candidate list • Use S-trees • Height-balanced tree over signatures • Does not support the second step, so the join is expensive • VS-tree and VS*-tree • Multi-resolution summary graph based on the S-tree • Supports both steps efficiently
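
The two-step process can be sketched without any index: first filter data vertices with the bitwise test vi & uij = vi, then join the per-vertex lists on the query edges. The signatures below are made-up values, the join is a naive Cartesian product, and the function names are hypothetical; the sketch only illustrates what the index is meant to accelerate.

```python
from itertools import product


def covers(q_sig, d_sig):
    """True when every bit set in q_sig is also set in d_sig."""
    return (q_sig & d_sig) == q_sig


def candidate_lists(q_vertex_sigs, d_vertex_sigs):
    """Step 1: for each query vertex, list the data vertices that cover it."""
    return {
        v: [u for u, d_sig in d_vertex_sigs.items() if covers(q_sig, d_sig)]
        for v, q_sig in q_vertex_sigs.items()
    }


def join(q_edges, d_edges, lists):
    """Step 2: naive multi-way join; keeps combinations whose edges match.

    Edge labels are assumed to be non-zero, so a missing data edge
    (looked up as 0) never covers a query edge.
    """
    order = list(lists)
    results = []
    for combo in product(*(lists[v] for v in order)):
        assign = dict(zip(order, combo))
        if all(covers(label, d_edges.get((assign[a], assign[b]), 0))
               for a, b, label in q_edges):
            results.append(assign)
    return results


# Made-up signatures for illustration.
d_vertex_sigs = {
    "001": 0b10000001,
    "002": 0b00101000,
    "003": 0b10000100,
    "004": 0b00011000,
}
d_edges = {("001", "002"): 0b10000, ("003", "004"): 0b01000}

q_vertex_sigs = {"v1": 0b10000000, "v2": 0b00001000}
q_edges = [("v1", "v2", 0b10000)]

lists = candidate_lists(q_vertex_sigs, d_vertex_sigs)
print(lists)
print(join(q_edges, d_edges, lists))
```

The product in `join` grows multiplicatively with the list sizes, which is the cost the slide attributes to the plain S-tree approach.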

  22. S-tree Solution (figure: S-tree over the signatures of data vertices 001 to 008; the query has vertex signatures 1000 0000 and 0000 1000 and edge label 10000)

  23. S-tree Solution (figure: S-tree search for the query signatures, animation step 2 of 5)

  24. S-tree Solution (figure: S-tree search for the query signatures, animation step 3 of 5)

  25. S-tree Solution (figure: S-tree search for the query signatures, animation step 4 of 5)

  26. S-tree Solution (figure: S-tree search for the query signatures, animation step 5 of 5)
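
As a rough illustration of how an S-tree prunes the first step, the sketch below builds a tiny signature tree in which every internal node stores the OR of its children's signatures, so a subtree can be skipped as soon as its signature fails to cover the query signature. The tree shape, the signature values, and the names `Node`, `internal`, and `search` are assumptions for this example only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    sig: int
    vertex: Optional[str] = None          # set only on leaves
    children: List["Node"] = field(default_factory=list)


def internal(children):
    """Internal node whose signature is the OR of its children's signatures."""
    sig = 0
    for child in children:
        sig |= child.sig
    return Node(sig=sig, children=children)


def search(node, q_sig):
    """Return ids of leaf vertices whose signature covers q_sig."""
    if (q_sig & node.sig) != q_sig:
        return []                          # prune the whole subtree
    if node.vertex is not None:
        return [node.vertex]
    hits = []
    for child in node.children:
        hits.extend(search(child, q_sig))
    return hits


# Made-up leaf signatures.
left = internal([Node(0b10000001, "001"), Node(0b00101000, "002")])
right = internal([Node(0b10000100, "003"), Node(0b00011000, "004")])
root = internal([left, right])

print(search(root, 0b10000000))
print(search(root, 0b01000000))
```

The second query is rejected at the root because no leaf below it has that bit set, so no leaf is visited at all.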

  27. VS-tree Solution (figure: VS-tree with summary graphs over nodes d11 to d43 and data vertices 001 to 008)

  28. VS-tree Solution (figure: query signature graph with vertex signatures 0000 1000 and 1000 0000 and edge label 10000)

  29. VS-tree Solution (figure: query signature graph; summary match d11 X d11)

  30. VS-tree Solution (figure: query signature graph; summary match d12 X d12)

  31. VS-tree Solution (figure: query signature graph; summary match d13 X d23)

  32. VS-tree Solution (figure: query signature graph; match 001 X 002)

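  33. VS-tree Solution - limitations • If this level is dense, there are many summary matches, which means more search space • Each level is processed step by step (figure: query signature graph with vertex signatures 0000 1000 and 1000 0000 and edge label 10000)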

  34. Possible Optimization Methods • “Magically” know which level to begin with, so as to minimize the number of summary matches • Use DFS (depth-first search) to find the valid child nodes • When inserting vertices, consider not only the Hamming distance but also the number of super edges introduced
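
The third idea, weighing Hamming distance against newly introduced super edges when inserting a vertex, can be sketched as a simple cost function. The linear combination, the weight `alpha`, and the data layout are all assumptions made for this illustration; the slides do not say how the two criteria are combined.

```python
def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def choose_node(vertex_sig, neighbor_nodes, nodes, alpha=1.0):
    """Pick the index node in which to place a new vertex.

    nodes: node id -> (node signature, ids of nodes it already shares
           a super edge with)
    neighbor_nodes: ids of the nodes that hold the new vertex's neighbors
    alpha: hypothetical weight for each super edge that would be created
    """
    def cost(node_id):
        sig, linked = nodes[node_id]
        new_super_edges = len(neighbor_nodes - linked - {node_id})
        return hamming(vertex_sig, sig) + alpha * new_super_edges

    return min(nodes, key=cost)


# Made-up index state.
nodes = {
    "d1": (0b1100, {"d2"}),
    "d2": (0b1110, {"d1"}),
    "d3": (0b1000, set()),
}

# The new vertex has a neighbor stored under d1.
print(choose_node(0b1000, {"d1"}, nodes, alpha=0.0))  # Hamming distance only
print(choose_node(0b1000, {"d1"}, nodes, alpha=2.0))  # super edges penalized
```

With `alpha=0.0` the closest signature wins even though it creates a new super edge; a positive weight steers the vertex toward a node that adds none.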

  35. Optimization example

  36. Experimental results - Exact queries • Dataset: Yago network (20 million triples, 3.1 GB) • Systems compared: BigOWLIM, RDF-3x, SW-Store, x-RDF-3x, GRIN, gStore (figure: per-query results chart)

  37. Experimental results - Wildcard queries • Systems compared: SW-Store, gStore, RDF-3x, BigOWLIM, GRIN, x-RDF-3x (figure: per-query results chart)

  38. Conclusion • This approach: • Uses two novel indexes, VS-tree and VS*-tree, to speed up query processing • Addresses the two problems with existing solutions: • answers SPARQL queries with wildcards in a scalable manner • handles frequent and online updates in RDF repositories

  39. Questions?
