gStore: Answering SPARQL Queries Via Subgraph Matching. Lei Zou 1, Jinghui Mo 1, Lei Chen 2, M. Tamer Özsu 3, Dongyan Zhao 1. 1 Peking University, 2 Hong Kong University of Science and Technology, 3 University of Waterloo.
Outline • Background & Related Work • Overview of gStore • Encoding Technique • VS*-tree & Query Algorithm • Experiments • Conclusions
Semantic Web “Semantic Web Technologies” is a collection of standard technologies to realize a Web of Data.
RDF Data Model URI Literals URI
RDF Graph Literal Vertex Entity Vertex
SPARQL Queries SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". } Query Graph
Naïve Triple Store SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". } Too many self-joins. SQL: Select T3.Object From T as T1, T as T2, T as T3 Where T1.Predicate = 'BornOnDate' and T1.Object = '1809-02-12' and T2.Predicate = 'DiedOnDate' and T2.Object = '1865-04-15' and T3.Predicate = 'hasName' and T1.Subject = T2.Subject and T2.Subject = T3.Subject
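The self-join cost can be seen by running the query against a tiny triple table. The following is a minimal sketch using Python's built-in sqlite3 module; the table name T, the column names Subject/Predicate/Object and the sample rows are assumptions for illustration only. It returns the Object of the hasName triple, since that is what ?name binds to.

```python
import sqlite3

# Hypothetical single triple table T(Subject, Predicate, Object) with illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (Subject TEXT, Predicate TEXT, Object TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [
        ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
        ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
        ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
        ("y:Charles_Darwin", "hasName", "Charles Darwin"),
        ("y:Charles_Darwin", "BornOnDate", "1809-02-12"),
    ],
)

# One alias of T per triple pattern: three patterns require two self-joins.
SELF_JOIN_SQL = """
SELECT T3.Object
FROM T AS T1, T AS T2, T AS T3
WHERE T1.Predicate = 'BornOnDate' AND T1.Object = '1809-02-12'
  AND T2.Predicate = 'DiedOnDate' AND T2.Object = '1865-04-15'
  AND T3.Predicate = 'hasName'
  AND T1.Subject = T2.Subject AND T2.Subject = T3.Subject
"""
names = [row[0] for row in conn.execute(SELF_JOIN_SQL)]
print(names)
```

Every additional triple pattern in the SPARQL query adds one more alias of T and one more join condition, which is why this layout scales poorly with query size.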
Existing Solutions Three categories of solutions have been proposed to speed up query processing: 1. Property Table: Jena [K. Wilkinson et al. SWDB 03], … 2. Vertically Partitioned Solution: SW-store [D. J. Abadi et al. VLDB 07], … 3. Exhaustive-Indexing: RDF-3x [T. Neumann et al. VLDB 08], Hexastore [C. Weiss et al. VLDB 08], …
Existing Solutions-Property Table SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". } Reducing # of join steps SQL: Select People.hasName from People where People.BornOnDate = '1809-02-12' and People.DiedOnDate = '1865-04-15'
Existing Solutions-Vertically Partitioned Solution Fast Merge Join
Existing Solutions-Exhaustive-Indexing Range query & Merge Join Each SPARQL query statement can be translated into one “range query”. SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". }
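The "range query plus merge join" idea can be sketched with sorted Python lists standing in for the on-disk indexes. This is only an illustration, not how RDF-3x or Hexastore are implemented: the index names POS and SPO, the helper functions and the sample triples are all assumptions.

```python
from bisect import bisect_left

# Illustrative (subject, predicate, object) triples.
TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Charles_Darwin", "hasName", "Charles Darwin"),
    ("y:Charles_Darwin", "BornOnDate", "1809-02-12"),
]

# Two of the six possible orderings kept by an exhaustive index.
POS = sorted((p, o, s) for s, p, o in TRIPLES)  # predicate, object, subject
SPO = sorted(TRIPLES)                           # subject, predicate, object

def scan(index, prefix):
    """Range scan: last component of every entry that starts with `prefix`."""
    start = bisect_left(index, prefix)
    out = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        out.append(entry[-1])
    return out

def merge_join(a, b):
    """Intersect two sorted, duplicate-free lists with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

born = scan(POS, ("BornOnDate", "1809-02-12"))   # one range query per pattern
died = scan(POS, ("DiedOnDate", "1865-04-15"))
matches = merge_join(born, died)                 # both lists are sorted by subject
names = [name for s in matches for name in scan(SPO, (s, "hasName"))]
print(names)
```

Because each range scan returns subjects in sorted order, the join needs only a single linear pass over both lists.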
Some Limitations • Difficult to handle “wildcard queries”. • Difficult to handle updates.
Outline • Background & Related Work • Overview of gStore • Encoding Technique • VS*-tree & Query Algorithm • Experiments • Conclusions
Intuition of gStore Finding Matches over a Large Graph is not a trivial task.
Preliminaries Literal Vertex Entity Vertex
Preliminaries • RDF graph
Preliminaries • Query Graph
Preliminaries • match
Preliminaries • Problem definition
Storage Schema in gStore Encoding all neighbors of a vertex into a “bit-string”, called a signature.
Encoding Technique (1) • |eSig(e).e| = M. • We employ m different string hash functions Hi (i = 1, ..., m). • For each hash function Hi, we set the (Hi(eLabel) MOD M)-th bit in eSig(e).e to ‘1’. • eSig(e).n is encoded in the same way, with |eSig(e).n| = N and n different hash functions.
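The bit-setting rule above can be sketched as follows. The concrete values of M, m, N and n, and the use of salted SHA-256 as the family of hash functions Hi, are assumptions made only so that the example runs; the slides do not specify them.

```python
import hashlib

M, m = 16, 2   # assumed length of eSig(e).e and number of hash functions
N, n = 32, 3   # assumed length of eSig(e).n and number of hash functions

def H(i, label):
    """Hypothetical i-th string hash function: SHA-256 salted with i."""
    digest = hashlib.sha256(f"{i}:{label}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def encode(label, length, num_hashes):
    """Set the (H_i(label) MOD length)-th bit for i = 1..num_hashes."""
    bits = 0
    for i in range(1, num_hashes + 1):
        bits |= 1 << (H(i, label) % length)
    return bits

def edge_signature(edge_label, neighbor_label):
    """Return (eSig(e).e, eSig(e).n) as integers used as bit-strings."""
    return encode(edge_label, M, m), encode(neighbor_label, N, n)

e_bits, n_bits = edge_signature("hasName", "Abraham Lincoln")
print(format(e_bits, f"0{M}b"), format(n_bits, f"0{N}b"))
```

Two hash functions may land on the same bit, so each part has at most (not exactly) as many set bits as hash functions.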
Encoding Technique (2) [Figure: the neighbor label “Abraham Lincoln” is split into 3-grams (“Abr”, “bra”, “rah”, “aha”, …) that are hashed into a bit-string; the bit-strings for the edges (hasName, “Abraham Lincoln”), (BornOnDate, “1809-02-12”), (DiedOnDate, “1865-04-15”) and (DiedIn, “y:Washington_D.c”) are combined with bitwise OR.]
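The 3-gram hashing and the OR-combination on this slide can be sketched as below. The bit lengths, the number of hash functions and the salted SHA-256 hash are assumptions; only the 3-gram split and the bitwise OR come from the slide.

```python
import hashlib
from functools import reduce

M, N = 16, 32      # assumed bit lengths of the edge-label and neighbor-label parts
NUM_HASHES = 2     # assumed number of hash functions per part

def H(i, text):
    """Hypothetical i-th string hash function: SHA-256 salted with i."""
    digest = hashlib.sha256(f"{i}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def ngrams(text, k=3):
    """All overlapping k-character substrings ('Abr', 'bra', 'rah', ...)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)] or [text]

def edge_signature(edge_label, neighbor_label):
    e_bits = 0
    for i in range(NUM_HASHES):
        e_bits |= 1 << (H(i, edge_label) % M)
    n_bits = 0
    for gram in ngrams(neighbor_label):
        for i in range(NUM_HASHES):
            n_bits |= 1 << (H(i, gram) % N)
    return (e_bits << N) | n_bits   # concatenate the two parts

def vertex_signature(adjacent_edges):
    """Bitwise OR of the signatures of all adjacent edges."""
    return reduce(lambda acc, edge: acc | edge_signature(*edge), adjacent_edges, 0)

EDGES = [
    ("hasName", "Abraham Lincoln"),
    ("BornOnDate", "1809-02-12"),
    ("DiedOnDate", "1865-04-15"),
    ("DiedIn", "y:Washington_D.c"),
]
v_sig = vertex_signature(EDGES)
print(format(v_sig, f"0{M + N}b"))
```

Because the vertex signature is an OR, every bit set in any adjacent edge signature is also set in the vertex signature.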
Outline • Background & Related Work • Overview of gStore • Encoding Technique • VS*-tree & Query Algorithm • Experiments • Conclusions
A Straightforward Solution (1) u2 u1 L1 L2
A Straightforward Solution (2) L1 L2 Large Join Space !
Pruning Technique Reduced Join Space! u2 u1 10010
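One way to read the pruning step is as a signature containment test: a data vertex can match a query vertex only if every bit set in the query signature is also set in the data signature. The sketch below assumes that rule; the vertex names and bit patterns are made up for illustration.

```python
def may_match(query_sig, data_sig):
    """True if every bit set in query_sig is also set in data_sig."""
    return query_sig & data_sig == query_sig

def candidates(query_sig, data_sigs):
    """Keep only the data vertices that pass the signature filter."""
    return [v for v, sig in data_sigs.items() if may_match(query_sig, sig)]

# Hypothetical signatures of three data vertices.
DATA_SIGS = {"v1": 0b10110, "v2": 0b00011, "v3": 0b11010}
QUERY_SIG = 0b10010

survivors = candidates(QUERY_SIG, DATA_SIGS)
print(survivors)
```

The filter can produce false positives, because different labels may hash to the same bits, so surviving candidates still have to be verified against the actual labels. It never discards a true match.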
Optimized method • Too many super edges • Which level to start search • No brute-force enumeration
VS*-Tree Insert • The criterion in the VS-tree depends only on the Hamming distance between the signature of u and that of the node in the VS-tree. • The criterion in the VS*-tree depends on both the node signatures and the structure of G*.
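The Hamming-distance criterion of the plain VS-tree can be sketched as choosing the child whose signature is closest to the signature of the new vertex u. The function names and bit patterns are assumptions, and the structure-aware VS*-tree criterion is not covered here.

```python
def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")

def choose_child(child_sigs, u_sig):
    """Index of the child node whose signature is nearest to u_sig."""
    return min(range(len(child_sigs)), key=lambda i: hamming(child_sigs[i], u_sig))

# Hypothetical signatures of three child nodes and of a new vertex u.
CHILD_SIGS = [0b1100, 0b0011, 0b1111]
U_SIG = 0b0001

best = choose_child(CHILD_SIGS, U_SIG)
print(best, hamming(CHILD_SIGS[best], U_SIG))
```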
VS*-Tree split • The B+1 entries of the node will be partitioned into two new nodes, where B is the maximal fanout for a node in the VS*-tree. • 1. We find the two entries that have the maximal Hamming distance between them and use them as the two seed nodes. • 2. We associate each remaining entry with the nearest seed node, according to Equation 1.
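The two split steps can be sketched as follows. Equation 1 is not reproduced in these slides, so this sketch substitutes plain Hamming distance as the nearness measure; that substitution, the function names and the sample signatures are all assumptions.

```python
from itertools import combinations

def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")

def split_node(signatures):
    """Partition the entries of an overflowing node into two groups."""
    # Step 1: the two entries with maximal Hamming distance become the seeds.
    pairs = combinations(range(len(signatures)), 2)
    i, j = max(pairs, key=lambda p: hamming(signatures[p[0]], signatures[p[1]]))
    group_i, group_j = [i], [j]
    # Step 2: every remaining entry joins the nearer seed (ties go to the first seed).
    for k, sig in enumerate(signatures):
        if k in (i, j):
            continue
        if hamming(sig, signatures[i]) <= hamming(sig, signatures[j]):
            group_i.append(k)
        else:
            group_j.append(k)
    return [signatures[k] for k in group_i], [signatures[k] for k in group_j]

# B = 4 in this example, so an overflowing node holds B + 1 = 5 entries.
ENTRIES = [0b0000, 0b0001, 0b1111, 0b0111, 0b0010]
left, right = split_node(ENTRIES)
print([format(s, "04b") for s in left], [format(s, "04b") for s in right])
```

A production split would also need to keep both groups above the minimal fanout b, which this sketch does not enforce.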
VS*-Tree deletion • Similar to split • If some node d has fewer than b entries, where b is the minimal fanout of a node in the VS*-tree, then d is deleted and its entries are reinserted into the VS*-tree.
Updates- Deletion in VS*-tree To be deleted
Which Level To Begin • We define the “pruning power” of GI with regard to Q*, denoted as P(Q*, GI).
Finding Valid Child States • We propose a DFS strategy to find all valid child states of J. • Start a DFS over G* beginning from some vertex vi.
Outline • Background & Related Work • Overview of gStore • Encoding Technique • VS*-tree & Query Algorithm • Experiments • Conclusions