VLDB 2012 ADDING LOGICAL OPERATORS TO TREE PATTERN QUERIES ON GRAPH STRUCTURED DATA. Authors: Qiang Zeng, Xiaorui Jiang, and Hai Zhuge. Speaker: Hai Zhuge, Key Lab of Intelligent Information Processing, Chinese Academy of Sciences.
Query on Graph • Example: Query on DBLP XML document • Get A’s conference papers published from 2000 to 2010 and co-authored with B • Get conference papers of either A or B published from 2000 to 2010 • Get A’s conference papers published from 2000 to 2010 that are not co-authored with B [Figure labels: Query, Graph data, Tree]
Graph pattern matching (i.e., subgraph matching): Given a data graph G and a pattern query q, identify the “subgraphs” of G that match q under isomorphism semantics. [Figure: example matching on DBLP; labels: v1, u1, v2, …, v3; edge, path]
DBLP • Graph pattern matching is a building block of many graph queries, which are key to many applications • Social/biological network analysis • Program analysis • Information retrieval
GTPQ: Generalized Tree Pattern Query • Applications need more powerful semantics • Incorporating Boolean logic into patterns • Each node is associated with a distinct propositional variable • In addition to attribute predicates, each non-leaf node has a structural predicate $f_s$, a propositional formula over the variables corresponding to its children • Applications often need only some of the matched nodes • Allowing a portion of query nodes to be output nodes (full-fledged evaluation) • Example (twig query: paper with author1, author2, title) • author1 or author2: $f_s(u_1) = p_{u_2} \lor p_{u_3}$ • author1 but not with author2: $f_s(u_1) = p_{u_2} \land \lnot p_{u_3}$ • Output the title only
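The two structural predicates on this slide can be tried out on a toy graph. The following is a minimal sketch, not the authors' algorithm: the data graph, the dictionary layout of a query node, and the helper names (`descendants`, `satisfies`, `answer`) are all hypothetical. It assumes every query edge means "reachable by a path", that the attribute predicate is plain label equality, and that u1, u2, u3 stand for paper, author1, author2.

```python
# Toy data graph (hypothetical): node -> label, node -> successors.
LABEL = {"p1": "paper", "p2": "paper", "p3": "paper",
         "a": "author1", "b": "author2"}
EDGES = {"p1": ["a", "b"], "p2": ["a"], "p3": ["b"]}


def descendants(src):
    """Return every node reachable from src by a path of length >= 1."""
    seen, stack = set(), list(EDGES.get(src, []))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(EDGES.get(v, []))
    return seen


def satisfies(v, q):
    """True if data node v satisfies query node q and q's structural predicate."""
    if LABEL[v] != q["label"]:  # attribute predicate: label equality only
        return False
    below = descendants(v)
    # One propositional variable per child: true iff some descendant satisfies it.
    p = {c["var"]: any(satisfies(w, c) for w in below)
         for c in q.get("children", [])}
    fs = q.get("fs", lambda p: all(p.values()))  # default: plain conjunction
    return fs(p)


def answer(q):
    """Sorted data nodes matching the root of query q."""
    return sorted(v for v in LABEL if satisfies(v, q))


u2 = {"var": "u2", "label": "author1"}
u3 = {"var": "u3", "label": "author2"}
either = {"var": "u1", "label": "paper", "children": [u2, u3],
          "fs": lambda p: p["u2"] or p["u3"]}
a_without_b = {"var": "u1", "label": "paper", "children": [u2, u3],
               "fs": lambda p: p["u2"] and not p["u3"]}

print(answer(either))       # ['p1', 'p2', 'p3']
print(answer(a_without_b))  # ['p2']
```

This naive evaluation recomputes reachability for every candidate, so it only illustrates the semantics of $f_s$; it says nothing about efficient evaluation.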
Previous Approaches • On tree structures: unsuitable for graphs • Node encoding schemes are unsuitable for graphs • Some extensions are also on tree structures • Minimization has been studied • On graph structures • Time and space costs are high • On graph pattern matching • No disjunction or negation operations • On query results • Most approaches concern the complete result • Applications often request only a portion of the query nodes as the result
Fundamental Problems • Satisfiability • The answer to the query on some graph G, Q(G), is not empty • Containment, Equivalence • Q(G) ⊆ Q’(G), Q(G) = Q’(G) • Based on homomorphism • Minimization • Find an equivalent query with the minimal number of nodes
Contributions • Proposed a new class of tree pattern queries over graph-structured data: GTPQ • Proposed an approach to raise TPQ efficiency • a graph representation of intermediate results • a pruning approach for evaluating query patterns over graphs • Investigated fundamental problems • Satisfiability, containment, equivalence and minimization • Developed the algorithm GTEA
Complexity analysis • Satisfiability: A GTPQ is satisfiable if there is a data graph on which the answer to the query is non-empty • Satisfiable iff the attribute predicate and the complete structural predicate of the root are both satisfiable • NP-complete • Containment • Q1 is contained in Q2 iff there is a homomorphism from Q2 to Q1 • Co-NP-hard • Minimization • Remove all redundant query nodes • Case 1: those semantically contained by some others (containment problem) • Case 2: unsatisfiable subqueries (satisfiability problem) • Determining whether a query is minimal: NP-hard
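To make the satisfiability condition concrete, here is a minimal sketch that checks whether a propositional formula has a satisfying assignment by trying all truth assignments. It is only an illustration: the slides do not show how the complete structural predicate of the root is constructed, so the formulas below are hypothetical stand-ins, and `is_satisfiable` is a name introduced here. The exponential loop is in line with the NP-completeness stated above.

```python
from itertools import product


def is_satisfiable(formula, variables):
    """Brute force: try all 2^n truth assignments of the given variables."""
    for values in product([False, True], repeat=len(variables)):
        if formula(dict(zip(variables, values))):
            return True
    return False


# author1 but not author2: satisfiable (set p_u2 true, p_u3 false).
ok = is_satisfiable(lambda p: p["u2"] and not p["u3"], ["u2", "u3"])

# A contradictory predicate: no assignment works, so the subquery is redundant.
bad = is_satisfiable(lambda p: p["u2"] and not p["u2"], ["u2"])

print(ok, bad)  # True False
```

An unsatisfiable subquery like the second one corresponds to Case 2 of minimization.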
Existing Approaches for Conjunctive TPQ • Reachability index + Structural joins • Structural joins: decompose the pattern into smaller and simpler substructures • Binary SJoins (RJoin, ICDE’08, TKDE 2011): use 2-hop to find the reachability pairs • Complete Bipartite SJoins (HGJoin, VLDB’08): use an interval index to find the reachability pairs • Pipelined joins on trees + Naïve on non-trees (VLDB’05, 12): Path/TwigStack, Path/TwigStackD; use “pools”
Existing Approaches for Conjunctive TPQ • The index size is typically large • In particular, #index(RJoin) = $\Omega(n^2)$ • They produce large amounts of intermediate results • selectivity(query) << selectivity(substructures) • TwigStackD introduces a pre-filtering process, but it needs to scan the whole data graph • TPQ with negation and disjunction? • Decompose the pattern into a set of conjunctive TPQs and perform joins (again, producing many redundant intermediate results) • Full-fledged evaluation? • Projection
GTEA: Evaluation algorithm • Applying existing algorithms to process GTPQ • large amounts of intermediate results • not efficient for full-fledged evaluation • first find the results of the whole pattern, then perform projection • The decomposition-based approach has rather low performance • has to decompose a query into several conjunctive sub-queries • Structural-join problems • Our Approach • Stage 1: bottom-up and top-down pruning • Stage 2: construct the Maximal Matching Graph (MMG) • Stage 3: enumerate results via a graph traversal on MMG
GTEA: Evaluation algorithm • 2-Round pruning • Bottom-up: downward structural constraints • Top-down: upward structural constraints • Basic operation: use 3-hop to determine the reachability between two sets • Key idea: exploit the shared reachability using a substructure • We can also use other reachability index structures [Figure labels: A, B, u1, u2]
GTEA: Evaluation algorithm • 2-Round pruning • Bottom-up: downward structural constraints • Top-down: upward structural constraints Process a set of edges holistically
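The two pruning rounds can be sketched for the conjunctive case as follows. This is a simplified illustration under stated assumptions, not GTEA itself: it replaces the 3-hop index with a plain depth-first reachability computation, treats every query edge as a path edge, ignores disjunction and negation, and uses hypothetical names (`descendants`, `two_round_prune`) and toy data.

```python
def descendants(edges, src):
    """Every node reachable from src by a path of length >= 1."""
    seen, stack = set(), list(edges.get(src, []))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(edges.get(v, []))
    return seen


def two_round_prune(edges, labels, query, root):
    """query maps a query node to (label, [child query nodes])."""
    reach = {v: descendants(edges, v) for v in labels}
    # Initial candidates: data nodes whose label matches the query node.
    cand = {u: {v for v, lab in labels.items() if lab == qlab}
            for u, (qlab, _) in query.items()}

    def bottom_up(u):
        # Downward constraint: keep v only if it reaches a candidate of every child.
        for c in query[u][1]:
            bottom_up(c)
        cand[u] = {v for v in cand[u]
                   if all(reach[v] & cand[c] for c in query[u][1])}

    def top_down(u):
        # Upward constraint: keep a child candidate only if a surviving parent reaches it.
        covered = set().union(*(reach[v] for v in cand[u]))
        for c in query[u][1]:
            cand[c] &= covered
            top_down(c)

    bottom_up(root)
    top_down(root)
    return cand


labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B",
          "c1": "C", "c2": "C"}
edges = {"a1": ["b1"], "b1": ["c1"], "a2": ["b2"], "b3": ["c2"]}
query = {"A": ("A", ["B"]), "B": ("B", ["C"]), "C": ("C", [])}

pruned = two_round_prune(edges, labels, query, "A")
print(pruned)  # only a1, b1, c1 survive
```

In this toy run the bottom-up round discards `b2` and `a2` (no C below them), and the top-down round discards `b3` and `c2` (no A above them).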
GTEA: Evaluation algorithm • 2-Round pruning • Bottom-up: downward structural constraints • Top-down: upward structural constraints • Maximal Matching Graph (MMG) • Represents intermediate results • Vs. tuple form • smaller space complexity • easier to derive final results • Similar ideas are also used in several other studies for representing the final results (able to reduce the query complexity) [Figure: an example MMG; labels: v1, u1, u3, w1, w3]
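The contrast between a graph representation and tuple form can be sketched as below. The slides do not define the MMG precisely, so this sketch assumes one reading: a graph whose nodes are (query node, data node) pairs left after pruning, with an edge wherever a parent candidate reaches a child candidate. The names `build_mmg` and `enumerate_matches` and the toy inputs are hypothetical, and only conjunctive queries are handled.

```python
from itertools import product


def build_mmg(query, cand, reach):
    """Map each (query node, data node) pair to its matching child candidates."""
    return {(u, v): {c: sorted(w for w in cand[c] if w in reach.get(v, set()))
                     for c in children}
            for u, children in query.items() for v in cand[u]}


def enumerate_matches(mmg, query, u, v):
    """Yield one dict (query node -> data node) per match of the subtree at u."""
    per_child = [[m for w in mmg[(u, v)][c]
                  for m in enumerate_matches(mmg, query, c, w)]
                 for c in query[u]]
    for combo in product(*per_child):
        match = {u: v}
        for m in combo:
            match.update(m)
        yield match


# Pruned candidates and reachability for a query A with children B and C.
query = {"A": ["B", "C"], "B": [], "C": []}
cand = {"A": {"a1"}, "B": {"b1", "b2"}, "C": {"c1", "c2"}}
reach = {"a1": {"b1", "b2", "c1", "c2"}}

mmg = build_mmg(query, cand, reach)
results = list(enumerate_matches(mmg, query, "A", "a1"))
print(len(results))  # 4
```

Here the graph stores four edges out of `("A", "a1")`, while tuple form materializes four three-node tuples; the gap grows multiplicatively with more children and more candidates per child.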
GTEA: Evaluation algorithm • 2-Round pruning • Bottom-up: downward structural constraints • Top-down: upward structural constraints • Maximal Matching Graph • Represents intermediate results • Optimized for non-output nodes [Figure labels: GTPQ; Prime Subtree (2nd pruning); Shrunk Prime Subtree (MMG); output node]
GTEA: Experimental study • Datasets • arXiv data: 9562 nodes and 28120 edges • XMark data: 0.64M ~ 5.17M nodes, 0.77M ~ 6.20M edges • Algorithms • For tree-structured data: TwigStack, Twig2Stack • For graph-structured data: TwigStackD, HGJoin, GTEA • Experiments • The efficiency and scalability for processing conjunctive queries • The expected I/O costs • The impact of adding negation and disjunction on performance • The effectiveness of the pruning process
GTEA: Experimental study • Better even for conjunctive queries • MMG approach is effective
GTEA: Experimental study • The size of intermediate results is small
GTEA: Experimental study • Optimization for non-output results • The performance gap is significantly widened especially when the query has negation operations
Summary • Explore a new tree pattern matching query with Boolean logic on graph-structured data • Structural predicate, output nodes • Analyze computational complexities of four problems for static global optimization • Satisfiability, containment and equivalence, minimization • The first study on these problems • Propose an algorithm GTEA • Pruning approach using 3-hop • Optimization for non-output nodes • Maximal matching graph
Future Work • Query over Semantic Link Network • Different from RDF • Real-world applications • New conditions and requirements [Figure: a simple Semantic Link Network with a query; relational rules (figure labels): parentOf, fatherOf ∨ motherOf; childOf, sonOf ∨ daughterOf. Source: H. Zhuge, The Knowledge Grid, 2nd Edition, World Scientific Publishing Co., Singapore, 2012.]
Incorporating the Semantic Space H.Zhuge, The Knowledge Grid, World Scientific Publishing Co., Singapore, 2012. 2nd Edition
Problems • System • Interface • Application • Automatically generating semantic link networks • Semantics • Understand query and patterns [Figure labels: Query, Graph; "Irrelevant to size"; "Semantics?"]
References on Semantic Link Network (concerning AI and databases) • H. Zhuge, The Knowledge Grid, 2nd Edition, World Scientific Publishing Co., Singapore, 2012. Chapter 2: The Semantic Link Network. • H. Zhuge, The Web Resource Space Model, Springer, 2008. • H. Zhuge, Semantic linking through spaces for cyber-physical-socio intelligence: A methodology, Artificial Intelligence, 175 (2011) 988-1019. • H. Zhuge, Communities and Emerging Semantics in Semantic Link Network: Discovery and Learning, IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 6, 2009, pp. 785-799. • H. Zhuge, Interactive Semantics, Artificial Intelligence, 174 (2010) 190-204.