
Cost-based Optimization of Graph Queries

IDAR 2007. Cost-based Optimization of Graph Queries. Silke Trißl, Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics.


Presentation Transcript


  1. IDAR 2007. Cost-based Optimization of Graph Queries. Silke Trißl, Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics.

  2. Motivation – Biological Networks
     [Figure: biological network from http://www.genome.jp/kegg; figure labels: TYPE, Sequence, Function, Name, Location, …]
     Sections: Motivation, Optimization, Nodes, Paths, Future Work, Relational algebra, Conclusion
     Silke Trißl - Cost-based Optimization of Graph Queries - 1st PhD Workshop IDAR

  3. Querying Networks - PQL
     • Pathway Query Language (PQL) [Leser, 2005]
     • Syntax for querying graphs
     • Find subgraphs matching the query graph
     • Example: find all enzymes that are directly or indirectly affected by "Glucose"

         SELECT B
         FROM network
         LET node A, node B, path P
         WHERE A.name = 'Glucose'
           AND A ISA compound
           AND B ISA enzyme
           AND P.path = A[-*]B;

     [Figure: query graph with node A (name = Glucose, ISA compound), path P, and node B (ISA enzyme)]

  4. Node Conditions
     • Nodes can contain conditions on …
     [Figure: query graph with node A (name = Glucose, ISA compound), path P, and node B (ISA enzyme)]
     [Figure: partial TYPE hierarchy with the concepts root, interaction, molecule, macromolecule, compound, catalysis, inhibition, protein, ion, gene, mRNA, sugar, enzyme]

  5. Path Conditions
     • Paths can contain conditions on …
     [Figure: query graph with node A (name = Glucose, ISA compound), path P, and node B (ISA gene)]
     [Figure: graph with nodes a and b]

  6. Result of Graph Queries
     • Search for matching subgraphs
     • Find node and path bindings for the query variables in the network
     [Figure: query graph (A: name = Glucose, ISA compound; path P; B: ISA enzyme) next to the network]

  7. Outline
     • Motivation
     • Optimize Graph Queries
     • Evaluate node conditions
     • Evaluate path conditions
     • Future Work
     • Relational algebra for graph queries
     • Conclusion

  8. Evaluation of Node Conditions
     • Node attributes
     • Select operator (σ) on Node table
     • Node types, functions, and locations
     • Hierarchy operator (χ)
     • Return the specified concept and all successor concepts
     Query plan for node A: σname=Glucose(Node) ⋈Node.TYPE=TYPE χcompound(TYPE)
     [Figure: query graph with node A (name = Glucose, ISA compound), path P, and node B (ISA gene)]
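
The query plan for node A can be illustrated with a small Python sketch. The hierarchy excerpt, the `NODE` rows, and the function names `chi` and `bind_node_a` are assumptions made for illustration; the transcript does not preserve the actual TYPE hierarchy edges or table layout.

```python
from collections import deque

# Hypothetical excerpt of a TYPE hierarchy, stored as parent -> children.
TYPE_CHILDREN = {
    "molecule": ["macromolecule", "compound"],
    "macromolecule": ["protein", "gene", "mRNA"],
    "protein": ["enzyme"],
    "compound": ["ion", "sugar"],
}

def chi(concept, children=TYPE_CHILDREN):
    """Hierarchy operator: the concept itself plus all successor concepts."""
    result = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in result:
                result.add(child)
                queue.append(child)
    return result

# Hypothetical Node table rows: (node id, name, TYPE).
NODE = [
    (1, "Glucose", "sugar"),
    (2, "Glucose", "gene"),
    (3, "Hexokinase", "enzyme"),
]

def bind_node_a(nodes=NODE):
    """Select name = Glucose on Node, then join with chi(compound) on TYPE."""
    selected = [row for row in nodes if row[1] == "Glucose"]
    compound_types = chi("compound")
    return [row for row in selected if row[2] in compound_types]

print(bind_node_a())
```

Only the row typed as a successor of `compound` survives the join, which is the effect the plan describes.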

  9. How to evaluate Path conditions?
     • Recursively traverse the graph
     • Edge
     • Arbitrary number of joins
     • No possibility to optimize the execution
     [Figure: graph with nodes a and b; join chain Edge ⋈ Edge ⋈ Edge ⋈ … ⋈ Edge]
     Need for new logical and physical operators

  10. Path Existence Operator, Φ
     • Node variables A and B
     • Set of nodes V bound to A
     • Set of nodes W bound to B
     • Path variable P
     • Condition on P: path from A to B
     • A Φ B returns the set of node pairs (v, w) for which paths from v ∈ V to w ∈ W in G exist.
     [Figure: query graph with node A (name = Glucose, ISA compound), path P, and node B (ISA gene)]
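
A minimal sketch of the logical semantics of Φ, evaluated by plain breadth-first search. The example edge list is an assumption (the slide figures are not recoverable from the transcript), as is the choice to require at least one edge on a path.

```python
from collections import deque

# Hypothetical example graph as adjacency lists.
GRAPH = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}

def reachable_from(graph, start):
    """All nodes reachable from start over at least one edge (BFS)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen

def phi(graph, V, W):
    """Path existence: all pairs (v, w), v in V and w in W, with a path v -> w."""
    targets = set(W)
    return {(v, w) for v in V for w in reachable_from(graph, v) & targets}

print(sorted(phi(GRAPH, {"D"}, {"C", "E"})))
```

This traversal-based version is only one possible physical implementation; an index-based one would return the same pairs.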

  11. Physical Implementation of Φ
     • Graph traversal at query time
     • Breadth-first or depth-first search
     • Query precomputed index structure
     • Transitive closure (only for small graphs)
     • GRIPP [Trißl et al., 2007]
     • GRIPP index table, IND(G)
     • one instance for every node v in G

  12. GRIPP Index Creation
     • Depth-first traversal of G
     • We reach a node v for the first time
     • add tree instance of v to IND(G)
     • proceed with the traversal
     • We reach a node v again
     • add non-tree instance of v to IND(G)
     • do not traverse child nodes of v
     [Figure: graph with nodes R, A, B, C, D, E, F, G, H, annotated with pre/post intervals such as R [0,21]]
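
The traversal rules above can be sketched in Python as follows. The edge list, the child visiting order, and the row layout `(node, pre, post, kind)` are assumptions for illustration, not the published implementation; the recursive form is only suitable for small graphs.

```python
# Hypothetical example graph as adjacency lists; children are visited in list order.
GRAPH = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}

def build_gripp_index(graph, root):
    """Depth-first traversal that emits one (node, pre, post, kind) row per instance."""
    index = []
    seen = set()
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        if node in seen:
            # Reached again: non-tree instance, child nodes are not traversed.
            kind = "non-tree"
        else:
            # Reached for the first time: tree instance, traversal proceeds.
            seen.add(node)
            kind = "tree"
            for child in graph.get(node, []):
                visit(child)
        post = counter
        counter += 1
        index.append((node, pre, post, kind))

    visit(root)
    return sorted(index, key=lambda row: row[1])

for row in build_gripp_index(GRAPH, "R"):
    print(row)
```

Every traversed edge contributes one instance, so a node with several incoming edges appears once as a tree instance and otherwise as non-tree instances.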

  13. GRIPP Index Table, IND(G)
     • Is node C reachable from node D?
     [Figure: graph G with nodes R, A, B, C, D, E, F, G, H next to the GRIPP index IND(G) with pre/post intervals]

  14. Order Tree, O(G)
     • w is reachable from v iff v_pre < w_pre < v_post
     [Figure: order tree, O(G)]

  16. Query strategy – Step 1
     • Retrieve the reachable instance set of start node v, called RIS(v)
     • Retrieve RIS(D)
     • Requires only a single query on IND(G)
     • If C ∈ RIS(D)
     • return true
     • stop the search
     • Else
     • proceed to Step 2

  17. Query strategy – Step 2
     • Search for non-tree instances in RIS(v)
     • The nodes of these instances are hop nodes
     • Check every i ∈ RIS(D)
     • If i is a tree instance [G and H]
     • Done
     • If i is a non-tree instance [A and B]
     • i has no successors in O(G), but possibly in G
     • proceed to Step 3

  18. Query strategy – Step 3
     • Extend the search
     • using hop nodes v1, …, vn
     • Obtain the tree instance of node B
     • Proceed to Step 1
     • Repeat steps 1…3 until
     • an instance of node C is found
     • or no more hop nodes are available
     Depth-first traversal of O(G) using hop nodes
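
Steps 1 to 3 can be sketched as a loop over hop nodes. The index rows below are hypothetical values for a small example graph, and the sketch omits any pruning an actual GRIPP implementation may apply; function names are illustrative.

```python
# Hypothetical IND(G) rows: (node, pre, post, kind).
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non-tree"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non-tree"),
]

def tree_instance(index, node):
    """Return the (pre, post) interval of the tree instance of node."""
    for name, pre, post, kind in index:
        if name == node and kind == "tree":
            return pre, post
    raise KeyError(node)

def reachable_instance_set(index, node):
    """Step 1: instances whose pre value lies inside the tree interval of node."""
    pre_v, post_v = tree_instance(index, node)
    return [row for row in index if pre_v < row[1] < post_v]

def gripp_reachable(index, start, target):
    """Steps 1-3: search RIS(start), then extend the search through hop nodes."""
    used = set()          # nodes already used as start or hop node
    stack = [start]
    while stack:
        node = stack.pop()
        if node in used:
            continue
        used.add(node)
        ris = reachable_instance_set(index, node)
        if any(name == target for name, _, _, _ in ris):
            return True   # an instance of the target was found
        # Step 2: nodes of non-tree instances become hop nodes.
        for name, _, _, kind in ris:
            if kind == "non-tree" and name not in used:
                stack.append(name)
    return False          # no more hop nodes available

print(gripp_reachable(IND, "D", "C"))
```

With these rows, RIS(D) holds instances of G, B, H, and A; C is only found after continuing from the tree instance of hop node A.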

  19. GRIPP – Sets of Nodes
     [Figure: query graph with node variables A and B connected by path P; graph G with nodes R, A, B, C, D, E, F, G, H]

  20. GRIPP – Sets of Nodes
     • Two different strategies
     • Single node pair
     • Evaluate reachability for every node pair separately
     • Set-oriented
     • Evaluate reachability for the set in one step

  21. Query GRIPP – Single Node Pair
     • First evaluate reachability(D,E)
     • Then reachability(D,C) separately
     [Figure: both queries return true]

  22. Query GRIPP – Set-oriented
     • First query the order tree completely
     • Then search used nodes and target nodes
     • If pre_used < pre_target < post_used, return true
     [Figure: used nodes and target nodes; the matching targets are marked true]
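
One reading of the set-oriented strategy, as a hedged sketch: first collect the tree intervals of the start node and of every hop node (the used nodes), then test all target instances against those intervals in one pass. The index rows and function names are hypothetical.

```python
# Hypothetical IND(G) rows: (node, pre, post, kind).
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non-tree"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non-tree"),
]

def used_intervals(index, start):
    """Traverse the order tree completely; return {used node: (pre, post)}."""
    tree = {n: (pre, post) for n, pre, post, kind in index if kind == "tree"}
    used = {}
    stack = [start]
    while stack:
        node = stack.pop()
        if node in used:
            continue
        pre_v, post_v = tree[node]
        used[node] = (pre_v, post_v)
        # Non-tree instances inside the interval yield further hop nodes.
        for name, pre, _, kind in index:
            if kind == "non-tree" and pre_v < pre < post_v and name not in used:
                stack.append(name)
    return used

def reachable_targets(index, start, targets):
    """Return every target with pre_used < pre_target < post_used for some used node."""
    intervals = list(used_intervals(index, start).values())
    found = set()
    for name, pre, _, _ in index:
        if name in targets and any(lo < pre < hi for lo, hi in intervals):
            found.add(name)
    return found

print(sorted(reachable_targets(IND, "D", {"C", "E", "R"})))
```

The traversal cost is paid once per start node, independent of how many targets are checked afterwards.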

  23. Cost model
     • Single node pair strategy
     • query time linear in size of target set
     • better for few target nodes
     • Set-oriented strategy
     • almost constant query times
     • better for many target nodes
     [Figure: average query time for both strategies and increasing size of target node set on a graph with 10,000 nodes and 20,000 edges]
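
A cost-based optimizer could use these two observations to pick a strategy. The sketch below assumes a linear cost for the single node pair strategy and a constant cost for the set-oriented one; `cost_per_pair` and `fixed_cost` are placeholder constants, not values measured in the talk.

```python
def single_pair_cost(num_targets, cost_per_pair=1.0):
    """Assumed model: query time grows linearly with the target set size."""
    return cost_per_pair * num_targets

def set_oriented_cost(num_targets, fixed_cost=25.0):
    """Assumed model: query time is roughly constant in the target set size."""
    return fixed_cost

def choose_strategy(num_targets, cost_per_pair=1.0, fixed_cost=25.0):
    """Pick the strategy with the lower estimated cost."""
    if single_pair_cost(num_targets, cost_per_pair) <= set_oriented_cost(num_targets, fixed_cost):
        return "single-node-pair"
    return "set-oriented"

for n in (1, 10, 100, 1000):
    print(n, choose_strategy(n))
```

In practice the two constants would have to be calibrated per graph and index; the break-even point is simply `fixed_cost / cost_per_pair` targets.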

  24. Outline
     • Motivation
     • Optimize Graph Queries
     • Evaluate node conditions
     • Evaluate path conditions
     • Future Work
     • Relational algebra for graph queries
     • Conclusion

  25. Future Work
     • Towards an algebra for graph queries
     • Define new operators
     • Logical
     • Physical
     • Determine cost functions
     • Estimate the size of result sets
     • Define rewrite rules
     • Which operations can be pushed?

  26. Future Work – New Operators
     • Path length operator
     • Evaluate the length of a path
     • Possible solution
     • Store parts of paths, e.g., up to length x [Giugno & Shasha, 2002]
     [Figure: graph with nodes a and b]

  27. Future Work
     • Cost Model
     • Assign cost models to physical operators
     • Estimate the size of result sets
     • Between how many node pairs does a path exist, possibly of a certain length?
     • Possible solution
     • Sampling
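
To make the sampling idea concrete, here is a hedged sketch that estimates the result size of a path existence test by checking a random sample of node pairs and scaling up. The example graph, the uniform sampling with replacement, and the function names are assumptions; the talk only names sampling as a possible solution.

```python
import random
from collections import deque

# Hypothetical example graph as adjacency lists.
GRAPH = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}

def has_path(graph, source, target):
    """True if target is reachable from source over at least one edge (BFS)."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, []):
            if succ == target:
                return True
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False

def estimate_path_pairs(graph, V, W, sample_size, seed=0):
    """Estimate how many pairs (v, w) in V x W are connected by a path."""
    sources, targets = sorted(V), sorted(W)
    total = len(sources) * len(targets)
    if total == 0 or sample_size <= 0:
        return 0.0
    rng = random.Random(seed)
    hits = 0
    for _ in range(sample_size):
        if has_path(graph, rng.choice(sources), rng.choice(targets)):
            hits += 1
    # Scale the observed hit rate to the full cross product.
    return total * hits / sample_size

print(estimate_path_pairs(GRAPH, {"A", "B", "C"}, {"E", "H"}, sample_size=200))
```

The estimate gets more reliable with a larger sample, at the price of more reachability tests during optimization.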

  28. Rewrite Query Plan

         SELECT B
         FROM network
         LET node A, node B, path P
         WHERE A.name = 'Glucose'
           AND A ISA compound
           AND B ISA enzyme
           AND P.path = A[-*]B;

     Query plan: πB( (σname=Glucose(Node) ⋈Node.TYPE=TYPE χcompound(TYPE)) Φ (Node ⋈Node.TYPE=TYPE χenzyme(TYPE)) )
     [Figure: query graph with node A (name = Glucose, ISA compound), path P, and node B (ISA enzyme)]

  29. Better Plan?
     Plan 1: πB( (σname=Glucose(Node) ⋈Node.TYPE=TYPE χcompound(TYPE)) Φ (Node ⋈Node.TYPE=TYPE χenzyme(TYPE)) )
     Plan 2: πB( ((σname=Glucose(Node) ⋈Node.TYPE=TYPE χcompound(TYPE)) Φ Node) ⋈B.TYPE=TYPE χenzyme(TYPE) )
     [Figure: both plan trees annotated with cardinalities; the values 2,000, 18, 1, 2,000, 1, and 20,000 appear, but their positions are not recoverable]

  30. Conclusion
     • Optimize the execution of graph queries
     • Use cost-based query optimization
     • Extend relational algebra
     • New operators
     • Path existence operator, Φ
     • Path length operator
     • Cost functions
     • Estimate the size of result sets
     • Rewrite rules

  31. IDAR 2007. Thanks for your attention. Special thanks to my PhD supervisor Ulf Leser. Silke Trißl, Humboldt-Universität zu Berlin. Work sponsored by

  32. References
     • U. Leser. A query language for biological networks. Bioinformatics, 21 Suppl 2:ii33–ii39, Sep 2005.
     • B. Eckman and P. G. Brown. Graph data management for molecular and cell biology. IBM J. Res. & Dev., 50(6):545–560, Nov 2006.
     • F. Sohler and R. Zimmer. Identifying active transcription factors and kinases from expression data using pathway queries. Bioinformatics, 21 Suppl 2:ii115–ii122, Sep 2005.
     • J. McHugh and J. Widom. Query Optimization for XML. In Proc. of the VLDB Conference, pages 315–326, 1999. Morgan Kaufmann.
     • Y. Wu, J. M. Patel, and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. In Proc. of the ICDE Conference, pages 443–454, 2003. IEEE Computer Society.
     • S. Trißl and U. Leser. Fast and Practical Indexing and Querying of Very Large Graphs. In Proc. of the ACM SIGMOD Conference, to appear, 2007. ACM Press.
