IDAR 2007
Cost-based Optimization of Graph Queries
Silke Trißl
Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics
Motivation – Biological Networks
[Figure: biological network, from http://www.genome.jp/kegg. Node annotations shown: …, TYPE, Sequence, Function, Name, Location]
Silke Trißl - Cost-based Optimization of Graph Queries - 1st PhD Workshop IDAR
Querying Networks - PQL
• Pathway Query Language (PQL) [Leser, 2005]
• Syntax for querying graphs
• Find subgraphs matching the query graph
• Example: find all enzymes that are directly or indirectly affected by "Glucose"

SELECT B
FROM network
LET node A, node B, path P
WHERE A.name = 'Glucose'
  AND A ISA compound
  AND B ISA enzyme
  AND P.path = A[-*]B;

[Figure: query graph. Node A (name = Glucose, ISA compound) is connected by path P to node B (ISA enzyme)]
Node Conditions
• Nodes can contain conditions, e.g., name = Glucose or ISA compound
[Figure: query graph with node A (name = Glucose, ISA compound), path P, and node B (ISA enzyme)]
[Figure: partial TYPE hierarchy with the concepts root, interaction, molecule, macromolecule, compound, catalysis, inhibition, protein, ion, gene, mRNA, sugar, enzyme]
Path Conditions
• Paths can contain conditions
[Figure: query graph with node A (name = Glucose, ISA compound), path P, and node B (ISA gene), next to an example graph with nodes a and b]
Result of Graph Queries
• Search for matching subgraphs
• Find node and path bindings for the query variables in the network
[Figure: query graph (A: name = Glucose, ISA compound; path P; B: ISA enzyme) matched against the network]
Outline
• Motivation
• Optimize Graph Queries
  • Evaluate node conditions
  • Evaluate path conditions
• Future Work
  • Relational algebra for graph queries
• Conclusion
Evaluation of Node Conditions
• Node attributes
  • Select operator (σ) on Node table
• Node types, functions, and locations
  • Hierarchy operator (χ)
  • Returns the specified concept and all successor concepts
Query plan for node A (name = Glucose, ISA compound):
σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE)
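The plan for node A can be sketched over toy in-memory tables. The table contents, the child-to-parent encoding of the TYPE hierarchy, and the helper names `chi`, `sigma`, and `bind_node` are assumptions made for illustration; the slides do not specify a schema.

```python
# Assumed toy encoding of part of the TYPE hierarchy: child -> parent.
TYPE_PARENT = {
    "molecule": "root",
    "compound": "molecule",
    "sugar": "compound",
    "macromolecule": "molecule",
    "protein": "macromolecule",
    "enzyme": "protein",
}

# Assumed toy Node table.
NODE = [
    {"id": 1, "name": "Glucose", "type": "sugar"},
    {"id": 2, "name": "Glucose", "type": "protein"},   # right name, wrong type
    {"id": 3, "name": "Hexokinase", "type": "enzyme"},
]


def chi(concept, parent=TYPE_PARENT):
    """Hierarchy operator: the concept itself plus all successor concepts."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, par in parent.items():
            if par in result and child not in result:
                result.add(child)
                changed = True
    return result


def sigma(rows, attr, value):
    """Select operator: keep rows whose attribute equals the value."""
    return [r for r in rows if r[attr] == value]


def bind_node(rows, name, concept):
    """sigma on Node, joined with chi(concept) on Node.TYPE = TYPE."""
    types = chi(concept)
    return [r for r in sigma(rows, "name", name) if r["type"] in types]


bindings_a = bind_node(NODE, "Glucose", "compound")
```

Here only node 1 is bound to A: node 2 has the right name but a type outside the compound subtree.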
How to Evaluate Path Conditions?
• Recursively traverse the graph
• Repeated self-joins of the Edge table (Edge ⋈ Edge ⋈ Edge ⋈ … ⋈ Edge)
• Arbitrary number of joins
• No possibility to optimize the execution
• Need for new logical and physical operators
[Figure: example graph with nodes a and b]
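To illustrate why the number of joins is not known in advance, the sketch below finds all nodes reachable from a start node by joining the current frontier with an edge list once per path step. The edge list and the function name are made up for illustration.

```python
def reachable_by_joins(edges, start):
    """Return (nodes reachable from start, number of join rounds needed)."""
    reached = set()
    frontier = {start}
    rounds = 0
    while frontier:
        # One join of the frontier with Edge on frontier.node = Edge.source.
        step = {target for (source, target) in edges if source in frontier}
        frontier = step - reached      # keep only newly found nodes
        reached |= frontier
        rounds += 1
    return reached, rounds


# Hypothetical edge list: a -> x -> y -> b, plus a cycle b -> x.
EDGES = [("a", "x"), ("x", "y"), ("y", "b"), ("b", "x")]
nodes, rounds = reachable_by_joins(EDGES, "a")
```

The number of rounds depends on the data (here four joins, the last one finding nothing new), not on the query text, so a fixed relational plan cannot express it.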
Path Existence Operator, Φ
• Node variables A and B
  • Set of nodes V bound to A
  • Set of nodes W bound to B
• Path variable P
  • Condition on P: path from A to B
• A Φ B returns the set of node pairs (v, w) for which a path from v ∈ V to w ∈ W exists in G.
[Figure: query graph with node A (name = Glucose, ISA compound), path P, and node B (ISA gene)]
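A minimal sketch of the logical semantics of Φ, evaluated here with a plain breadth-first search per start node. The adjacency-dict representation, the function name, and the example node names are assumptions; the sketch also assumes that a path has at least one edge.

```python
from collections import deque


def path_existence(adj, V, W):
    """A Φ B: all pairs (v, w) with v in V, w in W and a path from v to w."""
    targets = set(W)
    pairs = set()
    for v in V:
        seen = set()
        queue = deque([v])
        while queue:
            u = queue.popleft()
            for nxt in adj.get(u, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        pairs |= {(v, w) for w in seen & targets}
    return pairs


# Hypothetical network: glucose -> g6p -> enzyme1, enzyme2 -> glucose.
ADJ = {"glucose": ["g6p"], "g6p": ["enzyme1"], "enzyme2": ["glucose"]}
result = path_existence(ADJ, V={"glucose"}, W={"enzyme1", "enzyme2"})
```

Only enzyme1 is paired with glucose; enzyme2 points to glucose but is not reachable from it.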
Physical Implementation of Φ
• Graph traversal at query time
  • Breadth-first or depth-first search
• Query precomputed index structure
  • Transitive closure (only for small graphs)
  • GRIPP [Trißl et al., 2007]
• GRIPP index table, IND(G)
  • one instance for every node v in G
GRIPP Index Creation
• Depth-first traversal of G
• We reach a node v
  • for the first time
    • add tree instance of v to IND(G)
    • proceed with the traversal
  • again
    • add non-tree instance of v to IND(G)
    • do not traverse child nodes of v
[Figure: graph G with nodes R, A, B, C, D, E, F, G, H; every instance is annotated with its [pre, post] interval, with values running from 0 to 21]
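The traversal rules above can be sketched as follows. This is an illustration on a small made-up graph, not the graph from the slide; the row layout `(node, pre, post, kind)`, the shared pre/post counter, and the assumption that every node is reachable from a single root are choices made for this sketch.

```python
def build_gripp_index(adj, root):
    """Return IND(G) as rows (node, pre, post, kind), kind 'tree' or 'non-tree'."""
    index = []
    seen = set()
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        if node in seen:
            # Reached again: add a non-tree instance, do not traverse children.
            index.append((node, pre, counter, "non-tree"))
            counter += 1
            return
        # Reached for the first time: traverse children, then close the interval.
        seen.add(node)
        for child in adj.get(node, ()):
            visit(child)
        index.append((node, pre, counter, "tree"))
        counter += 1

    visit(root)
    return sorted(index, key=lambda row: row[1])


# Hypothetical graph: r -> a, r -> b, a -> c, b -> c, b -> a.
ADJ = {"r": ["a", "b"], "a": ["c"], "b": ["c", "a"], "c": []}
IND = build_gripp_index(ADJ, "r")
```

In this example the index holds six instances for four nodes: c and a are each reached a second time from b and therefore get one extra non-tree instance.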
GRIPP Index Table, IND(G)
• Is node C reachable from node D?
[Figure: graph G with nodes R, A, B, C, D, E, F, G, H, annotated with [pre, post] intervals (values 0 to 21), next to the GRIPP index table IND(G)]
Order Tree, O(G)
• w is reachable from v iff v_pre < w_pre < v_post
[Figure: order tree, O(G)]
Query Strategy – Step 1
• Retrieve the reachable instance set of start node v, called RIS(v)
• Retrieve RIS(D)
  • Requires only a single query on IND(G)
• If C ∈ RIS(D)
  • return true
  • stop the search
• Else
  • proceed to Step 2
Query Strategy – Step 2
• Search for non-tree instances in RIS(v)
• The nodes of these instances are hop nodes
• Check every i ∈ RIS(D)
  • If i is a tree instance [G and H]
    • Done
  • If i is a non-tree instance [A and B]
    • i has no successors in O(G), but possibly in G
    • proceed to Step 3
Query Strategy – Step 3
• Extend the search using hop nodes v1, …, vn
• Obtain the tree instance of node B
• Proceed to Step 1
• Repeat Steps 1 to 3 until
  • an instance of node C is found
  • or no more hop nodes are available
• Depth-first traversal of O(G) using hop nodes
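Steps 1 to 3 can be sketched as a loop over hop nodes. The index table below is hand-made for a hypothetical graph (r -> a, r -> b, a -> c, b -> a), not the one on the slides, and the function names are illustrative. The sketch performs no pruning beyond skipping hop nodes that were already used.

```python
# Assumed IND(G) rows: (node, pre, post, kind).
INDEX = [
    ("r", 0, 9, "tree"),
    ("a", 1, 4, "tree"),
    ("c", 2, 3, "tree"),
    ("b", 5, 8, "tree"),
    ("a", 6, 7, "non-tree"),
]


def tree_interval(index, node):
    """(pre, post) of the tree instance of a node."""
    return next((pre, post) for n, pre, post, kind in index
                if n == node and kind == "tree")


def reachable_instance_set(index, node):
    """RIS(v): all instances whose pre value lies inside v's interval."""
    pre_v, post_v = tree_interval(index, node)
    return [row for row in index if pre_v < row[1] < post_v]


def reachable(index, start, target):
    used = set()                 # nodes whose RIS was already retrieved
    stack = [start]
    while stack:
        v = stack.pop()
        if v in used:
            continue
        used.add(v)
        ris = reachable_instance_set(index, v)          # Step 1
        if any(n == target for n, *_ in ris):
            return True
        for n, _pre, _post, kind in ris:                # Step 2
            if kind == "non-tree" and n not in used:
                stack.append(n)                         # Step 3: hop node
    return False


b_reaches_c = reachable(INDEX, "b", "c")
```

RIS(b) contains only the non-tree instance of a, so a becomes a hop node; its tree instance [1, 4] then contains c.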
GRIPP – Sets of Nodes
[Figure: query graph with nodes A and B connected by path P, next to graph G with nodes R, A, B, C, D, E, F, G, H]
GRIPP – Sets of Nodes
• Two different strategies
  • Single node pair
    • Evaluate reachability for every node pair separately
  • Set-oriented
    • Evaluate reachability for the set in one step
Query GRIPP – Single Node Pair
• First evaluate reachability(D, E)
• Then evaluate reachability(D, C) separately
[Figure: both checks return true]
Query GRIPP – Set-oriented
• First query the order tree completely
• Then search used nodes and target nodes
  • If pre_used < pre_target < post_used, return true
[Figure: used nodes and target nodes; both targets return true]
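One possible reading of the set-oriented strategy: traverse the order tree from the start node through all hop nodes first, then compare the collected used nodes with all target instances in a single pass. The index table belongs to a hypothetical graph (r -> a, r -> b, a -> c, b -> a), and all function names are assumptions.

```python
# Assumed IND(G) rows: (node, pre, post, kind).
INDEX = [
    ("r", 0, 9, "tree"),
    ("a", 1, 4, "tree"),
    ("c", 2, 3, "tree"),
    ("b", 5, 8, "tree"),
    ("a", 6, 7, "non-tree"),
]


def tree_interval(index, node):
    return next((pre, post) for n, pre, post, kind in index
                if n == node and kind == "tree")


def used_intervals(index, start):
    """Query the order tree completely: intervals of start and all hop nodes."""
    used = {}
    stack = [start]
    while stack:
        v = stack.pop()
        if v in used:
            continue
        pre_v, post_v = tree_interval(index, v)
        used[v] = (pre_v, post_v)
        for n, pre, _post, kind in index:
            if kind == "non-tree" and pre_v < pre < post_v:
                stack.append(n)
    return list(used.values())


def reachable_targets(index, start, targets):
    """Targets with an instance satisfying pre_used < pre_target < post_used."""
    used = used_intervals(index, start)
    return {n for n, pre, _post, _kind in index
            if n in targets
            and any(u_pre < pre < u_post for u_pre, u_post in used)}


result = reachable_targets(INDEX, "b", {"a", "c", "r"})
```

The traversal cost is paid once per start node regardless of how many targets are checked, which fits the near-constant query times reported for this strategy.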
Cost Model
• Single node pair strategy
  • query time linear in size of target set
  • better for few target nodes
• Set-oriented strategy
  • almost constant query times
  • better for many target nodes
[Figure: average query time for both strategies with increasing size of the target node set, on a graph with 10,000 nodes and 20,000 edges]
Future Work
• Towards an algebra for graph queries
• Define new operators
  • Logical
  • Physical
• Determine cost functions
  • Estimate the size of result sets
• Define rewrite rules
  • Which operations can be pushed?
Future Work – New Operators
• Path length operator
  • Evaluate the length of a path
• Possible solution
  • Store parts of paths, e.g., up to length x [Giugno & Shasha, 2002]
[Figure: example graph with nodes a and b]
Future Work
• Cost Model
  • Assign cost models to physical operators
  • Estimate the size of result sets
    • Between how many node pairs does a path exist? Possibly of a certain length?
  • Possible solution
    • Sampling
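The slides name sampling only as a possible direction. As one hypothetical way it could look, the sketch below estimates the number of connected node pairs by running a search from a random sample of start nodes and scaling the result. The graph, the function names, and the estimator itself are assumptions, not the author's method.

```python
import random
from collections import deque


def count_reachable(adj, start):
    """Number of nodes reachable from start via at least one edge."""
    seen = set()
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for nxt in adj.get(u, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


def estimate_connected_pairs(adj, nodes, sample_size, seed=0):
    """Scale the reachable-node counts of sampled start nodes to all nodes."""
    rng = random.Random(seed)
    sample = rng.sample(nodes, sample_size)
    total = sum(count_reachable(adj, v) for v in sample)
    return total * len(nodes) / sample_size


# Hypothetical graph: d -> a -> b -> c.
ADJ = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
NODES = ["a", "b", "c", "d"]
exact = estimate_connected_pairs(ADJ, NODES, sample_size=len(NODES))
```

With the sample covering every node the estimate equals the exact count; smaller samples trade accuracy for fewer traversals.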
Rewrite Query Plan

SELECT B
FROM network
LET node A, node B, path P
WHERE A.name = 'Glucose'
  AND A ISA compound
  AND B ISA enzyme
  AND P.path = A[-*]B;

[Figure: query graph. Node A (name = Glucose, ISA compound) is connected by path P to node B (ISA enzyme)]
Query plan:
π_B(
  (σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE))
  Φ
  (Node ⋈_{Node.TYPE=TYPE} χ_{enzyme}(TYPE))
)
Better Plan?
Plan with the type condition on B evaluated before Φ:
π_B(
  (σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE))
  Φ
  (Node ⋈_{Node.TYPE=TYPE} χ_{enzyme}(TYPE))
)
Plan with the type condition on B evaluated after Φ:
π_B(
  ((σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE)) Φ Node)
  ⋈_{B.TYPE=TYPE} χ_{enzyme}(TYPE)
)
[Figure: both plans, annotated with the values 2,000, 18, 1, 2,000, 1, and 20,000]
Conclusion
• Optimize the execution of graph queries
  • Use cost-based query optimization
• Extend relational algebra
  • New operators
    • Path existence operator, Φ
    • Path length operator
  • Cost functions
    • Estimate the size of result sets
  • Rewrite rules
IDAR 2007
Thanks for your attention
Special thanks to my PhD supervisor Ulf Leser
Silke Trißl
Humboldt-Universität zu Berlin
References
• U. Leser. A query language for biological networks. Bioinformatics, 21 Suppl 2:ii33–ii39, Sep 2005.
• B. Eckman and P. G. Brown. Graph data management for molecular and cell biology. IBM J. Res & Dev., 50(6):545–560, Nov 2006.
• F. Sohler and R. Zimmer. Identifying active transcription factors and kinases from expression data using pathway queries. Bioinformatics, 21 Suppl 2:ii115–ii122, Sep 2005.
• J. McHugh and J. Widom. Query Optimization for XML. In Proc. of the VLDB Conference, pages 315–326, 1999. Morgan Kaufmann.
• V. Wu, J. M. Patel, and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. In Proc. of the ICDE Conference, pages 443–454, 2003. IEEE Computer Society.
• S. Trißl and U. Leser. Fast and Practical Indexing and Querying of Very Large Graphs. In Proc. of the ACM SIGMOD Conference, to appear, 2007. ACM Press.