Covering Index for Branching Path Queries • Raghav Kaushik, University of Wisconsin • Philip Bohannon, Bell Laboratories • Jeffrey F. Naughton, University of Wisconsin • Henry F. Korth, Bell Laboratories • SIGMOD 2002 • Presented by: Yu Fan
Overview • Motivation • Problem • Introduction • Background • Covering Index Definition Scheme • Performance Study • Conclusion
Motivation • Covering indexes are a well-known technique in relational database systems • Define an index that “covers” all attributes of a table that are referenced in a query • Evaluate the query without accessing the table • Speed up query performance • Can covering indexes be used to accelerate branching path queries? • Yes
Problem • Existing indexes are large in practice • DataGuide • 1-Index • Forward and Backward Index (F&B-Index)
The Labeled Graph Data Model • Model XML or semi-structured data as a directed, node-labeled tree with an extra set of special edges called idref edges • The result is a directed graph
Branching Path Expressions • Forward and Backward Separators • If n_i and n_{i+1} are separated by • /: then n_i is the parent of n_{i+1} • //: then n_i is an ancestor of n_{i+1} • ⇒: then n_i points to n_{i+1} through an idref edge • \: then n_i is a child of n_{i+1} • \\: then n_i is a descendant of n_{i+1} • ⇐: then n_i is pointed to by n_{i+1} through an idref edge
Branching Path Expressions • Label-path • A sequence of labels l_1, l_2, …, l_p separated by the separators • Node-path • A sequence of nodes n_1, n_2, …, n_p separated by the separators • A node-path matches a label-path if the corresponding separators are the same and label(n_i) = l_i
Branching Path Expressions • The primary path is the path that remains when all parts between the brackets “[” and “]” are removed • Example: Root/metro/neighborhoods/neighborhood[/business hotel]/cultural museum
Index Graph • Index Graph I(G), where G is the data graph • For each node A in I(G), ext(A), the extent of A, is a subset of the data nodes V_G • Query result • Evaluate a branching path expression P on I(G) • The result is the union of the extents of the index nodes that result from evaluating P on I(G)
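The query-result rule above can be sketched in a few lines of Python. This is a minimal illustration restricted to simple paths that use only the "/" separator; the function name `evaluate_on_index` and the dictionary-based graph encoding are assumptions made for the example, not structures from the paper.

```python
def evaluate_on_index(path, index_labels, index_edges, extents, root):
    """Evaluate a child-step-only label path on an index graph.

    path         -- list of labels, starting with the root's label
    index_labels -- dict: index node -> label (hypothetical encoding)
    index_edges  -- set of (parent_index_node, child_index_node)
    extents      -- dict: index node -> set of data nodes, i.e. ext(A)
    """
    if index_labels[root] != path[0]:
        return set()
    frontier = {root}
    for label in path[1:]:
        # Follow one "/" step: children of the frontier with the right label.
        frontier = {b for (a, b) in index_edges
                    if a in frontier and index_labels[b] == label}
    # The answer is the union of the extents of the matching index nodes.
    result = set()
    for index_node in frontier:
        result |= extents[index_node]
    return result


index_labels = {"R": "ROOT", "A": "a", "C": "c"}
index_edges = {("R", "A"), ("A", "C")}
extents = {"R": {0}, "A": {1, 2}, "C": {3, 4}}
answer = evaluate_on_index(["ROOT", "a", "c"], index_labels, index_edges, extents, "R")
```

Whether the returned extents are exactly the answer on the data graph depends on the index being precise for the query, which is the topic of the following slides.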
Bisimilarity • Definition: a symmetric, binary relation ≈ on V_G is called a bisimulation if, for any two data nodes u and v with u ≈ v, we have that: • u and v have the same label • If par_u is the parent of u and par_v is the parent of v, then par_u ≈ par_v • If u’ points to u through an idref edge, then there is a v’ that points to v through an idref edge such that u’ ≈ v’, and vice versa.
DataGuide • Concise and accurate structural summaries of semi-structured databases
1-index • Index graph which is constructed on data graph G using bisimulation • Intuition: try to group together nodes if they have the same incoming paths
Forward and Backward Index • Construct the F&B-Index on an edge-labeled data graph • For every (edge) label l, add a new label l^{-1} • For every edge e labeled l from node u to node v, add an (inverse) edge e^{-1} with label l^{-1} from v to u • Compute the 1-Index (or DataGuide) on this modified graph
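The graph modification step can be sketched as follows. The triple encoding `(u, label, v)` and the textual suffix `"^-1"` for inverse labels are assumptions chosen for illustration.

```python
def add_inverse_edges(edges):
    """Return the edge set extended with one inverse edge per original edge.

    edges -- set of (u, label, v) triples of an edge-labeled graph
    """
    inverse = {(v, label + "^-1", u) for (u, label, v) in edges}
    return edges | inverse


edges = {(0, "metro", 1), (1, "neighborhood", 2)}
augmented = add_inverse_edges(edges)
```

Any index construction that summarizes incoming label paths can then be run on `augmented`, so that it also sees the outgoing paths of the original graph.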
Succ-Stable and Pred-Stable • For a set of nodes A, Let Succ(A) denote the set of successors of the nodes in A. • Given two sets of data graph nodes A and B, A is said to be succ-stable with respect to B if either A is a subset of Succ(B) or A and Succ(B) are disjoint • Pred-stable can be defined in the same way
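The two stability tests translate directly into set operations. The sketch below assumes edges are given as `(parent, child)` pairs, and it reads "in the same way" as replacing Succ by the set of predecessors; the function names are hypothetical.

```python
def succ(nodes, edges):
    """Successors of a node set: targets of edges that start in `nodes`."""
    return {v for (u, v) in edges if u in nodes}


def pred(nodes, edges):
    """Predecessors of a node set: sources of edges that end in `nodes`."""
    return {u for (u, v) in edges if v in nodes}


def is_succ_stable(a, b, edges):
    """A is succ-stable w.r.t. B if A is inside Succ(B) or disjoint from it."""
    s = succ(b, edges)
    return a <= s or a.isdisjoint(s)


def is_pred_stable(a, b, edges):
    """Assumed mirror definition: A is inside Pred(B) or disjoint from it."""
    p = pred(b, edges)
    return a <= p or a.isdisjoint(p)


edges = {(1, 2), (1, 3), (4, 5)}
stable = is_succ_stable({2, 3}, {1}, edges)    # both nodes are children of 1
unstable = is_succ_stable({2, 5}, {1}, edges)  # 2 is a child of 1, 5 is not
```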
Stability • If A is succ-stable with respect to B and there is an edge from B to A, then every node in the extent of A has a parent in the extent of B • Important for the precision of the index graph • Stabilizing A with respect to B • Split A into A1 and A2 • A1 = A ∩ Succ(B) • A2 = A − Succ(B) • 1-Index • Initialize by grouping nodes by label • Split the label groups until we obtain a succ-stable refinement
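A naive version of this split-until-stable loop is sketched below. It is a quadratic illustration rather than an efficient partition-refinement algorithm, it treats tree edges and idref edges uniformly as `(parent, child)` pairs, and the name `one_index_partition` is an assumption.

```python
def one_index_partition(labels, edges):
    """Split label groups until every block is succ-stable w.r.t. every block.

    labels -- dict: data node -> label
    edges  -- set of (parent, child) pairs
    """
    # Initialization: group data nodes by label.
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, set()).add(node)
    partition = [frozenset(g) for g in groups.values()]

    changed = True
    while changed:
        changed = False
        for b in partition:
            succ_b = {v for (u, v) in edges if u in b}
            for a in partition:
                a1 = a & succ_b   # A1 = A ∩ Succ(B)
                a2 = a - succ_b   # A2 = A − Succ(B)
                if a1 and a2:     # A is not succ-stable w.r.t. B: split it
                    partition.remove(a)
                    partition.extend([a1, a2])
                    changed = True
                    break
            if changed:
                break             # restart the scan on the new partition
    return set(partition)


labels = {0: "ROOT", 1: "a", 2: "a", 3: "c", 4: "c", 5: "c"}
edges = {(0, 1), (0, 2), (1, 3), (2, 4), (0, 5)}
blocks = one_index_partition(labels, edges)
```

In this example the three c nodes start in one group; node 5 is split off because its parent is the root, while nodes 3 and 4 stay together because both are reached through ROOT/a/c.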
Another View of F&B-Index • Another way to build F&B-Index • Reverse all edges in G • Compute the bisimilarity partition • Set the current partition to what is output by the previous step • Reverse edges in G again • Compute the bisimilarity partition • Set the current partition to what is output by the previous step • Repeat the above steps till the current partition does not change • Obtain a partition of the data nodes that is both succ-stable and pred-stable
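The alternating procedure can be sketched as follows. This is a small, unoptimized illustration under the same assumptions as a plain `(parent, child)` edge encoding; `stabilize` and `fb_index_partition` are hypothetical names, and "compute the bisimilarity partition" is implemented as splitting the current partition until it is succ-stable.

```python
def stabilize(partition, edges):
    """Refine `partition` until each block is succ-stable w.r.t. each block."""
    partition = list(partition)
    changed = True
    while changed:
        changed = False
        for b in partition:
            succ_b = {v for (u, v) in edges if u in b}
            for a in partition:
                a1, a2 = a & succ_b, a - succ_b
                if a1 and a2:
                    partition.remove(a)
                    partition += [a1, a2]
                    changed = True
                    break
            if changed:
                break
    return set(partition)


def fb_index_partition(labels, edges):
    """Alternate backward and forward refinement until nothing changes."""
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, set()).add(node)
    current = {frozenset(g) for g in groups.values()}
    reversed_edges = {(v, u) for (u, v) in edges}
    while True:
        backward = stabilize(current, reversed_edges)  # edges reversed
        forward = stabilize(backward, edges)           # edges reversed again
        if forward == current:                         # fixed point reached
            return current
        current = forward


labels = {0: "ROOT", 1: "a", 2: "a", 7: "a",
          3: "c", 4: "c", 8: "c", 5: "d", 6: "d"}
edges = {(0, 1), (0, 2), (0, 7), (1, 3), (2, 4), (7, 8), (1, 5), (2, 6)}
fb_blocks = fb_index_partition(labels, edges)
```

Here node 7 has no d child, so it is separated from nodes 1 and 2, and its c child 8 is in turn separated from nodes 3 and 4. Refinement on incoming paths alone would keep all three a nodes in one block.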
Size of the F&B-Index • F&B-Index over a data graph G covers all branching path expressions over G • Any index graph that covers all branching path expressions over G must be a refinement of F&B Index • F&B-Index is the smallest index graph that covers all branching path expressions over G • F&B-Index is often big. It can approach the size of the base data itself
Covering Index Definition Scheme • Eliminating branching path expressions which are deemed less important. • Smaller index handling the remaining branching queries more efficiently • Four approaches towards the goal • Tags to be indexed • Tree edges vs idref edges • Exploiting local similarity • Restricting tree depth
Tags to be indexed • Tags that are never queried • Need not be indexed • Replace the label with a unique label: other • If a node is not on the tree path to any indexed node, it can be assumed to be absent • Can have a large effect in practice • XMark data, 100 MB (1.43M nodes) • F&B-Index has 436000 nodes • Ignore text tags such as bold and emph • Number of nodes drops to 18000
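One possible reading of this preprocessing step is sketched below: tags outside the indexed set are relabeled to `other`, and a relabeled node is dropped when it does not lie on the tree path from the root to any indexed node. The function name, the `parent` map encoding and the tiny document are assumptions for illustration.

```python
def relabel_and_prune(labels, parent, indexed_tags, other="other"):
    """Relabel unindexed tags and drop nodes that lead to no indexed node.

    labels -- dict: node -> tag
    parent -- dict: node -> tree parent (the root maps to None)
    """
    relabeled = {n: (t if t in indexed_tags else other)
                 for n, t in labels.items()}
    # Keep every indexed node together with all of its tree ancestors.
    keep = set()
    for node, tag in relabeled.items():
        if tag != other:
            while node is not None and node not in keep:
                keep.add(node)
                node = parent[node]
    return {n: t for n, t in relabeled.items() if n in keep}


labels = {0: "ROOT", 1: "item", 2: "text", 3: "bold", 4: "keyword", 5: "emph"}
parent = {0: None, 1: 0, 2: 1, 3: 2, 4: 3, 5: 2}
reduced = relabel_and_prune(labels, parent, {"ROOT", "item", "keyword"})
```

The `text` and `bold` nodes survive as `other` because they lie on the path to the indexed `keyword` node; the `emph` node is removed.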
Tree Edges vs idref Edges • Effect of idref edges • XMark data • F&B-Index on tree edges and idref edges has 1.35M nodes (ignoring text nodes) • F&B-Index on only tree edges has 18000 nodes (ignoring text nodes) • Give tree edges priority • Specify the set of idref edges to be indexed
Exploiting Local similarity • Observations: • Most queries refer to short paths and seldom ask for long paths • Two nodes are locally similar, but they may be stored in different extents due to a variety of complex paths • Exploiting local similarity • Give up absolute precision and group similar pieces of data together • A(k)-Index
K-bisimulation • Definition: ≈^k (k-bisimilarity) is defined inductively • For any two nodes u and v, u ≈^0 v iff u and v have the same label • Node u ≈^k v iff u ≈^{k-1} v, par_u ≈^{k-1} par_v, and • For every u’ that points to u through an idref edge, there is a v’ that points to v through an idref edge such that u’ ≈^{k-1} v’, and vice versa
A(k)-Index • Constructed on data graph G using k-bisimulation • Precise for any simple path expression of length less than or equal to k • Use k to control the size of the index and the maximum area of the index graph affected • Increasing k refines the partition until a fixed point is reached, which is the 1-Index.
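The inductive definition of k-bisimilarity suggests a simple round-based computation, sketched here. Each round pairs a node's previous class with the set of its predecessors' previous classes; tree parents and idref predecessors are treated uniformly as `(parent, child)` edges, and the name `k_bisim_partition` is an assumption.

```python
def k_bisim_partition(labels, edges, k):
    """Partition data nodes by k-bisimilarity using k refinement rounds.

    labels -- dict: data node -> label
    edges  -- set of (parent, child) pairs
    """
    preds = {n: set() for n in labels}
    for u, v in edges:
        preds[v].add(u)

    cls = dict(labels)  # round 0: nodes are equivalent iff labels are equal
    for _ in range(k):
        # Round i: same class in round i-1 and same set of predecessor classes.
        cls = {n: (cls[n], frozenset(cls[p] for p in preds[n]))
               for n in labels}

    blocks = {}
    for node, c in cls.items():
        blocks.setdefault(c, set()).add(node)
    return {frozenset(b) for b in blocks.values()}


labels = {0: "ROOT", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = {(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)}
a1_blocks = k_bisim_partition(labels, edges, 1)
a2_blocks = k_bisim_partition(labels, edges, 2)
```

With k = 1 the two d nodes still share a block because both parents are labeled c, although the c nodes themselves are already separated. With k = 2 the d nodes are separated as well, which shows how k trades index size against the path length for which the index is precise.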
Restricting Tree Depth • Tree Depth • Given a branching path expression • Nodes on the primary path have tree-depth 0 • Nodes that do not have tree-depth 0 and have a path from some node in the primary path have tree-depth 1 • Nodes that do not have tree-depth 1 and have a path to some node of tree-depth 1 have tree-depth 2 • Nodes that do not have tree-depth 2 and have a path from some node of tree-depth 2 have tree-depth 3 • And so on… • The tree depth of a query is the maximum tree-depth of its nodes
Tree Depth Example • Query example • //museums/history/museum[/featured and cultural\neighborhood [/cultural museum [\art]]] • asks for history museums that have a featured exhibit and also have an art museum in the same neighborhood
F+B-Index • Consider one iteration of the F&B-Index computation • Reverse all edges in G • Compute the bisimilarity partition • Reverse edges in G again • Compute the bisimilarity partition • Call this index graph the F+B-Index • F+B+F+B-Index: two iterations
F+B-Index • The F+B-Index is accurate for branching path expressions that have tree depth at most 1 • The F+B+F+B-Index is accurate for branching path expressions that have tree depth at most 3 • Cannot handle all queries • Meaningful queries often have small tree depth
Putting it together • Index definition • A set of tags T to be indexed • For each of the forward and backward directions • Set of idref edges to be indexed (denoted ref_fwd and ref_back) • The extent of local similarity desired (denoted k_fwd and k_back) • Tree depth td, the number of iterations of the F&B-Index computation to be performed
Example • Tags to be indexed • ROOT, metro, cinema-hall, neighborhoods, neighborhood, business • Local similarity • k_fwd = k_back = ∞ • td = ∞ • [Figure: resulting index graph with nodes labeled ROOT, metro, neighborhoods, neighborhood, business, cinema-halls and cinema-hall]
Index Selection • Given a query, an index covers it if • Every tag in the query is indexed • k_fwd ≥ path length of the query • k_back ≥ path length of the query • td ≥ tree depth of the query • The more generic the index, the more queries it covers, but the worse the performance • Depends heavily on the data and the queries
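The selection conditions can be written as a small predicate. The sketch below models only the tag set, k_fwd, k_back and td; the idref edge sets are left out for brevity, and the class names, field names and sample queries are assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexDefinition:
    tags: frozenset   # set of tags T to be indexed
    k_fwd: float      # forward local similarity (math.inf for unbounded)
    k_back: float     # backward local similarity
    td: float         # tree depth supported by the index


@dataclass(frozen=True)
class QueryProfile:
    tags: frozenset   # tags mentioned in the query
    path_length: int  # path length of the query
    tree_depth: int   # tree depth of the query


def covers(index, query):
    """Check the four selection conditions listed on the slide."""
    return (query.tags <= index.tags
            and index.k_fwd >= query.path_length
            and index.k_back >= query.path_length
            and index.td >= query.tree_depth)


index = IndexDefinition(
    tags=frozenset({"ROOT", "metro", "cinema-hall",
                    "neighborhoods", "neighborhood", "business"}),
    k_fwd=math.inf, k_back=math.inf, td=math.inf)
covered = covers(index, QueryProfile(
    frozenset({"metro", "neighborhood", "business"}), 3, 1))
not_covered = covers(index, QueryProfile(
    frozenset({"metro", "museum"}), 2, 0))
```

The second query is rejected only because the tag `museum` is not in the indexed set, which mirrors the trade-off on the slide: a smaller tag set gives a smaller index that covers fewer queries.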
Performance study • XMark XML benchmark dataset • Models an auction site
Performance on Queries • Use index definitions 5, 6 and 8, called I_all, I_almost-all and I_specific • Use 5 different queries • Some indexes may not cover all the queries because of the reductions • Three scenarios • RELSTORE: data stored in a relational system • NSTORE: data stored using a native storage engine • RELPUBLISH: data stored in a relational system and queries are over an XML view of the data
Performance on Queries • [Figure: query performance in three panels: (a) RELSTORE, (b) NSTORE, (c) RELPUBLISH]
Conclusion • Covering indexes are a promising approach to the efficient evaluation of branching path queries • The F&B-Index can be a covering index for the set of all branching path queries, but the size of the index is too big in practice • Using the index definition scheme, we can get much smaller covering indexes that cover certain classes of queries