Twig2Stack is a methodology for efficient processing of Generalized Tree-Pattern (GTP) queries in XML documents. It minimizes intermediate results and optimizes query processing time. The approach eliminates redundant matches, improves memory usage, and handles grouping and duplicates effectively. Twig2Stack offers a hybrid approach, combining top-down and bottom-up strategies, resulting in a more streamlined XML query processing framework.
Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents
Songting Chen, Hua-Gang Li *, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal and K. Selcuk Candan
NEC Laboratories America
* University of California, Santa Barbara
Background
• XML
  • Hierarchical (tree) structured data
  • Provides flexibility to model semi-structured data
  • Widely accepted as a universal data exchange format
• Query over XML
  • XPath, XQuery [W3C]
  • Extensively used by many applications
  • Adopted by a number of commercial systems
VLDB' 2006. Seoul, Korea
State-of-the-art: XML Query Processing
Algebraic Approach
• Binary Structure Joins [Timber] – Large intermediate results
• Optimize multiple path expressions of XQuery [Chen, et al.] – Expensive post-processing
Holistic Approach ?
• TwigStack [Bruno, et al.]
• PathStack [Bruno, et al.]
• Twig2Stack
Processing Generalized Tree Pattern (GTP) Queries
XQuery: FOR $b in //A[E]/B, $d in $b/$D LET $c = $b/C RETURN $b, $c, $d
• Structural Joins
• Structural Outer Joins – Grouping
• Duplication Elimination
• Sort
Our goal: Avoid ALL these!
[Figure: GTP query tree with nodes A, B, D, C; example matches a1, a2, b1, b2 for //A//B and //A/B]
Motivation: PathStack [Bruno et al.]
• Query: //A//B; Data: elements a1, a2, b1, b2 (see figure)
• Key observation: minimize intermediate results through compact representation of path matches, by
  • Inter-node: record the ancestor-descendant (AD) relationship between elements in different query nodes, e.g., b1→a2, b2→a2
  • Intra-node: record the AD relationship between elements within the same query node, e.g., b1, b2
• TwigStack [Bruno et al.] minimizes intermediate results through:
  • Output only those path matches that are in final twig results
  • However, such optimality cannot be guaranteed [Choi, et al.]
  • Not helpful for processing GTP queries
• Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?
  • Useful for processing GTP queries as well?
[Figure: data elements a1, a2, b1, b2 and the stacks S[A], S[B]]
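The stack encoding above can be illustrated with a minimal sketch for the fixed path query //A//B. This is a simplification, not the published PathStack algorithm: the `Elem` type, the function name `path_stack_ab`, and the region numbers given to a1, a2, b1, b2 are assumptions made for illustration (the slide gives only the element names).

```python
from collections import namedtuple

# Hypothetical region encoding: an element spans [left, right] in document order.
Elem = namedtuple("Elem", "name tag left right")

# Assumed nesting a1 > a2 > b1 > b2; the numbers are illustrative only.
DATA = [
    Elem("a1", "A", 1, 8),
    Elem("a2", "A", 2, 7),
    Elem("b1", "B", 3, 6),
    Elem("b2", "B", 4, 5),
]

def path_stack_ab(elements):
    """Simplified stack-based evaluation of //A//B."""
    stack_a = []    # open A elements; each one is nested in the entry below it
    stack_b = []    # open B elements
    pointers = {}   # inter-node edge: B name -> name of top of S[A] at push time
    results = []
    for e in sorted(elements, key=lambda x: x.left):   # document order
        # Drop entries that ended before e starts; they cannot be ancestors of e.
        while stack_a and stack_a[-1].right < e.left:
            stack_a.pop()
        while stack_b and stack_b[-1].right < e.left:
            stack_b.pop()
        if e.tag == "A":
            stack_a.append(e)
        elif e.tag == "B" and stack_a:
            stack_b.append(e)
            pointers[e.name] = stack_a[-1].name
            # Every A at or below the pointed-to entry is an ancestor of e.
            for a in stack_a:
                results.append((a.name, e.name))
    return results, pointers

matches, pointers = path_stack_ab(DATA)
print(matches)
print(pointers)
```

Only one pointer per B element is stored (b1→a2, b2→a2); the pairs with a1 are recovered from the stack order rather than stored explicitly, which is the compactness the slide refers to.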
Hierarchical Stack Encoding
• Inter-node: //A//B
  • Can still use explicit edges
• Intra-node: A
  • Matching elements form a tree structure as well
• Associate each query node with a hierarchical stack
• Push element e into hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E
  • Matching can be determined when the entire sub-tree of e has been seen
  • Requires post-order document traversal
[Figure: elements a1, a2, a3, a4 arranged as a tree inside HS[A]]
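A rough sketch of the push rule above, under stated assumptions: each HS[E] is kept as a flat list rather than the tree-shaped hierarchical stack, the query `//A[.//B][./C]` and the small document are hypothetical, and `bottom_up_match` is an illustrative name. It shows only why post-order traversal lets the sub-twig test be decided at the moment an element closes.

```python
from collections import namedtuple

Elem = namedtuple("Elem", "name tag left right level")

# Hypothetical twig //A[.//B][./C]: query node -> list of (child node, axis).
# "AD" = ancestor-descendant, "PC" = parent-child.
QUERY = {"A": [("B", "AD"), ("C", "PC")], "B": [], "C": []}

# Hypothetical document: a1 contains a2, c1, b2; a2 contains b1; b1 contains c2.
DOC = [
    Elem("a1", "A", 1, 12, 1),
    Elem("a2", "A", 2, 7, 2),
    Elem("b1", "B", 3, 6, 3),
    Elem("c2", "C", 4, 5, 4),
    Elem("c1", "C", 8, 9, 2),
    Elem("b2", "B", 10, 11, 2),
]

def bottom_up_match(elements, query):
    hs = {q: [] for q in query}   # HS[E], simplified here to a flat list
    edges = {}                    # element name -> {child query node: [names]}
    # Sorting by right endpoint visits every element after its whole sub-tree.
    for e in sorted(elements, key=lambda x: x.right):
        if e.tag not in query:
            continue
        found = {}
        for child, axis in query[e.tag]:
            found[child] = [
                d.name for d in hs[child]
                if e.left < d.left and d.right < e.right
                and (axis == "AD" or d.level == e.level + 1)
            ]
        # Push e only if every child query node has at least one match.
        if all(found.values()):
            hs[e.tag].append(e)
            edges[e.name] = found
    return hs, edges

hs, edges = bottom_up_match(DOC, QUERY)
print([e.name for e in hs["A"]])
print(edges["a1"])
```

Here a2 is never pushed: its only C descendant (c2) is not a direct child, so the sub-twig rooted at A is not satisfied, and no later step has to undo a speculative match.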
Twig2Stack: Running Example
Query: A with child B; B with descendants D and C (paths //A/B//D and //A/B//C)
Document elements ([left, right], level):
• a1 [1,20], 1
• a2 [2,15], 2
• b3 [16,19], 2
• b1 [3,14], 3
• d3 [17,18], 3
• d1 [4,11], 4
• c2 [12,13], 4
• b2 [5,10], 5
• d2 [6,7], 6
• c1 [8,9], 6
[Figure: hierarchical stacks HS[A], HS[B] (b1, b2), HS[D] (d1, d2, d3), HS[C] (c1, c2); merging stacks]
TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together. Twig2Stack requires neither path joins nor path enumeration!
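The path-match counts quoted on this slide can be checked directly from the region encoding. The sketch below is a brute-force check, not Twig2Stack or TwigStack; it assumes the reading of the figure in which each `[left, right], level` label belongs to the element named next to it, and the helper names are illustrative.

```python
from collections import namedtuple

Elem = namedtuple("Elem", "name tag left right level")

# Running-example document, as read from the figure labels.
DOC = [
    Elem("a1", "A", 1, 20, 1),
    Elem("a2", "A", 2, 15, 2),
    Elem("b3", "B", 16, 19, 2),
    Elem("b1", "B", 3, 14, 3),
    Elem("d3", "D", 17, 18, 3),
    Elem("d1", "D", 4, 11, 4),
    Elem("c2", "C", 12, 13, 4),
    Elem("b2", "B", 5, 10, 5),
    Elem("d2", "D", 6, 7, 6),
    Elem("c1", "C", 8, 9, 6),
]

def is_ancestor(a, d):
    # a is an ancestor of d iff a's region strictly contains d's region.
    return a.left < d.left and d.right < a.right

def is_parent(a, d):
    return is_ancestor(a, d) and d.level == a.level + 1

def by_tag(tag):
    return [e for e in DOC if e.tag == tag]

# A/B pairs (parent-child axis).
ab = [(a, b) for a in by_tag("A") for b in by_tag("B") if is_parent(a, b)]

# Path matches that a path-based approach would enumerate and then join.
path_d = [(a.name, b.name, d.name)
          for a, b in ab for d in by_tag("D") if is_ancestor(b, d)]
path_c = [(a.name, b.name, c.name)
          for a, b in ab for c in by_tag("C") if is_ancestor(b, c)]

# Full twig matches: B needs both a D and a C descendant.
twig = [(a.name, b.name, d.name, c.name)
        for a, b in ab
        for d in by_tag("D") if is_ancestor(b, d)
        for c in by_tag("C") if is_ancestor(b, c)]

print(len(path_d), len(path_c), len(twig))
```

The path match (a1, b3, d3) never contributes to a twig match because b3 has no C descendant, which is the kind of wasted intermediate result the slide is pointing at.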
GTP Result Enumeration
• Bottom-up Computation vs. Top-down Enumeration
  • Visit only those that are in the twig matches
• Handling grouping results
  • Automatic grouping through Inter-node edges
• Handling duplicates and out-of-order results
  • Problems coming from non-return nodes
  • If D is return node while B is not
  • b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
  • Observation: Intra-node hierarchy provides hints
[Figure: example elements a4, b1, b2, c1, c2, d1, d2, d3]
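One possible reading of the duplicate problem and the intra-node hint, sketched under assumptions: B is a non-return node connected to return node D by an ancestor-descendant axis, b2 is nested inside b1, and the region numbers are made up to reproduce b1 → d1, d2, d3 and b2 → d2, d3. The function names are illustrative, and this is not claimed to be the enumeration algorithm used by Twig2Stack.

```python
from collections import namedtuple

Elem = namedtuple("Elem", "name left right")

# Hypothetical regions: b2 is nested in b1; d1 lies only under b1.
B_MATCHES = [Elem("b1", 1, 10), Elem("b2", 4, 9)]
D_MATCHES = [Elem("d1", 2, 3), Elem("d2", 5, 6), Elem("d3", 7, 8)]

def contains(a, d):
    return a.left < d.left and d.right < a.right

def naive_enumeration(bs, ds):
    # Visit every B match and emit its D descendants: nested Bs repeat work.
    return [d.name for b in bs for d in ds if contains(b, d)]

def root_only_enumeration(bs, ds):
    # Use the intra-node hierarchy: a B nested in another B can only reach
    # D elements its ancestor already reaches, so visit top-level Bs only.
    roots = [b for b in bs if not any(contains(o, b) for o in bs)]
    return [d.name for b in roots for d in ds if contains(b, d)]

print(naive_enumeration(B_MATCHES, D_MATCHES))
print(root_only_enumeration(B_MATCHES, D_MATCHES))
```

Because top-level B regions are disjoint, the second function emits each D element once and in document order, without a separate duplicate-elimination or sort step.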
Experiment Setup
• Implementation
  • Twig2Stack: Java 1.4.2
  • TwigStack, TJFast: Java 1.4.2
    • Kindly provided by Jiaheng Lu from National University of Singapore (NUS)
• Datasets
  • XMark, DBLP, TreeBank
• Metrics
  • Query processing time
  • IO time
Processing Full Twig Queries
• Optimization of Query Processing: TwigStack, Twig2Stack
• Optimization of IO: TJFast
Not yet done: Memory Usage
• Hierarchical Stack Encoding could hold entire document in memory in the worst case
• Unlike DOM approach, only matches need to be stored
  • Tag match
  • (Partial) twig match
  • Predicate evaluation
• Early result enumeration dramatically reduces the memory usage
  • Enumerate query results before the end of document and release buffer
  • Main idea: hybrid of top-down (PathStack) and bottom-up (Twig2Stack) approaches
Early Result Enumeration (ERM)
• Enumerate results and release buffer when elements in top-branch node are popped from PathStack
[Figure: running example document (a1 [1,20], a2 [2,15], b3 [16,19], d3 [17,18], b1 [3,14], c2 [12,13], d1 [4,11], b2 [5,10], c1 [8,9], d2 [6,7]) with PathStack stacks S[A], S[B], S[D], S[C] and hierarchical stacks HS[A], HS[B], HS[D], HS[C]]
Memory Usage
[Figure: memory usage charts. dblp, small sub-tree: article, title, year. site / open_auctions, huge sub-tree: bid, reserve, bidder, increase]
Conclusions and Future Work
• Proposed a bottom-up GTP processing solution
  • A twig encoding scheme
  • A GTP enumeration algorithm that avoids any post-processing operations
  • A hybrid scheme to reduce memory usage
• Future directions
  • Handling worst case memory issues
  • Optimizing IO cost by exploiting indexes
  • Handling other axes, full XQuery, graph input
  • Handling XML streams
  • …
Processing GTP
• Optimization of non-return nodes
• Automatic grouping