Efficient Processing of Ordered XML Twig Pattern

by Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni
Presented by: Tian Yu, 23 Aug 2005

Outline
• Introduction and motivation
• Background
  • XML tree and twig pattern matching
  • Previous two algorithms: TwigStack and TwigStackList
• Our Ordered Twig Algorithms
  • Ordered Children Extension (OCE for short)
  • A generalized holistic matching algorithm: OrderedTJ
• Experiments
• Conclusion

Introduction
• XML is rapidly gaining popularity as a data representation.
• XML documents are modeled as ordered trees.
• XML queries specify patterns of selection predicates on multiple elements that have some structural relationships (parent-child, ancestor-descendant).

What is a Twig Pattern?
• A twig pattern is a small tree whose nodes are tags, attributes or text values and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.
• E.g. query description: select Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements having a child element Title.
• Twig pattern: [Figure: twig with nodes Section, Title, Paragraph, Figure]

Motivation
• Since XML documents are modeled as ordered trees, it is natural to have ordered queries.
• Four ordered axes: following-sibling, preceding-sibling, following, preceding.
• Example:
  ordered query: //book/title/following-sibling::chapter
  unordered query: //book/title/chapter

Order axis
• Four axes: following-sibling, preceding-sibling, following, and preceding.
• In the sample document, set the context node to be f:
  Following of f: i and j
  Preceding of f: b, c and e
  Following-sibling of f: i
  Preceding-sibling of f: e
• Following-sibling of f = the following nodes of f that share the same parent with f
• Preceding-sibling of f = the preceding nodes of f that share the same parent with f
[Figure: sample XML document with nodes a, b, c, d, e, f, g, h, i, j]
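The four axes can be sketched in a few lines of Python. The exact shape of the sample tree is not recoverable from the slide, so the `TREE` below is a hypothetical reconstruction chosen to agree with the axis results listed for context node f; the function names `label` and `axes` are likewise illustrative. Each node gets a start and end position from a depth-first walk, and the axes follow from comparing those positions.

```python
# Hypothetical tree shape (parent -> children in document order), chosen so
# that the results for context node f match the ones listed on the slide.
TREE = {"a": ["b", "d", "j"], "b": ["c"], "d": ["e", "f", "i"], "f": ["g", "h"]}


def label(tree, root):
    """Assign (start, end, parent) to every node with a depth-first walk."""
    labels = {}
    counter = 0

    def visit(node, parent):
        nonlocal counter
        counter += 1
        start = counter
        for child in tree.get(node, []):
            visit(child, node)
        counter += 1
        labels[node] = (start, counter, parent)

    visit(root, None)
    return labels


def axes(labels, ctx):
    """Compute the four ordered axes of context node ctx."""
    start, end, parent = labels[ctx]
    # following: nodes that start after ctx ends (this excludes descendants)
    following = [n for n, (s, _, _) in labels.items() if s > end]
    # preceding: nodes that end before ctx starts (this excludes ancestors)
    preceding = [n for n, (_, e, _) in labels.items() if e < start]
    return {
        "following": sorted(following),
        "preceding": sorted(preceding),
        "following-sibling": sorted(n for n in following if labels[n][2] == parent),
        "preceding-sibling": sorted(n for n in preceding if labels[n][2] == parent),
    }


result = axes(label(TREE, "a"), "f")
for axis, nodes in result.items():
    print(axis, nodes)
```

The sibling axes are computed exactly as the slide states them: take the following (or preceding) nodes and keep those that share the context node's parent.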

Ordered Twig Pattern
• //chapter[title=“related work”]/following::section
• Intuitive meaning: search for all the sections that appear after (but are not descendants of) chapter elements with the title “related work” in the XML document.
• The query node Book is ordered.

Ordered Twig Pattern
• //chapter[title=“related work”]/following::section
• If the twig pattern is unordered: section1, section2, and section3 are all matching elements.
• For the ordered query, section1 and section2 are not in the solution. How does our method detect this?

Motivation
• Naïve method: use an existing algorithm to output the intermediate path solutions for each individual root-leaf query path, then merge the path solutions so that the final solutions are guaranteed to satisfy the order predicates of the query.
• Disadvantage of the naïve method: many intermediate results may not contribute to final answers.
• Our solution: efficient processing of ordered XML twig patterns.

XML Twig Pattern Matching
• An XML document is commonly modeled as a rooted, ordered and tagged tree.
[Figure: document tree rooted at book, with preface, chapter, section, title and paragraph elements and text values such as “Intro”, “Data” and “XML”]

Region Coding
• Node label¹: (startPos, endPos, LevelNum)
• E.g. book (1,21,1); preface (2,4,2), chapter (5,12,2), chapter (13,20,2); “Intro” (3,3,3), title (6,8,3), section (9,11,3), title (14,16,3), section (17,19,3); text values (7,7,4), (10,10,4), (15,15,4), (18,18,4)
1. M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.

Region Coding
• Given e1, e2: e1 is an ancestor of e2 iff e1.start < e2.start and e1.end > e2.end.
[Figure: the region-coded document tree, with e1 marked at book (1,21,1) and e2 at one of its descendants]

Region Coding
• Given e1, e2: e1 is the parent of e2 iff e1.start < e2.start and e1.end > e2.end, and e1.level + 1 = e2.level.
[Figure: the region-coded document tree, with e1 marked at book (1,21,1) and e2 at one of its children]
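The two region-coding tests translate directly into code. The sketch below is illustrative only: the `Region` type and the function names are assumptions, while the labels are the ones shown for the book, chapter and section elements on the Region Coding slides. The `precedes` helper is the `end < start` comparison that the OCE definition later uses for ordering.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Region code of one element: (startPos, endPos, LevelNum)."""
    start: int
    end: int
    level: int


def is_ancestor(e1: Region, e2: Region) -> bool:
    # e1 strictly contains e2
    return e1.start < e2.start and e1.end > e2.end


def is_parent(e1: Region, e2: Region) -> bool:
    # containment plus exactly one level of difference
    return is_ancestor(e1, e2) and e1.level + 1 == e2.level


def precedes(e1: Region, e2: Region) -> bool:
    # e1 ends before e2 starts, so e2 follows e1 in document order
    return e1.end < e2.start


book = Region(1, 21, 1)
first_chapter = Region(5, 12, 2)
second_chapter = Region(13, 20, 2)
first_section = Region(9, 11, 3)
second_section = Region(17, 19, 3)

print(is_ancestor(book, first_section))          # book contains the section
print(is_parent(book, first_section))            # but is two levels above it
print(is_parent(first_chapter, first_section))   # the chapter is its parent
print(precedes(first_chapter, second_section))   # section follows the chapter
```

Because every test is a constant number of integer comparisons, structural relationships can be decided without walking the tree.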

Previous work: TwigStack
• TwigStack²: a holistic approach
• Two-phase algorithm:
  • Phase 1 (TwigJoin): some intermediate root-leaf paths are output
  • Phase 2 (Merge): the intermediate paths are merged to get the final results
2. N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

Sub-optimality of TwigStack
• TwigStack is optimal when the query contains only ancestor-descendant relationships.
• If the query contains any parent-child relationship, TwigStack may output some intermediate path solutions that cannot contribute to final results.
• We say that TwigStack is sub-optimal for queries with parent-child relationships.

TwigStackList
• The main problem of TwigStack is that it treats all edges as ancestor-descendant relationships in the first phase, so it is not efficient for queries with parent-child relationships.
• Improved method: TwigStackList³ [CIKM 2004]
• There is an additional list structure for each query node to cache elements that are likely to participate in final solutions.
• TwigStackList improves on TwigStack because it considers parent-child relationships in the first phase.
• TwigStackList is optimal when there is no P-C edge for branching nodes (a branching node is a node with more than one descendant or child).
3. J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.

TwigStackList vs. TwigStack
• TwigStack outputs the “useless” path solution <s1,t1>, since it doesn’t check for the parent-child relationship.
• TwigStackList has no useless output: <s1,t1> is not in the output.
[Figure: twig pattern with nodes section, title, paragraph, figure, next to an XML tree with elements s1, s2, t1, t2, t3, p1, p2, p3, f1, f2; annotation: “No parent-child relationship for branching node”]

Ordered Children Extension (OCE)
• Definition: an element e_n (of type n) has an OCE if:
  1) In the query Q, for every A-D child n’ of n (if any), there is an element e_n’ (with tag n’) that is a descendant of e_n, and e_n’ also has an OCE; and
  2) In the query Q, for every P-C child n’ of n (if any), there is an element e’ (with tag n) on the path from e_n to e_n’ such that e’ is the parent of e_n’, and e_n’ also has an OCE; and
  3) For each child (or descendant) n’ of n, if there is a node m that is the immediate right sibling of n’, there are elements e_n’ and e_m such that e_n’ is a child (or descendant) of element e_n, e_n’.end < e_m.start, and both e_n’ and e_m have an OCE.
• The first two conditions are guaranteed by TwigStackList.
• Our main focus is the third condition.

Ordered Children Extension (OCE)
• Definition, condition 3: for each child (or descendant) n’ of n, if there is a node m that is the immediate right sibling of n’, there are elements e_n’ and e_m such that e_n’ is a child (or descendant) of element e_n, e_n’.end < e_m.start, and both e_n’ and e_m have an OCE.
[Figure: ordered XML query with node n (marked “>”) and children n’ and m, next to an XML document with elements e_n, e_n’ and e_m]

Ordered Children Extension (OCE)
In an ordered XML query:
• If node n is an ordered node: to find its OCE, all three previous conditions must be checked.
• If node n is an unordered node: to find its OCE, only the first two conditions need to be checked. The last condition does not apply.

Ordered Children Extension: Example 1
[Figure: ordered query a (marked “>”) with children b, c, d, next to a document with elements a1, e1, c1, e2, b1, d1]
• a1 has an OCE:
  1) a1 has descendants b1 and d1, and child c1 (fulfills conditions 1 and 2 of the OCE definition)
  2) b1 has a right sibling element c1, and c1 has a right sibling element d1 (fulfills condition 3 of the OCE definition)

Ordered Children Extension: Example 2
[Figure: ordered query a (marked “>”) with children b, c, d, next to a document with elements a1, e1, c1, b1, d1]
• a1 doesn’t have any OCE:
  1) a1 has descendants b1 and d1, and child c1 (fulfills conditions 1 and 2 of the OCE definition)
  2) b1 has a right sibling node c1 (fulfills condition 3 of the OCE definition)
  3) However, c1 only has d1 as a descendant. There is no element with the label d that is a right sibling of element c1 (does not satisfy condition 3 of the OCE definition)
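The OCE definition and the two examples can be checked with a brute-force sketch. This is not the OrderedTJ algorithm, only a direct and inefficient reading of the three conditions. The `Elem` and `QNode` types, the function names, and the region codes of both documents are assumptions: the document trees are reconstructed from the slide text (b1 under e1 in both; d1 under e2 in Example 1 and under c1 in Example 2), and the query is taken to be an ordered node a with an A-D child b, a P-C child c and an A-D child d.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Elem:
    tag: str
    start: int
    end: int
    level: int


@dataclass
class QNode:
    tag: str
    ordered: bool = False
    children: list = field(default_factory=list)  # (edge, QNode), edge is "AD" or "PC"


def contains(a, b):
    return a.start < b.start and a.end > b.end


def candidates(doc, e, q, edge, child):
    """Elements that can play the role of query child `child` below element e."""
    found = []
    for x in doc:
        if x.tag != child.tag or not contains(e, x) or not has_oce(doc, x, child):
            continue
        if edge == "PC":
            # condition 2: an element tagged like q, on the path from e to x,
            # must be the parent of x
            if not any(p.tag == q.tag and p.level + 1 == x.level and contains(p, x)
                       and (p == e or contains(e, p)) for p in doc):
                continue
        found.append(x)
    return found


def has_oce(doc, e, q):
    per_child = [candidates(doc, e, q, edge, child) for edge, child in q.children]
    if any(not c for c in per_child):       # conditions 1 and 2
        return False
    if q.ordered:                           # condition 3, for adjacent query siblings
        for left, right in zip(per_child, per_child[1:]):
            if not any(x.end < y.start for x in left for y in right):
                return False
    return True


query = QNode("a", ordered=True,
              children=[("AD", QNode("b")), ("PC", QNode("c")), ("AD", QNode("d"))])

# Assumed reconstruction of Example 1: a1 has children e1 (holding b1), c1, e2 (holding d1).
a1 = Elem("a", 1, 12, 1)
doc1 = [a1, Elem("e", 2, 5, 2), Elem("b", 3, 4, 3), Elem("c", 6, 7, 2),
        Elem("e", 8, 11, 2), Elem("d", 9, 10, 3)]

# Assumed reconstruction of Example 2: d1 sits inside c1, so no d follows c1.
a2 = Elem("a", 1, 10, 1)
doc2 = [a2, Elem("e", 2, 5, 2), Elem("b", 3, 4, 3), Elem("c", 6, 9, 2),
        Elem("d", 7, 8, 3)]

print(has_oce(doc1, a1, query))
print(has_oce(doc2, a2, query))
```

Under these assumed trees the sketch agrees with the slides: the first check succeeds and the second fails on the c-then-d ordering. Dropping `ordered=True` from the query skips condition 3, which matches the rule for unordered nodes.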

Data structure
Each node n in the twig query has: Stream, List, and Stack
• Data stream: Tn
  • We partition an XML document into streams.
  • All elements in a stream have the same tag and are ordered by their start position.
  • The elements in each stream are read only once, from head to tail.
[Figure: example document and its streams Ta: a1, a2, a3; Tb: b1, b2; Tc: c1, c2; Td: d1, d2, d3]
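A minimal sketch of the stream structure, assuming elements are given as (tag, start, end, level) tuples. The names `build_streams` and `Cursor` are illustrative and not taken from the paper; the sample elements reuse the labels from the Region Coding slides.

```python
from collections import defaultdict


def build_streams(elements):
    """Group (tag, start, end, level) tuples by tag, each group sorted by start."""
    streams = defaultdict(list)
    for tag, start, end, level in sorted(elements, key=lambda e: e[1]):
        streams[tag].append((start, end, level))
    return dict(streams)


class Cursor:
    """Forward-only reader over one stream; each element is visited once."""

    def __init__(self, stream):
        self._stream = stream
        self._pos = 0

    def eof(self):
        return self._pos >= len(self._stream)

    def head(self):
        return None if self.eof() else self._stream[self._pos]

    def advance(self):
        self._pos += 1


# Elements in arbitrary order, labeled as on the Region Coding slides.
elements = [
    ("section", 17, 19, 3), ("book", 1, 21, 1), ("chapter", 13, 20, 2),
    ("title", 6, 8, 3), ("chapter", 5, 12, 2), ("section", 9, 11, 3),
    ("title", 14, 16, 3), ("preface", 2, 4, 2),
]
streams = build_streams(elements)
cursor = Cursor(streams["chapter"])
print(streams["chapter"], cursor.head())
```

The cursor has no way to move backwards, which mirrors the read-once, head-to-tail access described above.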

Data structure
Each node n in the twig query has: Stream, List, and Stack
• List: Ln
  • The elements in lists help to check for P-C relationships.
  • Elements in each list Ln are strictly nested from first to last, i.e. in the XML document, each element is an ancestor or parent of the element that follows it.
[Figure: example lists La: a1, a2, …; Lb: b1, …; Lc: c1; Ld: d1, d3]

Data structure
Each node n in the twig query has: Stream, List, and Stack
• Stack: Sn
  • Stacks are used to store elements that have at least one OCE.
  • Elements in the stack are potential solutions of the XML query.
  • When we insert a new element into a stack, the top element of the stack is popped if it doesn’t have an A-D relationship with the new element.
[Figure: stacks Sa, Sb, Sc, Sd for the query nodes a, b, c, d]
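The stack rule can be sketched as below, with elements reduced to (start, end) pairs. The function name `push` is illustrative. One assumption goes beyond the slide: the pop is repeated until the top is an ancestor of the new element (or the stack is empty), whereas the slide only mentions popping the top element.

```python
def push(stack, elem):
    """Push elem = (start, end), first popping entries that are not its ancestors."""
    # Assumption: keep popping, not just once, until the top contains elem.
    while stack and not (stack[-1][0] < elem[0] and stack[-1][1] > elem[1]):
        stack.pop()
    stack.append(elem)


stack = []
push(stack, (5, 12))    # first chapter
push(stack, (9, 11))    # nested section: stays on top of its ancestor
after_nested = list(stack)
push(stack, (13, 20))   # second chapter: neither entry is its ancestor
after_sibling = list(stack)
print(after_nested, after_sibling)
```

With this rule the stack always holds a chain of nested elements, bottom to top, so every entry is an ancestor of the one above it.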

A holistic matching algorithm: OrderedTJ
• We propose a general algorithm, OrderedTJ, that computes answers to an ordered query twig.
• Our key focus is to check the ordered nodes in the query and find elements that have at least one OCE.

Main function
• The OrderedTJ main function operates in two phases.
  • Phase 1: some solutions to individual query root-leaf paths are output. The ordering requirements of the ordered query are checked.
  • Phase 2: these solutions are merge-joined to compute the answers to the whole query.
[Figure: listing of the main function, with labels “Important function”, “Phase 1” and “Phase 2”]

getNext(n)
• It gets the next stream to be processed and advanced.
[Figure: listing of getNext(n), with the parts labeled “Check Order” and “Check P-C”]

An example of OrderedTJ algorithm
[Figure: document rooted at b1 with chapters c1, c2, c3, titles t1, t2, t3 (“Introduction”, “Related work”, “Algorithm”) and sections s1, s2, s3; query with ordered node Book (marked “>”) over Chapter, Title “Related work”, and Section]
• Streams:
  Book: b1
  Chapter: c1, c2, c3
  Section: s1, s2, s3
  Title: t1, t2, t3
• Steps:
  1) Partition the XML document into streams.
  2) Show the lists for nodes with a P-C child.
  3) Show the stacks of every node in the query.
  4) t1 has no descendant “Related work”. Next action: advance(Title).
  5) t2 has descendant “Related work”. Next action: insert t2 into the list of Title.
  6) c1 has no descendant title that has child “Related work”. Next action: advance(Chapter).
  7) c2 has a descendant t2 that has child “Related work”. Next action: insert c2 into the list of Chapter.
  8) s1 is not a following element of c2. Next action: advance(Section).
  9) s2 is not a following element of c2. Next action: advance(Section).
  10) b1 has an OCE. Next action: push b1 into the stack of Book.
