TJFast: Effective Processing of XML Twig Pattern Matching
Jiaheng Lu, Ting Chen and Tok Wang Ling
National University of Singapore
{lujiahen,chent,lingtw}@comp.nus.edu.sg
[1. INTRODUCTION]
Finding all the occurrences of a twig pattern in an XML database is a core operation for efficient evaluation of XML queries. Our motivation is twofold:
(1) The performance of previous holistic twig join algorithms [1][2] can be further improved.
(2) Algorithms based on region encoding CANNOT answer queries with wildcards in branching nodes.

Figure 1. An example to illustrate the limitation of region encoding.
According to the region codes, which document, Doc1 or Doc2, matches the query? By reading the region encodings of elements a, b and c alone, we CANNOT answer this wildcard branching query.

Figure 2. An example of answering a wildcard query with the Dewey scheme.
Tatarinov et al. [4] proposed a Dewey labeling scheme, which can be used to answer this wildcard query (see Fig. 2). Since b and c do not share the same parent in Doc1, only Doc2 matches the wildcard query. However, a twig join algorithm based on the Dewey scheme is not as efficient as one based on region encoding, since prefix comparison is more time-consuming than the integer comparison used in region encoding.

Figure 4. DTD for the XML tree in Fig. 3.
Figure 5. A finite state transducer for the DTD in Fig. 4.
Given an extended Dewey label, we can use the above finite state transducer to derive its path. For example:
1.6.2 → bib/book/chapter/section
1.9.1 → bib/book/chapter/title

[3. A new holistic algorithm: TJFAST]
To answer a twig pattern query, we propose a new holistic twig join algorithm, called TJFast. Unlike previous algorithms, TJFast only needs to access the labels of the query's leaf nodes to answer path and twig queries, which significantly reduces I/O cost. For example, given the path query //chapter/section/text, we only access the labels of text elements. Given the twig query //chapter/section[.//keyword]/text, we only scan keyword and text.
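The label-to-path derivation described above can be sketched in a few lines of Python. This is only an illustration, not the transducer of Fig. 5: the table CHILD_TAGS is a hypothetical stand-in for the DTD of Fig. 4, in which only the book rule is taken from the labeling example in Section 2, while the bib and chapter rules are assumptions chosen to agree with the two labels 1.6.2 and 1.9.1. Text values and other tags are not modeled. The names derive_path and matches_path_query are likewise illustrative.

```python
# Hypothetical child-tag table standing in for the DTD of Fig. 4.
CHILD_TAGS = {
    "bib": ["book"],                         # assumed rule
    "book": ["author", "title", "chapter"],  # from the labeling example
    "chapter": ["title", "section"],         # assumed rule
}
ROOT_TAG = "bib"  # assumed document root


def derive_path(label):
    """Translate an extended Dewey label such as '1.6.2' into a tag path."""
    path = [ROOT_TAG]
    for component in label.split("."):
        x = int(component)
        children = CHILD_TAGS[path[-1]]
        n = len(children)
        # The child at 1-based position k satisfies x mod n == k mod n,
        # so a remainder of 0 selects the last child.
        position = x % n or n
        path.append(children[position - 1])
    return path


def matches_path_query(label, steps):
    """Check a query of the form //s1/s2/.../sk using only the leaf label.

    Simplification: only a leading '//' followed by parent-child steps
    is handled; wildcards and inner '//' edges are not.
    """
    return derive_path(label)[-len(steps):] == steps


if __name__ == "__main__":
    for label in ["1.6.2", "1.9.1"]:
        print(label, "/".join(derive_path(label)))
    # Answer //chapter/section from the labels of section candidates only.
    print(matches_path_query("1.6.2", ["chapter", "section"]))
```

The point of the sketch is that the label of a leaf element already encodes the tags of all its ancestors, so a path query can be checked without reading the labels of chapter or book elements.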
In this paper, we extend the Dewey labeling scheme. The extended scheme not only can be used to answer wildcard queries, but also gives better performance than algorithms based on region encoding. TJFast only needs to access the labels of LEAF nodes to answer a query. Extended Dewey thus solves two problems: wildcard queries and query performance.

[2. Our new labeling scheme: EXTENDED DEWEY]
Figure 3. An example of answering a wildcard query with the Dewey scheme.
Labeling method: Given a document and its DTD, we use a modulo function to map an integer to a tag name. For example, for the DTD rule
book → author, title, chapter
let x(t) denote the last integer of the label of an element with tag t. Then x(author) mod 3 = 1, x(title) mod 3 = 2 and x(chapter) mod 3 = 0. The label of any text value ends with 0.

[4. Preliminary experiments]
Experimental setting:
• We use random data sets (with 3 million nodes) consisting of seven labels, namely a, b, ..., e. The node labels in the data are uniformly distributed.
• We issue four twig queries: a[.//b]//c, a[./b]/c, a[./b/c]/d/e and a[.//b/c]//d/e.
• We compare our method with the previous work TwigStack [1] and TwigStackList [2].

Results analysis:
• TJFast outperforms TwigStack and TwigStackList under all settings.
• The improvement is due to the fact that TJFast only scans the labels of query leaf nodes.
• Algorithms based on region encoding are comparable to TJFast only when the number of elements for internal query nodes is very small.

References:
[1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In SIGMOD Conference, pages 310-321, 2002.
[2] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.
[3] P. O'Neil et al. ORDPATHs: insert-friendly XML node labels. In SIGMOD, pages 903-908, 2004.
[4] I. Tatarinov et al. Storing and querying ordered XML using a relational database system. In Proc. of SIGMOD, pages 204-215, 2002.