
On Efficient Part-match Querying of XML Data

On Efficient Part-match Querying of XML Data. Michal Krátký, michal.kratky@vsb.cz; Marek Andrt, marek.andrt@vsb.cz. Department of Computer Science, VŠB–Technical University of Ostrava, Czech Republic. DATESO 2004.


Presentation Transcript


  1. On Efficient Part-match Querying of XML Data Michal Krátký, michal.kratky@vsb.cz Marek Andrt, marek.andrt@vsb.cz Department of Computer Science VŠB–Technical University of Ostrava Czech Republic DATESO 2004

  2. Contents • Introduction – XML, query languages, indexing XML data, part-match querying. • Multi-dimensional approach to indexing XML data. • Extension of the multi-dimensional approach for keyword-based querying. • Index data structures. • Preliminary experimental results.

  3. Introduction • Native XML database: a set of documents is the database, and the DTD (XML Schema) is its database schema. • XML query languages (XPath, XQL, XQuery, …). • A common feature is the ability to formulate paths in the XML graph (regular path expressions, XPath axes and so on). • Indexing approaches are based on relational decomposition, tries, multi-dimensional structures, signatures and so on.

  4. Part-match querying of XML data • Several approaches to keyword- or phrase-based searching have been published: XQuery-IR (WebDB'02), XKeyword (ICDE'03) and so on. • Knowledge from information retrieval (IR) is applied. • Query languages contain operators for matching term occurrences, for example contains() and ~=.

  5. Multi-dimensional approach to indexing XML data • A graph is a set of paths. An XML document is decomposed into paths and labelled paths. • Labelled path: $lp \in X_{LP}$: $s_0, s_1, \ldots, s_{lPN}$. • Path: $p \in X_P$: $idU(u_0), idU(u_1), \ldots, idU(u_{lLP}), s$, where $idU(u_i)$ is the unique number of node $u_i$.

  6. Indexes • Term index: stores the strings s_i of an XML document together with their identifiers idT(s_i). • Labelled path index: stores the points representing labelled paths. • Path index: stores the points representing paths.

  7. Example: labelled path index, path index • Labelled paths: books,book,id; books,book,title and books,book,author. The points (0,1,2); (0,1,4) and (0,1,6) are created using the idT of the element and attribute names; they receive idLP = 0, 1 and 2. • For example, the path to the value 'The Two Towers' belongs to the labelled path books,book,title with idLP 1. The vector (1,0,1,3,5) is created using the idLP, the unique numbers idU of the elements, and the idT of the term.
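
The decomposition on slides 5 to 7 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that idT, idU and idLP are handed out in document order (which reproduces the numbers on the slides), that id is an attribute, and it uses two placeholder titles because the slides name only 'The Two Towers'. The function name build_indexes is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document; the second and third titles are placeholders (assumption).
XML = """<books>
  <book id="003-04212"><title>The Two Towers</title><author>J.R.R. Tolkien</author></book>
  <book id="001-00863"><title>Placeholder Title B</title><author>J.R.R. Tolkien</author></book>
  <book id="045-00012"><title>Placeholder Title C</title><author>Joseph Heller</author></book>
</books>"""

def build_indexes(xml_text):
    term_index = {}   # term index: string -> idT
    lp_index = {}     # labelled path index: tuple of idT -> idLP
    path_index = []   # path index: (idLP, idU_0, ..., idU_l, idT of the value)
    next_idu = 0      # next unique node number idU

    def id_t(s):
        # Return the existing idT of s or assign the next free one.
        return term_index.setdefault(s, len(term_index))

    def emit(name_ids, node_ids, value):
        id_lp = lp_index.setdefault(tuple(name_ids), len(lp_index))
        path_index.append((id_lp, *node_ids, id_t(value)))

    def visit(elem, name_ids, node_ids):
        nonlocal next_idu
        name_ids = name_ids + [id_t(elem.tag)]
        node_ids = node_ids + [next_idu]
        next_idu += 1
        # Attributes are numbered like child nodes, before the element children.
        for attr, value in elem.attrib.items():
            attr_names = name_ids + [id_t(attr)]
            attr_nodes = node_ids + [next_idu]
            next_idu += 1
            emit(attr_names, attr_nodes, value)
        text = (elem.text or "").strip()
        if text:
            emit(name_ids, node_ids, text)
        for child in elem:
            visit(child, name_ids, node_ids)

    visit(ET.fromstring(xml_text), [], [])
    return term_index, lp_index, path_index

term_index, lp_index, path_index = build_indexes(XML)
print(lp_index)       # {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}
print(path_index[1])  # (1, 0, 1, 3, 5), the path to 'The Two Towers'
```

Under these numbering assumptions the term 'Joseph Heller' receives idT 12, which matches the range query shown on the next slide.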

  8. Query for values of elements and attributes • XPath query: books/book[author="Joseph Heller"] • Query processing has three phases: ● find the idT of the terms in the term index, ● find idLP 2 of the labelled path books,book,author in the labelled path index using the point query (0,1,6), ● find the points in the path index using the range query (2,0,0,0,12)×(2,max,max,max,12).
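
The three phases can be sketched as below. This is an illustrative sketch only: the indexes are plain Python containers scanned linearly, standing in for the multi-dimensional data structures of the talk, and the idT value 7 for 'J.R.R. Tolkien' is inferred from document-order numbering rather than stated on the slides. The names query_by_value and range_query are hypothetical.

```python
MAX = 2**32 - 1  # stands for "max" in the range query on the slide

# Tiny hand-built indexes using the identifiers from the slides.
TERM_INDEX = {"books": 0, "book": 1, "id": 2, "title": 4, "author": 6,
              "J.R.R. Tolkien": 7, "Joseph Heller": 12}
LP_INDEX = {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}
PATH_INDEX = [(0, 0, 1, 2, 3), (1, 0, 1, 3, 5), (2, 0, 1, 4, 7),
              (2, 0, 5, 8, 7), (2, 0, 9, 12, 12)]

def range_query(points, low, high):
    # Linear scan; a real system would use a UB-tree or an R-tree here.
    return [p for p in points
            if all(l <= x <= h for l, x, h in zip(low, p, high))]

def query_by_value(labels, value):
    # Phase 1: idT of the value and of the names from the term index.
    id_t = TERM_INDEX[value]
    name_ids = tuple(TERM_INDEX[s] for s in labels)
    # Phase 2: point query in the labelled path index gives idLP.
    id_lp = LP_INDEX[name_ids]
    # Phase 3: range query in the path index, all idU coordinates left free.
    n = len(labels)
    low = (id_lp, *([0] * n), id_t)
    high = (id_lp, *([MAX] * n), id_t)
    return range_query(PATH_INDEX, low, high)

print(query_by_value(("books", "book", "author"), "Joseph Heller"))
# [(2, 0, 9, 12, 12)]
```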

  9. Enhanced querying • XPath axes are processed by a range query or a sequence of range queries. For example, the descendant axis of a node $u$: $(0, idU(u_0), \ldots, idU(u_{l-1}), idU(u), 0, \ldots, 0) : (max_D, idU(u_0), \ldots, idU(u_{l-1}), idU(u), max_D, \ldots, max_D)$. • Regular path expressions: for example, //title[name='Chaudhri'] is processed by a complex range query. The query can be processed in one run through the multi-dimensional data structure.
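
A small sketch of the descendant-axis query box follows. It assumes that all path points have the same dimension (the talk lists points of different dimensions as an open problem) and uses a linear scan in place of the index. The helper names and the idT values 8 and 9 in the sample points are assumptions made for illustration.

```python
MAX_D = 2**32 - 1  # stands for max_D on the slide

def descendant_box(prefix_ids, dim):
    """Box matching every path that passes through the node reached by prefix_ids.

    prefix_ids is the idU sequence idU(u_0), ..., idU(u); dim is the point dimension.
    The idLP coordinate and all coordinates after the prefix are left free.
    """
    free = dim - 1 - len(prefix_ids)
    low = (0, *prefix_ids, *([0] * free))
    high = (MAX_D, *prefix_ids, *([MAX_D] * free))
    return low, high

def range_query(points, low, high):
    return [p for p in points
            if all(l <= x <= h for l, x, h in zip(low, p, high))]

# Points (idLP, idU_0, idU_1, idU_2, idT) for two of the books.
POINTS = [(0, 0, 1, 2, 3), (1, 0, 1, 3, 5), (2, 0, 1, 4, 7),
          (0, 0, 5, 6, 8), (1, 0, 5, 7, 9), (2, 0, 5, 8, 7)]

low, high = descendant_box((0, 5), dim=5)
print(range_query(POINTS, low, high))
# [(0, 0, 5, 6, 8), (1, 0, 5, 7, 9), (2, 0, 5, 8, 7)]
```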

  10. Comparison of approaches • Mainstream approaches (XISS, XPath Accelerator) index single elements (attributes). For example, the query /e1[e2='dog'] is processed by joining the results for the single elements. • Result formatting: for example, the result of the query //name consists of all matched subtrees. • The Update and Insert operations are simple to perform.

  11. Keyword-based searching • Motivation: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE • A Path-Labelled Path-Term (PLT) index is added. • The index covers a 3-dimensional space: (idP, idLP, idT). • idP is added to the point representing a path: (idP,idLP,idU0,idU1,…,idUl,s).
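
The role of the PLT index can be sketched as follows. This is one plausible reading of the slide, not the authors' implementation: each path value is split into upper-cased word terms (an assumption), one PLT point (idP, idLP, idT) is stored per distinct term of a path, and a keyword lookup fixes idLP and idT while leaving idP free. The sample values and all function names are made up for illustration.

```python
import re

def build_plt(paths):
    """paths: list of (idLP, idU tuple, text value); idP is the list position."""
    term_ids = {}      # term -> idT
    plt_points = []    # (idP, idLP, idT)
    path_points = []   # (idP, idLP, idU_0, ..., idU_l, s)
    for id_p, (id_lp, id_us, text) in enumerate(paths):
        path_points.append((id_p, id_lp, *id_us, text))
        words = re.findall(r"\w+", text.upper())
        for word in dict.fromkeys(words):  # distinct terms, original order
            id_t = term_ids.setdefault(word, len(term_ids))
            plt_points.append((id_p, id_lp, id_t))
    return term_ids, plt_points, path_points

def keyword_search(term_ids, plt_points, path_points, id_lp, keyword):
    id_t = term_ids.get(keyword.upper())
    if id_t is None:
        return []
    # Range query (0, idLP, idT):(max, idLP, idT), done here as a linear scan.
    hits = {p for (p, lp, t) in plt_points if lp == id_lp and t == id_t}
    return [pt for pt in path_points if pt[0] in hits]

# idLP 0 stands for PLAY,PERSONAE,PERSONA and idLP 1 for PLAY,TITLE (assumed).
PATHS = [(0, (0, 1, 2), "OTHELLO the Moor"),
         (0, (0, 1, 3), "IAGO ancient to Othello"),
         (0, (0, 1, 4), "CASSIO a lieutenant"),
         (1, (0, 5), "The Tragedy of Othello")]

term_ids, plt_points, path_points = build_plt(PATHS)
matches = keyword_search(term_ids, plt_points, path_points, 0, "othello")
print([m[0] for m in matches])  # [0, 1]
```

The idP values returned by the PLT lookup identify the matching paths, whose idU coordinates can then be used to navigate to the requested TITLE element.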

  12. Path-Labelled Path-Term index: example

  13. Query processing plan: example

  14. Index data structures • Paged and balanced multi-dimensional data structures: (B)UB-trees, variants of R-trees. • Problems: ● indexing points with different dimensions, ● narrow range queries: a signature is applied for efficient processing (Signature R-tree). • Efficient processing of the complex range query.

  15. Efficient processing of the complex range query • Complex range query = sequence of range queries: $qb_1, qb_2, \ldots, qb_n$. • The query can be processed in one run through the multi-dimensional data structure.
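
The difference between naive and one-run processing can be illustrated with a toy tree. This is a sketch under simplifying assumptions, not the authors' algorithm: the tree has a root and four leaves with bounding boxes, every node visit counts as one access, and the one-run variant carries the whole list of query boxes down the tree so that shared nodes are read once.

```python
class Node:
    """Toy tree node: a leaf holds points, an inner node holds children."""
    def __init__(self, points=(), children=()):
        self.points = list(points)
        self.children = list(children)
        corners = self.points + [c for ch in self.children
                                 for c in (ch.low, ch.high)]
        self.low = tuple(min(axis) for axis in zip(*corners))
        self.high = tuple(max(axis) for axis in zip(*corners))

def intersects(node, box):
    low, high = box
    return all(l <= nh and nl <= h
               for l, h, nl, nh in zip(low, high, node.low, node.high))

def inside(point, box):
    low, high = box
    return all(l <= x <= h for l, x, h in zip(low, point, high))

def search(node, boxes, stats):
    stats["accesses"] += 1  # one node read
    if not node.children:
        return {p for p in node.points if any(inside(p, b) for b in boxes)}
    found = set()
    for child in node.children:
        live = [b for b in boxes if intersects(child, b)]
        if live:  # descend only if some query box still applies
            found |= search(child, live, stats)
    return found

def naive(root, boxes):
    stats, found = {"accesses": 0}, set()
    for box in boxes:  # one full traversal per range query
        found |= search(root, [box], stats)
    return found, stats["accesses"]

def one_run(root, boxes):
    stats = {"accesses": 0}
    return search(root, boxes, stats), stats["accesses"]

leaves = [Node(points=[(0, 0), (1, 1)]), Node(points=[(5, 5), (6, 6)]),
          Node(points=[(10, 0), (11, 1)]), Node(points=[(20, 20), (21, 21)])]
root = Node(children=leaves)
boxes = [((0, 0), (1, 1)), ((5, 5), (6, 6)), ((10, 0), (10, 0))]

print(naive(root, boxes)[1])    # 6 node accesses
print(one_run(root, boxes)[1])  # 4 node accesses
```

Both variants return the same set of points; the saving comes from reading the shared upper levels only once.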

  16. Experimental results • Protein Sequence Database XML document: ● document size: 683 MB, ● number of elements: 21,305,818, ● number of attributes: 1,290,647, ● maximal length of a path: 7. • Index structures: BUB-forest, R*-forest, Signature BUB-tree and R*-tree; the trees index spaces of dimension n=7 and n=9.

  17. Experimental results • Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/authors/author='Smith, E.L.']

  18. Experimental results: regular path expression • Query: //uid='89071748'; 5 labelled paths were matched. • Naive processing of the complex range query: DAC: 368 • Efficient processing of the complex range query: DAC: 139 • Time: 0.03s, Improvement: 2.5x

  19. Preliminary experimental results: keyword-based searching • othello.xml: ● document size: 250 kB, ● maximal length of a path: 6, ● number of paths: 4,967, ● number of labelled paths: 13, ● number of terms: 8,744, ● PLT index: 27,127.

  20. Preliminary experimental results: keyword-based searching • Query: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE • Labelled path index: result size: 1, DAC: 3 • PLT index: result size: 1, DAC: 3 • Path index: result size: 1, DAC: 13 • Path index: result size: 1, DAC: 4

  21. Conclusion • $\Theta(m \times \log n)$, $\Theta(c \times m \times \log n)$ vs. $\Theta(m_1 \times m_2)$, where $m_1, m_2 \geq m$. • Efficient processing of a query with an AND condition; a signature is applied. • The multi-dimensional approach may also be applied to term searching (e.g. *comp*). • The update operation on XML documents. • Comparison with other approaches on test collections (INEX, XMark, …). http://www.cs.vsb.cz/arg

  22. References • M. Krátký, J. Pokorný, V. Snášel: Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data. Accepted at the International Workshop on Database Technologies for Handling XML information on the Web, DataX, Int'l Conference on EDBT, Heraklion, Crete, Greece, 2004. • M. Krátký, J. Pokorný, T. Skopal, V. Snášel: The Geometric Framework for Exact and Similarity Querying XML data. In Proceedings of EurAsia-ICT 2002, Shiraz, Iran, Springer Verlag, LNCS 2510. • M. Krátký, T. Skopal, V. Snášel: Multidimensional Term Indexing for Efficient Processing of Complex Queries. Kybernetika, Journal of the Academy of Sciences of the Czech Republic, 2004, accepted.

  23. Paths, labelled paths • Paths 0,1,2,'003-04212'; 0,5,6,'001-00863' and 0,9,10,'045-00012' belong to the labelled path books,book,id, … • Paths 0,1,4,'J.R.R. Tolkien'; 0,5,8,'J.R.R. Tolkien' and 0,9,12,'Joseph Heller' belong to the labelled path books,book,author.

  24. Complex queries • Query for values combined with XPath axis processing, e.g. books/book[author='Joseph Heller']/title: ● a combination of the techniques described above (query for a value, XPath axis processing). • Regular path expression queries, for example books//author: ● a sequence of range queries processes this query in the path index and the labelled path index: books,author; books,*,author; …; books,*,…,*,author.

  25. (B)UB-tree, R-tree • Figure labels: UB-tree, B-tree, Z-address.
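
A Z-address is obtained by interleaving the bits of a point's coordinates, so that multi-dimensional points can be kept in a one-dimensional ordered structure such as a B-tree. The sketch below shows the usual bit interleaving; which coordinate contributes the higher bit in each round is a convention, and the choice made here (first coordinate first) is an assumption.

```python
def z_address(point, bits):
    """Interleave the lowest `bits` bits of each coordinate, most significant first."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for coord in point:
            z = (z << 1) | ((coord >> bit) & 1)
    return z

print(z_address((3, 5), bits=3))  # 27, binary 011011

# Ordering a 2x2 grid by Z-address gives the familiar Z-shaped curve.
grid = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(sorted(grid, key=lambda p: z_address(p, bits=1)))
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```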

  26. Narrow range query: signature multi-dimensional data structures • Regions intersecting the query hyper-box are searched, $O(N_I \times \log_c n)$. • The ratio $c_R$ of relevant regions $N_R$ to intersecting regions $N_I$ becomes ≪ 1 as the dimension increases. • Signatures are applied to better filter out irrelevant regions (signature multi-dimensional data structures).
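
The filtering idea can be sketched as follows. This is only one simple way to build such signatures and is not taken from the talk: each (dimension, value) pair sets one bit chosen by a made-up hash, a region's signature is the bitwise OR over its points, and a region is read only if it contains every bit required by the fixed coordinates of a narrow query. The test can produce false positives but never false negatives.

```python
SIG_BITS = 64
MAX = 2**32 - 1

def bit_for(dim, value):
    # Hypothetical hash of a (dimension, value) pair to one signature bit.
    return 1 << ((dim * 31 + value) % SIG_BITS)

def point_signature(point):
    sig = 0
    for dim, value in enumerate(point):
        sig |= bit_for(dim, value)
    return sig

class Region:
    def __init__(self, points):
        self.points = points
        self.low = tuple(min(axis) for axis in zip(*points))
        self.high = tuple(max(axis) for axis in zip(*points))
        self.signature = 0
        for p in points:
            self.signature |= point_signature(p)

def intersects(region, low, high):
    return all(l <= rh and rl <= h
               for l, h, rl, rh in zip(low, high, region.low, region.high))

def may_contain(region, low, high):
    # Only coordinates fixed to a single value contribute to the query signature.
    query_sig = 0
    for dim, (l, h) in enumerate(zip(low, high)):
        if l == h:
            query_sig |= bit_for(dim, l)
    return (region.signature & query_sig) == query_sig

low, high = (2, 0, 0, 0, 12), (2, MAX, MAX, MAX, 12)
relevant = Region([(2, 0, 9, 12, 12), (2, 0, 1, 4, 7)])
irrelevant = Region([(1, 0, 1, 3, 5), (3, 0, 9, 12, 20)])

# Both regions intersect the query box, but only one passes the signature test.
print(intersects(relevant, low, high), may_contain(relevant, low, high))      # True True
print(intersects(irrelevant, low, high), may_contain(irrelevant, low, high))  # True False
```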

  27. Signature R-tree

  28. Experimental results • Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/authors/author='Smith, E.L.']

  29. Experimental results
