290 likes | 429 Views
On Efficient Part-match Querying of XML Data. Michal Krátký , michal.kratky@vsb.cz Marek Andrt , marek . andrt @vsb.cz Department of Computer Science VŠB–Technical University of Ostrava Czech Republic. DATESO 2004. Contents.
E N D
On Efficient Part-match Querying of XML Data Michal Krátký, michal.kratky@vsb.cz Marek Andrt, marek.andrt@vsb.cz Department of Computer Science VŠB–Technical University of Ostrava Czech Republic DATESO 2004
Contents • Introduction – XML, query languages, indexing XML data, part-match querying. • Multi-dimensional approach to indexing XML data. • Extension of the multi-dimensional approach for keyword-based querying. • Index data structures. • Preliminary experimental results. 2/21
Introduction • Native XML database. Set of documents is a database, DTD (XML Schema) is its database schema. • XML query languages (XPath, XQL, XQuery,…). • A common feature is a possibility to formulate paths in the XML graph (regular path expressions, XPath axes and so on). • Approaches based on: relational decomposition, trie, multi-dimensional, signatures and so on. 3/21
Part-match querying XML data • Some approaches for keyword or phrase based searching were published: XQuery-IR (WebDb’02), XKeyword (ICDE’03) and so on. • Knowledges from IR are applied. • Query languages contain operators for matching term occurrence. For example contains(), ~=. 4/21
Multi-dimensional approach to indexing XML data • A graph is a set of the paths. XML document is decomposed to paths and labelled paths. • labelled path: lp ∈ XLP: s0,s1,...,slPN • path: p ∈ XP: idU(u0),idU(u1),...,idU(ulLP),s idU(ui) – unique number of a node ui 5/21
Indexes • Term index – a storage of strings si of an XML document and their idT(si). • Labelled path index – a storage of points representing labelled paths. • Path index – a storage of points representing paths. 6/21
Examplelabelled path index, pathindex • books,book,id; books,book,title and books,book,author. Points (0,1,2); (0,1,4) and (0,1,6) are created using idT of element and attribute names, idLP = 0, 1 and 2. • For example, the path to value The Two Towers. The labelled path books,book,title with idLP 1 belongs. Vector (1,0,1,3,5) is created using idLP, unique numbers idU of elements, and idT of the term. 7/21
Query for values of elements and attributes • XPath query: books/book[author=“Joseph Heller”] • 3 phases of a query processing, finding: ●idT of terms from the term index, ●idLP 2 of labelled path books,book,author from the labelled path index: point query (0,1,6), ● points from the path index: range query (2,0,0,0,12)×(2,max,max,max,12). 8/21
Enhanced querying • XPath axes are processed by a range query or sequence of range queries. For example axis descendent: (0,idU(u0),…,idU(ul-1), idU(u),0,…, 0):(maxD,idU(u0),…,idU(ul-1), idU(u), maxD,…,maxD). • Regular path expression. For example //title[name=‘Chaudhri’] is processed by a complex range query. The query is possible to process in one run in the multi-dimensional data structure. 9/21
Comparison of approaches • Mainline approaches (XISS, XPath Accelerator) index single element (attribute). For example query /e1[e2=‘dog’] is processed by joining single results. • Result formatting. For example a result of the query //name is all matched subtree. • Operation Update and Insert are simple possible. 10/21
Keyword-based searching • Motivation: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE • Path-Labelled Path-Term (PLT) index is added. • The index indexes an 3-dimensional space: (idP, idLP, idT). • idP is added into the point representing path: (idP,idLP,idU0,idU1,…,idUl,s). 11/21
Index data structures • Paged and balanced multi-dimensional data structures – (B)UB-trees, variants of R-trees. • Problems: ● indexing points with different dimensions. ● narrow range query – the signature is applied for efficient processing – Signature R-tree. • Efficient processing of the complex range query. 14/21
Efficient processing the complex range query • Complex range query = sequence of range queries: qb1,qb2,…,qbn. • The query is possible to process in one run in the multi-dimensional data structure. 15/21
Experimental results • Protein Sequence Database XML document: ● the document size is 683MB, ● number of elements: 21,305,818, ● number of attributes:1,290,647. ● maximal length of path: 7. • BUB-forest, R*-forest, Signature BUB-tree and R*-tree. Index structures: trees indexing spaces of dimension n=7 and n=9. 16/21
Experimental results Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/ authors/author='Smith, E.L.'] 17/21
Experimental results Regular path expression • Query: //uid='89071748', 5 labelled paths were matched. • Naive processing the complex range query: DAC: 368 • Efficient processing the complex range query: DAC: 139 • Time: 0.03s, Improvement: 2.5x 18/21
Preliminary experimental results Keyword-based searching • othello.xml: ● document size is 250kB, ● maximal length of the path: 6 ● number of paths: 4,967 ● number of labelled paths: 13 ● number of terms: 8,744 ● PLT index: 27,127 19/21
Preliminary experimental results Keyword-based searching • Query:/PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE • Labelled path index: result size: 1, DAC: 3 • PLT index: result size: 1, DAC: 3 • Path index: result size: 1, DAC: 13 • Path index: result size: 1, DAC: 4 20/21
Conclusion • Θ(m × log n), Θ(c × m × log n) vs. Θ(m1 × m2), m1 ,m2 ≥ m. • Efficient processing a query with AND condition. Signature is applied. • Multi-dimensional approach for term searching may be applied (e.g. *comp*). • The update operation of XML documents. • Comparison with another approaches for test collections (INEX, XMark, …). http://www.cs.vsb.cz/arg 21/21
References • M. Krátký, J. Pokorný, V. Snášel: Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data. Accepted at International Workshop on Database Technologies for Handling XML information on the Web, DataX, Int'l Conference on EDBT, Heraklion - Crete, Greece, 2004. • M. Krátký, J. Pokorný, T. Skopal, V. Snášel: The Geometric Framework for Exact and Similarity Querying XML data. In Proceedings of EurAsia-ICT 2002. Shiraz, Iran, Springer Verlag, LNCS2510. • M. Krátký, T. Skopal, and V. Snášel: Multidimensional Term Indexing for Efficient Processing of Complex Queries. Kybernetika, Journal of the Academy of Sciences of the Czech Republic, 2004, accepted.
Paths 0,1,2,’003-04212’; 0,5,6,’001-00863’ and 0,9,10,’045-00012’ belong to the labelled path books,book,id, . . . Paths 0,1,4,’J.R.R. Tolkien’; 0,5,8,’J.R.R. Tolkien’ and 0,9,12,’Joseph Heller’ belong to the labelled path books,book,author. Paths, labelled paths
Query for values and XPath axis processing, e.g. books/book[author='Joseph Heller']/title ● Combination of above described techniques: query for value, XPath axis processing. Regular path expression queries for example: books//author ● A sequence of range queries processes this query in the path and labelled path index: books, author - books,*,author - books,*,…,*,author. Complex queries
UB-tree B-tree Z-address (B)UB-tree, R-tree
Narrow range query – signature multi-dimensional ds • Regions intersecting a query hyper box are searched, O(NI× logc n). • Ratio cR of relevant NR and intersect NI regions ≪ 1 with an increasing dimension. • Signatures are applied to better filtration of irrelevant regions – signature md structures.
Experimental results Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/ authors/author='Smith, E.L.']