Indexing Methods for Efficient XML Query Processing Jun-Ki Min KAIST http://islab.kaist.ac.kr/~jkmin/ 2002 SIGDB Tutorial
XML • eXtensible Markup Language • The de facto standard for data representation and exchange on the Web • XML Data • An instance of semistructured data • self-describing • irregularly structured
XML Data • Comprises hierarchically nested collections of elements • An element can contain • an atomic data value • a sequence of subelements • attributes composed of name-value pairs • ID-IDREF relationships • Tree or graph representation
XML Example
<libraryDB>
  <book editor="1">
    <title> title1 </title>
    <author> author1 </author>
    <chapter> … </chapter>
  </book>
  <paper>
    <title> title2</title>
    <author id="1"> author2 </author>
    <author> author3 </author>
    <section> … </section>
  </paper>
  …
</libraryDB>
[Figure: tree representation of the XML data with nodes numbered 1-10: libraryDB (1); book (2) with title (3), author (4), chapter (5); paper (6) with title (7), author (8, 9), section (10); the editor attribute of book refers to author 8. Callouts on the slide: Index Fabric, ToXin, APEX]
XML Query • XML Query Languages • XSLT, XML-QL, XPath, XQuery • use path expressions to traverse the irregularly structured data • e.g., /libraryDB/book/title or //title • Naive evaluation searches the whole XML data => inefficient • Structural Summary & Path Index • speed up evaluation by restricting the search to only the relevant portions of the XML data
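To make the cost of index-free evaluation concrete, here is a minimal sketch that runs the two example path expressions with Python's standard `xml.etree.ElementTree`. The document is a cleaned-up copy of the libraryDB example with the elided parts left out; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Example document modeled on the libraryDB slide; the parts the slide
# elides ("…") are simply omitted here.
DOC = """
<libraryDB>
  <book editor="1">
    <title>title1</title>
    <author>author1</author>
    <chapter/>
  </book>
  <paper>
    <title>title2</title>
    <author id="1">author2</author>
    <author>author3</author>
    <section/>
  </paper>
</libraryDB>
"""

root = ET.fromstring(DOC)  # root is the <libraryDB> element

# /libraryDB/book/title: follow the child axis step by step from the root.
book_titles = [t.text for t in root.findall("./book/title")]

# //title: descendant axis; without an index every element must be visited.
all_titles = [t.text for t in root.findall(".//title")]

print(book_titles)  # ['title1']
print(all_titles)   # ['title1', 'title2']
```

The `//title` query is the case the path indexes in this tutorial target: the answer is small, but the traversal touches the whole document.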
Schemas for XML • DTD, XML Schema • Specify the constraints on XML data • e.g., <!ELEMENT book (title, author+, chapter*)> • Schemas are not mandatory => XML data often lacks an external schema • Structural Summary • Summary of label paths • Path Index • Structural Summary + Extents
Schemas for XML • Applications • User Interface • XML Data Design, Editing • Query Formulation • Query Validation • Query Optimization • Path Index
Structural Summary • DTD Extraction • XTRACT • based on element information • Structural Summary • Representative Objects • based on path information
XTRACT • [Garofalakis, Gionis, Rastogi, Seshadri, Shim: SIGMOD 00] • Infers a concise and accurate DTD • Chooses a DTD from candidate DTDs • e.g., the sequences (a b), (b a) => candidates (a|b)* or (a b)|(b a) • Based on the Minimum Description Length (MDL) principle • ranks each candidate DTD by the number of bits required to describe the DTD itself plus the subelement sequences in terms of that DTD • (a|b)*: 6 (for DTD) + 3 + 3 = 12 • (a b)|(b a): 9 (for DTD) + 1 + 1 = 11
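The MDL arithmetic on this slide can be replayed in a few lines. This is only a sketch of the ranking step: the DTD cost is assumed to be one unit per non-space symbol (which reproduces the 6 and 9 above), and the per-sequence costs 3 and 1 are taken from the slide as given rather than derived. The function names are hypothetical and are not part of XTRACT.

```python
def dtd_cost(dtd: str) -> int:
    # Assumption: one unit per non-space symbol of the DTD expression.
    return sum(1 for ch in dtd if not ch.isspace())


def mdl_cost(dtd: str, sequence_costs: list) -> int:
    # Total description length = cost of the DTD + cost of each sequence
    # encoded in terms of that DTD.
    return dtd_cost(dtd) + sum(sequence_costs)


# Candidate DTDs for the sequences (a b) and (b a); the per-sequence
# encoding costs are the figures quoted on the slide.
candidates = {
    "(a|b)*": [3, 3],
    "(a b)|(b a)": [1, 1],
}

costs = {dtd: mdl_cost(dtd, seq) for dtd, seq in candidates.items()}
best = min(costs, key=costs.get)

print(costs)  # {'(a|b)*': 12, '(a b)|(b a)': 11}
print(best)   # (a b)|(b a)
```

Under MDL the candidate with the smaller total is preferred, so the more specific DTD wins here even though its own description is longer.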
Representative Objects (RO) • [Nestorov, Ullman, Wiener, Chawathe: ICDE 97] • Provide a concise representation of the inherent schema of semistructured hierarchical data • Full-RO • Describes all simple paths • k-RO • Guarantees that its paths of length k+1 exist in the data • 1-RO • The simplest and a very compact representation
Representative Objects (RO) • [Figure: the XML data graph, the graph representation of its 1-RO, and the graph representation of its 2-RO (= Full-RO); labels shown: libraryDB, book, paper, title, author, editor, chapter, section, name]
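A rough way to picture the 1-RO is as a graph with one node per label and an edge between two labels whenever they occur consecutively in the data. The sketch below builds that label-to-label relation for the libraryDB example. It is a simplified reading of the idea, not the algorithm of Nestorov et al.; the edge list (with a virtual root 0 and the editor reference modeled as an edge to node 8) is an assumption based on the example's node numbering.

```python
from collections import defaultdict

# (source id, edge label, target id) for the libraryDB example.
# Node 0 is an assumed virtual root; (2, "editor", 8) models the IDREF.
EDGES = [
    (0, "libraryDB", 1),
    (1, "book", 2), (1, "paper", 6),
    (2, "title", 3), (2, "author", 4), (2, "chapter", 5), (2, "editor", 8),
    (6, "title", 7), (6, "author", 8), (6, "author", 9), (6, "section", 10),
]


def one_ro(edges):
    """Map each label to the sorted labels that directly follow it in the data."""
    outgoing = defaultdict(list)
    for src, label, _dst in edges:
        outgoing[src].append(label)
    follows = defaultdict(set)
    for _src, label, dst in edges:
        follows[label].update(outgoing[dst])
    return {label: sorted(nxt) for label, nxt in follows.items()}


ro = one_ro(EDGES)
print(ro["libraryDB"])  # ['book', 'paper']
print(ro["book"])       # ['author', 'chapter', 'editor', 'title']
print(ro["paper"])      # ['author', 'section', 'title']
```

Every pair of consecutive labels in this summary exists in the data, but longer paths read off it need not: that is the price of the compactness.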
Path Index • Access Support Relations • Deterministic: Strong DataGuide, Index Fabric, ToXin, APEX • Non-Deterministic: 1-Index, A(k)-Index, F&B Index
Access Support Relations • [Kemper, Moerkotte: IS 92] • Originated in OODBMSs • e.g., select Name from Mercedes.Manufactures.Composition.Division • Support joins along arbitrary reference chains • Generalization of the Join Index [Valduriez 87] • Based on the paths in the schema • Materialize access paths of arbitrary length • Support only predefined subsets of paths
DataGuides • [Goldman, Widom: VLDB 97] • An implementation of the Full-RO • Summary of label paths from the root (= simple paths) • Conciseness: describes every unique simple path exactly once, regardless of the number of times it appears • Accuracy: contains no label path that does not appear in the data • Convenience: can be stored and accessed using the same techniques available for processing semistructured data
DataGuides • The construction algorithm emulates the conversion of a non-deterministic finite automaton (NFA) into a deterministic finite automaton (DFA) • Intuitively, a simple path is represented as a node in the DataGuide • One XML data graph may have multiple DataGuides • [Figure: an XML data graph with edge labels A, B, and C, and several different DataGuides for it]
Strong DataGuide • If two simple paths reach the same set of nodes, they are represented by a single node • Linear time and linear space for tree-structured data • Exponential time and exponential space (in the worst case) for graph-structured data • [Figure: two source graphs over edge labels A, B, and C, each shown next to its Strong DataGuide, whose nodes are annotated with target sets such as {2,4} and {3,5}]
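The NFA-to-DFA analogy can be sketched as a subset construction in which each DataGuide node is identified by its target set, the set of data nodes reached by a label path. The code below is a minimal illustration, not the algorithm of Goldman and Widom as published; the edge list encodes the libraryDB example with an assumed virtual root 0 and the editor reference as an edge to node 8.

```python
from collections import defaultdict, deque

# (source id, edge label, target id); node 0 is an assumed virtual root.
EDGES = [
    (0, "libraryDB", 1),
    (1, "book", 2), (1, "paper", 6),
    (2, "title", 3), (2, "author", 4), (2, "chapter", 5), (2, "editor", 8),
    (6, "title", 7), (6, "author", 8), (6, "author", 9), (6, "section", 10),
]


def strong_dataguide(edges, root):
    """Subset construction: one guide node per distinct target set."""
    outgoing = defaultdict(list)
    for src, label, dst in edges:
        outgoing[src].append((label, dst))
    start = frozenset([root])
    guide = {}  # target set -> {label: target set}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in guide:
            continue  # paths with equal target sets share one guide node
        by_label = defaultdict(set)
        for node in state:
            for label, dst in outgoing[node]:
                by_label[label].add(dst)
        guide[state] = {l: frozenset(t) for l, t in by_label.items()}
        queue.extend(guide[state].values())
    return start, guide


def target_set(guide, start, path):
    """Follow a label path in the guide; return the data nodes it reaches."""
    state = start
    for label in path:
        state = guide[state].get(label)
        if state is None:
            return frozenset()
    return state


start, guide = strong_dataguide(EDGES, 0)
print(sorted(target_set(guide, start, ["libraryDB", "paper", "author"])))  # [8, 9]
print(sorted(target_set(guide, start, ["libraryDB", "book", "editor"])))   # [8]
```

Node 8 appears in two different target sets, which is why extents of a Strong DataGuide may overlap on graph-structured data.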
1/2/T-Index • [Milo and Suciu: ICDT 99] • 1-Index • Summarizes all label paths starting from the root • Supports queries of the form q = P x, where P = /l1/l2/…/ln • Non-deterministic • Based on backward bisimulation, which originated in graph verification • Extents are disjoint • More compact than Strong DataGuides
1-Index • Equivalence relation (≡): v ≡ u iff L_v = L_u, where L_x = {w | w is a simple path from the root to x} • Index nodes: the collection of all equivalence classes • Exponential construction cost • Backward bisimulation (≈b) • If x ≈b y and x is the root, then y is the root • Conversely, if x ≈b y and y is the root, then x is the root • If x ≈b y and (x', l, x) is an edge, then there exists an edge (y', l, y) such that x' ≈b y' • Conversely, if x ≈b y and (y', l, y) is an edge, then there exists an edge (x', l, x) such that x' ≈b y'
≡ vs ≈b • X ≡ Y since L_X = L_Y = {a.b.d, a.c.d} • X ≉b Y • v ≈b u ⇒ v ≡ u (the converse does not hold) • O(m log m) construction cost [Paige and Tarjan 87] • [Figure: two graphs over edge labels a, b, c, and d, one ending in node X and the other in node Y]
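The gap between ≡ and ≈b can be checked on a small graph. The sketch below computes backward bisimulation by naive partition refinement (not the O(m log m) Paige-Tarjan algorithm) and compares it with the label-path sets. The graph is an assumed reconstruction of the slide's example: X has a single parent reached by both b and c, while Y has two parents, one reached by b and one by c. Node names other than X and Y are hypothetical.

```python
from collections import defaultdict

# (source, edge label, target); an assumed reconstruction of the example.
EDGES = [
    ("r", "a", "n1"), ("n1", "b", "u"), ("n1", "c", "u"), ("u", "d", "X"),
    ("r", "a", "n2"), ("n2", "b", "v1"), ("n2", "c", "v2"),
    ("v1", "d", "Y"), ("v2", "d", "Y"),
]
NODES = sorted({n for s, _, t in EDGES for n in (s, t)})


def incoming_edges(edges):
    incoming = defaultdict(list)
    for src, label, dst in edges:
        incoming[dst].append((label, src))
    return incoming


def label_paths(edges, root, target):
    """L_x: all label paths from the root to target (acyclic graphs only)."""
    incoming = incoming_edges(edges)

    def paths(v):
        if v == root:
            return {()}
        return {p + (l,) for l, src in incoming[v] for p in paths(src)}

    return paths(target)


def backward_bisimulation(nodes, edges, root):
    """Naive refinement: split blocks until parents' blocks agree per label."""
    incoming = incoming_edges(edges)
    block = {v: (v == root) for v in nodes}  # the root only matches the root
    while True:
        sig = {v: (block[v], frozenset((l, block[p]) for l, p in incoming[v]))
               for v in nodes}
        ids = {}
        new_block = {v: ids.setdefault(sig[v], len(ids)) for v in nodes}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block


blocks = backward_bisimulation(NODES, EDGES, "r")
same_language = label_paths(EDGES, "r", "X") == label_paths(EDGES, "r", "Y")
print(same_language)               # True
print(blocks["X"] == blocks["Y"])  # False
```

X and Y share the same label paths, yet they land in different bisimulation blocks because Y has no single parent reachable by both b and c.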
1-Index vs Strong DataGuide • In tree-structured data, the Strong DataGuide and the 1-Index coincide • [Figure: the libraryDB XML data (nodes 1-10) next to its Strong DataGuide, where the editor node has extent {8} and the paper author node has extent {8,9}, and its 1-Index, where nodes 8 and 9 fall into separate classes]
2/T-Index • 2-Index • Supports queries of the form x1 P x2 • e.g., //title • Equivalence relation (≡): (v, u) ≡ (v', u') iff L_(v,u) = L_(v',u'), where L_(x,y) = {w | w is a label path from x to y} • Summary of path information between two arbitrary nodes • T-Index • Generalization of the 1-Index and 2-Index: (v1,…,vn) ≡ (u1,…,un) iff L_(v1,…,vn) = L_(u1,…,un) • Conceptually similar to Access Support Relations • Supports only predefined paths
Index Fabric • [Cooper, Sample, Franklin, Hjaltason, Shadmon: VLDB 01] • Tree-structured data • Conceptually similar to the Strong DataGuide • Layered structure • Uses a Patricia trie to index a large number of search keys • The simple path of an element that has a data value is encoded as a special character sequence • Each key is the combination of the encoded sequence and the data value
Index Fabric • Keeps information only for elements that have data values • Patricia trie: lossy compression • [Figure: the XML data and a Patricia trie over keys such as LBTtitle1 and LBAauthor1, with path prefixes such as “L” and “LBC”]
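The key encoding can be illustrated without the trie itself. In the sketch below each label gets a one-character designator (L, B, T, A as in the keys above; P for paper is taken from the figure), and a key is the designator string followed by the data value. A sorted list with binary search stands in for the layered Patricia trie, which is a deliberate simplification, and the helper names are hypothetical.

```python
import bisect

# One-character designators per label, following the keys on the slide.
DESIGNATOR = {"libraryDB": "L", "book": "B", "paper": "P",
              "title": "T", "author": "A"}


def encode(path, value):
    """Key = designator of every label on the path + the data value."""
    return "".join(DESIGNATOR[label] for label in path) + value


# Only elements that carry a data value are indexed.
RECORDS = [
    (["libraryDB", "book", "title"], "title1"),
    (["libraryDB", "book", "author"], "author1"),
    (["libraryDB", "paper", "title"], "title2"),
    (["libraryDB", "paper", "author"], "author2"),
    (["libraryDB", "paper", "author"], "author3"),
]

# Stand-in for the Patricia trie: a sorted key list searched by prefix.
KEYS = sorted(encode(path, value) for path, value in RECORDS)


def prefix_search(keys, prefix):
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")
    return keys[lo:hi]


print(encode(["libraryDB", "book", "title"], "title1"))  # LBTtitle1
print(prefix_search(KEYS, "LPA"))  # ['LPAauthor2', 'LPAauthor3']
```

A path query with a value predicate then becomes a single string lookup, which is the property the Index Fabric exploits.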
ToXin • [Rizzolo, Mendelzon: WebDB 01] • Tree-structured data • Conceptually similar to the Strong DataGuide (not the minimal DataGuide) • Supports both forward and backward navigation • Path Tree (= Strong DataGuide) • Each node of the Path Tree has an Index Table or a Value Table • Index Table (IT): parent-child relationships • Value Table (VT): owner-value relationships
ToXin • Since ToXin keeps parent-child relationships, it supports path expressions with value predicates • e.g., /libraryDB/book[author = author1]
[Figure: the XML data and its Path Tree with nodes LibraryDB:IT, book:IT, paper:IT, title:VT, author:VT, chapter, section]
Index Tables
  LibraryDB (parent, child): (null, 1)
  LibraryDB.book (parent, child): (1, 2)
  LibraryDB.paper (parent, child): (1, 6)
Value Tables
  LibraryDB.book.author (parent, value): author1 …
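The predicate query above can be answered from the tables alone. The sketch below stores the Index Tables as (parent, child) pairs and the Value Tables as (parent, value) pairs and intersects them. The parent id 2 in the author Value Table is an assumption (the slide's row is truncated; 2 is the book node in the example), and the function names are hypothetical rather than ToXin's API.

```python
# Index Tables: path -> list of (parent id, child id), as on the slide.
INDEX_TABLES = {
    "LibraryDB": [(None, 1)],
    "LibraryDB.book": [(1, 2)],
    "LibraryDB.paper": [(1, 6)],
}

# Value Tables: path -> list of (owner id, value).
# Assumption: author1 is owned by book node 2, as in the XML example.
VALUE_TABLES = {
    "LibraryDB.book.author": [(2, "author1")],
}


def select_with_predicate(path, child_label, value):
    """Evaluate path[child_label = value]; return ids of matching nodes."""
    candidates = {child for _parent, child in INDEX_TABLES[path]}
    owners = {owner for owner, v in VALUE_TABLES[f"{path}.{child_label}"]
              if v == value}
    return sorted(candidates & owners)


def parent_of(path, node_id):
    """Backward navigation: find the parent of a node via its Index Table."""
    for parent, child in INDEX_TABLES[path]:
        if child == node_id:
            return parent
    return None


print(select_with_predicate("LibraryDB.book", "author", "author1"))  # [2]
print(parent_of("LibraryDB.book", 2))                                # 1
```

Because each table keeps the parent id, the same structures serve both the forward step (book to author) and the backward step (author value back to its book).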
A(k)-Index • [Kaushik, Shenoy, Bohannon, Gudes: ICDE 02] • The Strong DataGuide and the 1-Index record all simple paths • Larger index size => larger search space • Approximation of the 1-Index • Non-deterministic • Utilizes local similarity (= degree k) • Reduces the size of the index graph
A(k)-Index • k-bisimulation (≈k) • For any two nodes v and u, v ≈0 u iff u and v have the same label • v ≈k u iff v ≈k-1 u and for every parent v' of v there is a parent u' of u such that v' ≈k-1 u', and vice versa • [Figure: an XML data graph with labels A to E and its A(0)-Index, A(1)-Index, and A(2)-Index; for this data the A(2)-Index equals the 1-Index]
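The definition of ≈k translates directly into k rounds of partition refinement, starting from the partition by label. The sketch below does this on a small node-labeled graph; the graph is a hypothetical example chosen so that the A(0)-, A(1)-, and A(2)-partitions all differ, and it is not the one drawn on the slide.

```python
from collections import defaultdict

# Hypothetical node-labeled data graph: A has children B and C,
# each of which has a D child, each of which has an E child.
LABELS = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
PARENTS = {2: [1], 3: [1], 4: [2], 5: [3], 6: [4], 7: [5]}


def ak_partition(labels, parents, k):
    """Equivalence classes of k-bisimulation, as sorted lists of node ids."""
    block = dict(labels)  # k = 0: nodes are equivalent iff labels match
    for _ in range(k):
        # Nodes stay together iff they were together in the previous round
        # and their parents cover the same set of previous-round blocks.
        sig = {v: (block[v], frozenset(block[p] for p in parents.get(v, ())))
               for v in labels}
        ids = {}
        block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
    classes = defaultdict(list)
    for v, b in block.items():
        classes[b].append(v)
    return sorted(sorted(c) for c in classes.values())


print(ak_partition(LABELS, PARENTS, 0))  # [[1], [2], [3], [4, 5], [6, 7]]
print(ak_partition(LABELS, PARENTS, 1))  # [[1], [2], [3], [4], [5], [6, 7]]
print(ak_partition(LABELS, PARENTS, 2))  # [[1], [2], [3], [4], [5], [6], [7]]
```

Each round looks one step further up the incoming paths, so a larger k gives a finer and larger index; on this graph the partition stops changing at k = 2.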
A(k)-Index • Building cost = O(km) • In general, the A(k)-Index reaches the 1-Index for some k < log m • Query Processing • Label path expressions of length ≤ k+1: precise • Label path expressions of length > k+1: safe, but the result may include false positives • Validation is needed => requires scanning the data
APEX: Adaptive Path indEx for XML Data • [Chung, Min, Shim: SIGMOD 02] • The Strong DataGuide and the 1-Index keep all simple paths • Users often issue partial matching path queries • e.g., //book/title • Exhaustive navigation of the index structure for partial matching path queries may degrade performance
APEX • Deterministic • Approximation of DataGuides • Efficient processing of partial matching path queries • Workload-Aware • Self-Tuning Strategies [Chaudhuri et al. 00] • Utilizes the query workload • Builds APEX from both the XML data and frequently used paths • Sequential pattern mining [Agrawal and Srikant 95]
APEX • frequently used paths = {book.title} • Hash Tree • keeps frequently used paths • prevents exhaustive search • Graph Structure • structural summary + extents
[Figure: the XML data and the APEX graph with nodes &0 to &9 over the labels libraryDB, book, paper, title, author, editor, chapter, section]
Extents: &0: {<null,0>} &1: {<0,1>} &2: {<1,2>} &3: {<1,6>} &4: {<2,4>, <6,8>, <6,9>} &5: {<2,5>} &6: {<6,10>} &7: {<2,8>} &8: {<2,3>} &9: {<6,7>}
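The role of the hash tree can be sketched with plain dictionaries. The extents below are copied from the slide; the mapping from labels to APEX nodes is an assumption inferred from those extents (for example, &8 holds the title under book and &9 the remaining titles), and `lookup` is a hypothetical helper that only covers single labels and the stored frequent path.

```python
# Extents from the slide: APEX node -> list of <parent id, child id> edges.
EXTENTS = {
    "&0": [(None, 0)], "&1": [(0, 1)], "&2": [(1, 2)], "&3": [(1, 6)],
    "&4": [(2, 4), (6, 8), (6, 9)], "&5": [(2, 5)], "&6": [(6, 10)],
    "&7": [(2, 8)], "&8": [(2, 3)], "&9": [(6, 7)],
}

# Assumed hash tree, keyed by the last label of a path. "title" is split
# because book.title is a frequently used path; other titles go to
# the "remainder" entry.
HASH_TREE = {
    "libraryDB": "&1", "book": "&2", "paper": "&3", "author": "&4",
    "chapter": "&5", "section": "&6", "editor": "&7",
    "title": {"book": "&8", "remainder": "&9"},
}


def lookup(label_path):
    """Answer //label_path for one label or a stored two-label path."""
    labels = label_path.split(".")
    entry = HASH_TREE[labels[-1]]  # hash on the last label first
    if isinstance(entry, str):
        if len(labels) > 1:
            raise LookupError("path not stored; joins are not sketched here")
        nodes = [entry]
    elif len(labels) == 1:
        nodes = list(entry.values())  # every class ending in this label
    elif len(labels) == 2 and labels[0] in entry:
        nodes = [entry[labels[0]]]
    else:
        raise LookupError("path not stored; joins are not sketched here")
    return sorted(child for n in nodes for _parent, child in EXTENTS[n])


print(lookup("book.title"))  # [3]
print(lookup("title"))       # [3, 7]
print(lookup("author"))      # [4, 8, 9]
```

The frequent query //book/title is answered by one hash lookup and one extent read, without navigating the graph from the root.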
F&B Index • [Kaushik, Bohannon, Naughton, Korth: SIGMOD 02] • Supports twig (branching) path expressions • e.g., /A/B[C] • Basic Idea • For every edge e labelled l from v to u, add an inverse edge e^-1 with label l^-1 from u to v • Then compute the 1-Index on this modified graph • The resulting index can be very large • Apply some heuristics • Exploiting local similarity: k-bisimulation • [Figure: the twig query /A/B[C] and a path labelled A, B, C^-1]
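The basic idea can be tried on a tiny graph: add an inverse edge for every edge, then run the same backward-bisimulation refinement used for the 1-Index. The sketch below uses naive refinement and a hypothetical graph in which only one of two B nodes has a C child; node names and the "^-1" label suffix are illustrative choices.

```python
from collections import defaultdict

# Hypothetical data: root r, one A node, two B nodes, and a C node
# under b1 only. The twig query /A/B[C] should select b1 but not b2.
EDGES = [("r", "A", "a"), ("a", "B", "b1"), ("a", "B", "b2"), ("b1", "C", "c")]
NODES = ["r", "a", "b1", "b2", "c"]


def with_inverse_edges(edges):
    """For every edge v -l-> u add the inverse edge u -l^-1-> v."""
    return edges + [(dst, label + "^-1", src) for src, label, dst in edges]


def backward_bisimulation(nodes, edges, root):
    """Naive refinement: split blocks until parents' blocks agree per label."""
    incoming = defaultdict(list)
    for src, label, dst in edges:
        incoming[dst].append((label, src))
    block = {v: (v == root) for v in nodes}  # the root only matches the root
    while True:
        sig = {v: (block[v], frozenset((l, block[p]) for l, p in incoming[v]))
               for v in nodes}
        ids = {}
        new_block = {v: ids.setdefault(sig[v], len(ids)) for v in nodes}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block


one_index = backward_bisimulation(NODES, EDGES, "r")
fb_index = backward_bisimulation(NODES, with_inverse_edges(EDGES), "r")

print(one_index["b1"] == one_index["b2"])  # True: same incoming path /A/B
print(fb_index["b1"] == fb_index["b2"])    # False: only b1 has a C child
```

The inverse edges make outgoing structure visible to a backward refinement, which is what lets the index distinguish nodes by their branches and also what makes it large.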
Discussion • Path Index • Improves query performance by restricting the search space • Can be applied to various applications • Selectivity Estimation • QBE (Query By Example) • Future Work • Support for twig queries • Query Optimization • cost formulas for path indexes
Thank You! • Any Questions? • http://islab.kaist.ac.kr/~jkmin • jkmin@islab.kaist.ac.kr
References
• C. Chung, J. Min and K. Shim, “APEX: An Adaptive Path Index for XML Data,” SIGMOD 02
• B. Cooper, N. Sample, M. Franklin, G. Hjaltason and M. Shadmon, “A Fast Index for Semistructured Data,” VLDB 01
• M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim, “XTRACT: A System for Extracting Document Type Descriptors from XML Documents,” SIGMOD 00
• R. Goldman and J. Widom, “DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases,” VLDB 97
• R. Kaushik, P. Bohannon, J. Naughton and H. Korth, “Covering Indexes for Branching Path Queries,” SIGMOD 02
• R. Kaushik, P. Shenoy, P. Bohannon and E. Gudes, “Exploiting Local Similarity for Indexing Paths in Graph-Structured Data,” ICDE 02
• A. Kemper and G. Moerkotte, “Access Support Relations: An Indexing Method for Object Bases,” Information Systems 92
• T. Milo and D. Suciu, “Index Structures for Path Expressions,” ICDT 99
• S. Nestorov, J. Ullman, J. Wiener and S. Chawathe, “Representative Objects: Concise Representations of Semistructured, Hierarchical Data,” ICDE 97
• F. Rizzolo and A. Mendelzon, “Indexing XML Data with ToXin,” WebDB 01
• R. Paige and R. Tarjan, “Three Partition Refinement Algorithms,” SIAM Journal on Computing 87
• P. Valduriez, “Join Indices,” TODS 87