430 likes | 867 Views
XML Data Management. Ning Zhang University of Waterloo. What is XML?. XML documents have elements and attributes Elements (indicated by begin & end tags) can be nested but cannot interleave each other can have arbitrary number of sub-elements can have free text as values
E N D
XML Data Management Ning Zhang University of Waterloo
What is XML? • XML documents have elements and attributes • Elements (indicated by begin & end tags) • can be nested but cannot interleave each other • can have arbitrary number of sub-elements • can have free text as values <chap title=“Introduction To XML”> some free text <sect title= “What is XML?”> … </sect> <secttitle = “Elements”> … </sect> <secttitle = “Why XML?”> … </sect> … possibly more free text </chap> end element attribute begin element Elements w/ same name can be nested
Why XML? • Database Side: XML is a new way to organize data • Relational databases organize data in tables • XML documents organize data in ordered trees • Document Side: XML is a semantic markup language • HTML focuses on presentation • XML focuses on semantics/structure in the data chap sect sect sect sect sect sect <html> <h1> Chapter 1… </h1> some free text <h2> Section 1… </h2> some more free text <h3> Section 1.1 </h3> </html>
Data Management: -- Relational vs. XML • Relational data are well organized – fully structured (more strict): • E-R modeling to model the data structures in the application; • E-R diagram is converted to relational tables and integrity constraints (relational schemas) • XML data are semi-structured (more flexible): • Schemas may be unfixed, or unknown (flexible – anyone can author a document); • Suitable for data integration (data on the web, data exchange between different enterprises).
More about Relational vs. XML • XML is not meant to replace relational database systems • RDBMSs are well suited to OLTP applications (e.g., electronic banking) which has 1000+ small transactions per minute. • XML is suitable data exchange over heterogeneous data sources (e.g., Web services) that allow them to “talk”.
When should we use XML? (1) • Document representation language • XML can be transformed to other format (e.g., by XSLT) • XML HTML • XML LaTeX, bibTeX • XML PDF • DocBook (standard schema for authoring document/book)
When should we use XML? (2) • Data integration and exchange language • Web services (SOAP, WSDL, UDDI) • Amazon.com, eBay, Microsoft MapPoint, … • Domain specific data exchange schemas (>1000) • legal document exchange language • business information exchange … • RSS XML news feed • CNN, slashdot, blogs, …
When should we use XML? (3) • Any data having hierarchical structure • Email • Header – from, to, cc, bcc… • Body – my message, replied email … • Network log file • IP address, time, request type, error code • Advances of translating to XML • Exploit high-level declarative XML query languages
XML Databases • Advantages: • Manage large volume of XML data • Provide high-level declarative language • Efficiently evaluate complex queries • XML Data Management Issues: • XML Data Model • XML Query Languages • XML Query Processing and Optimization
XML Data Model • Hierarchical data model • An XML document is an ordered tree; • Nodes in the tree are labeled with element names. • Element nesting relationship corresponds to parent-child relationship; chap @title sect … some free text Introduction to XML @title sect … @title What is XML? …
XML Schema Languages • Schema languages defines the structure: • Document Type Definition (DTD) • Context-free grammar • Structurally richer than relational schema definition language because of recursion. • XML Schema • Also context-free • Richer than DTD because of data types definition (integer, date, sequence).
XML Query Languages • XPath • 13 axes (navigation directions in the tree) • child (/), descendant (//), following-sibling, following… • NameTest, predicates • E.g, doc(“bib.xml”)//book[title=“Harry Potter”]/ISBN • XQuery (superset of XPath) • FLWOR expression for $x in doc(“bib.xml”)//book[title = “Harry Potter”]/ISBN, $y in doc(“imdb.xml”)//movie where $y//novel/ISBN = $x return $y//title
Important Problems in XML Data Management What follows is not covered in COSC 3480!! • How to store XML data? • How to efficiently evaluate XPath/XQuery languages? • Efficient physical operators • Query optimization • How to support XML update languages? • How to support transaction management? • Recovery management?
Agenda • XML Storage • XML Path Query Processing • XML Optimization
XML Storage • Extended Relational Storage • Convert XML documents to relational tables • Native Storage • Treat XML elements as first-class citizens • Hybrid of Relational and Native Storage • XML documents can be stored in columns of relational tables (XML typed column)
Extended Relational Storage • Edge-based Storage Scheme (Florescu and Kossman ‘99) • Each node has an ID • Each tuple in the edge table consists of: (parentID, childID, type of data, reference to data) • Pro: easy to convert XML to relational tables • Con: impossible to answer path queries such as //a//b using SQL (needs transitive closure operator)
Extended Relational Storage • Path-based Storage Scheme XRel (Yoshikawa et al. ‘01) • Each node corresponds to a tuple in the table • Each tuple keeps a rooted path to the node (e.g., /article/chap/sec/sec/@title) • Pro: also easy to convert XML to tables • Con: answering path queries, such as //a//b, needs expensive string pattern matching
Extended Relational Storage • Node-based Storage Scheme: Niagara, TIMBER etc. (Zhang et al. ’01) • Each node is encoded with a “begin” and “end” integers. • Begin corresponds to the order of in-order traversal of tree; end corresponds to the order in post-order traversal. • Pro: checking parent-child/ancestor-descendant relationships is efficient (constant time using begin and end) • Con: inefficient for updating XML
Native Storage • Subtree partition-based scheme: Natix (Kanne and Moerkotte ’00) • A large XML tree is partitioned into small subtrees, each of which can be fit into one disk page • Introducing aproxy and aggregate nodes to connect different subtrees • Pro: easy to update and traversal • Con: complex update algorithm; frequent deletion/addition may deteriorate page usage ratio
Native Storage • Binary tree-based scheme: Arb (Koch ’03) • Convert a tree with arbitrary number of children to a binary tree (first child translates to left child; next sibling translate to right child) • Tree nodes are stored in document order • Each node has 2 bits indicating whether it has a left & right child • Pro: easy to do depth-first search (DFS) traversal • Con: inefficient to do next_sibling navigation and hard to update
Native Storage • String-based scheme: NoK (Zhang ’04) • Convert a tree to a parenthesized string • E.g., a having b and c as children is converted to ab)c)), by DFS of the tree and ‘)’ representing “end-of-subtree” • Tree can be reconstructed by the string • A long string can be cut into substrings and fit them into disk pages • Page header can contains simple statistics to expedite next_sibling navigation • Pro: particularly optimized for DFS navigational evaluation plan • Con: inefficient to do for breadth-first search (BFS)
Hybrid of Relational and Native Storage • All major commercial RDBMS vendaors (IBM, Oracle, Microsoft and Sybase) support XML type in their RDBMS • A table can have a column whose type is “XML” • When inserting a tuple in the table, the XML field could be an XML document • XML documents are stored natively
Hybrid of Relational and Native Storage • IBM DB2 UDB • System RX – XML storage is similar to Natix • Microsoft SQL Server • Uses BLOB (binary large object) to represent XML documents • Oracle • Can use multiple format: • CLOB (character large object) • Serialized object • Shredded relational table
Agenda • XML Storage • XML Path Query Processing • XML Optimization
XML Path Processing • Extended Relational Approach • Translate XML queries to SQL statements • Native Approach (may be based on extended relational storage) • Join-based approach • Navigational approach • Hybrid approach
Extended Relational Query Processing • Regular expression based approach: XRel (Yoshikawa et al. ‘01) • Linear path expression (without branches) are translated to regular expressions on strings (rooted paths) • Use the “like” predicate in SQL to evaluate regular expressions • Pro: easy to implement • Con: cannot answer branching path queries
Extended Relational Query Processing • Dynamic Interval based approach: DI (DeHaan et al. ‘03) • Use the node labeling (begin,end) interval storage scheme • Dynamically calculate (begin,end) intervals for resulting nodes give a path/FLWOR expression • Pro: can handle all types of queries including FLWOR expression • Con: inefficient for answering complex path queries
Native Path Query Processing • Merge-Join based approach: Multi-predicate Merge Join (MPMGJN) algorithm (Zhang et al. ’01) • Modify the merge join algorithm to reduce unnecessary comparisons • Keep to position p of the last successful comparisons in the right input stream • The next item from the left input stream starts scanning from position p.
Native Path Query Processing • Stack-based Structural Join (Wu et al. ’02) • Improve the MPMGJN algorithm • Do not look back but keep all ancestors in a stack • When comparing the new item, just compare it with the top of the stack
Native Path Query Processing • Holistic Twig Join (Bruno et al. ’02) • Improve the stack-based structure algorithm • Use one join algorithm for the whole path expression instead of one join for one step • Reduce the overhead to produce and store intermediate results
Native Path Query Processing • Natix (Brantner et al. ’05) • Translate each step into a logical navigational operator Unnest-Map • Each unnest-map operator is translated into a physical operator that performs tree traversal on the Natix storage • Physical optimization can be performed on the physical navigational operators to reduce cross-cluster I/O.
Native Path Query Processing • IBM DB2 XNav (Josifovski et al. ’04) • XML path expressions are translated into automata • The automaton is constructed dynamically while traversing the XML tree in DFS • Physical I/O can be optimized by navigating to next_sibling without traversing the whole subtree
Native Path Query Processing • Tree automata (Koch ’03) • The tree automaton needs two passes of tree • The first traversal is a bottom-up deterministic tree automaton to determine which states are reachable • The second traversal is a top-down tree automaton to prune the reachable states and compute predicates.
Hybrid Processing • BlossomTree (Zhang ’04, Zhang’05) • Navigational approach is efficient for parent-child navigation • Join-based approach is efficient for ancestor-descendant • BlossomTree approach identifies sub-expressions, Next-of-Kin (NoK), that are efficient for navigational approach. • Use navigational approach for NoK subexpressions and use structural joins to join intermediate results
XML Indexing • Structural Index • Clustering tree nodes by their structural similarity (e.g., bisimilarity and F&B bisimilarity) • Index is a graph, in which each vertex is an equivalence class of similar XML tree nodes • Path query evaluation amounts to navigational evaluation on the graph
Agenda • XML Storage • XML Path Query Processing • XML Optimization
Overview of Cost-based Optimization • Query Optimization depends on: • How much knowledge about the data we have? • How intelligent we can make use of the knowledge (within a time constraint)? • The cost of a plan is heavily dependent on: • The cost model of each operator • The cardinality/selectivity of each operator
Cardinality Estimation • Full path summarization: DataGuide (Goldman ’97) and PathTree (Aboulnaga ’01) • Summarize all distinct paths in XML documents in a graph • Cardinality information is annotated on graph vertices
Cardinality Estimation • Partial path summarization: Markov Table (Aboulnaga ’01) • Keep sub-paths and cardinality information in a table • Cardinality for longer paths are calculated using partial paths. • Can use additional compression methods to accommodate Internet scale database
Cardinality Estimation • Structural clustered summarization: XSketch (Neoklis ’02) and TreeSketch (Neoklis ’04) • Similar idea as clustered-based index • XSketch uses forward and backward stability, and TreeSketch uses count stability as similarity measurement • Heuristics to reduce graph to fit memory budget
Cardinality Estimation • Decompression-based approach: XSEED (Zhang ’06) • XML documents are compressed into a small kernel with edge cardinality labels • Kernel can be decompressed into XML document with cardinality annotations • Navigational path operator can be reused on the decompressed XML document for cardinality estimation
Cost Modeling • Statistical Learning Cost Model: COMET (Zhang ’05) • Relational operator cost modeling is performed by analyzing the source code • XML operators are much more complex than relational operators; therefore analytical approach is too time-consuming • Statistical learning approach needs a training set of queries and learn the cost model from the input parameters and real cost.