XML Databases Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems March 23, 2005
Administrivia • We’re moving beyond simple databases now… • For Monday – read & compare focus of: • Hanson: Scalable Trigger Processing • Stanford STREAM processor • For Wednesday: • Retrospective on Aurora
XML: What Makes It Hard? • It’s not normalized… • It conceptually centers around some origin, meaning that navigation becomes central • Contrast with E-R diagrams • How to store the hierarchy? • Complex navigation • Updates, locking • Optimization • Also, it’s ordered • May restrict order of evaluation (or at least presentation) • Makes updates more complex • Many of these issues aren’t unique to XML • Semistructured databases, esp. with ordered collections, were similar • But our efforts in that area basically failed…
XML: What’s It Good For? • Collections of text documents, e.g., the Web, doc DBs • … How would we want to query those? • IR/text queries, path queries, XQueries? • Interchanging data • SOAP messages, RSS, XML streams • Perhaps subsets of data from RDBMSs • Storing native, database-like XML data • Caching • Logging of XML messages • …?
Lots of XML Research Out There • Text: • Hybrids of database and IR techniques for search • (e.g., Amer-Yahia & Shanmugasundaram, Weikum & Ramakrishnan, …) • Interchange: • Web service verification • XML stream processing • XML databases: • Natix, TIMBER, … • Tamino, DB2 UDB, Oracle, …
The Main Focal Points • XML with documents • Inverted indices • Integration of ranking into DBMS • Interaction between structure and content • “Streaming XML” • RDBMS XML export • Partitioning of computation between source and mediator • “Streaming XPath” engines • XML databases • Hierarchical storage + locking (Natix, TIMBER, BerkeleyDB, Tamino, …) • Query optimization
Text-Based XML • The fundamental questions: • How should we model ranking in query processing? • Simply as another value (e.g., Amer-Yahia & Shanmugasundaram) • Using a probabilistic model or as an undefined metric • e.g., Weikum and Ramakrishnan work-in-progress • How does structure affect ranking? • PageRank-style (e.g., Shanmugasundaram et al.) • Query relaxation (FleXPath) • Other? • How do we achieve efficient pruning? • A* search [Cohen 98] • Fagin’s Threshold Algorithm • Custom logic? • How do we integrate keyword indexing with structural indexing? • Multiple indices (e.g., Lore, Natix, …) • Integrated indices (e.g., ViST)
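As a concrete reference point for the pruning question, here is a minimal sketch of Fagin's Threshold Algorithm. It assumes one score list per query predicate, a sum as the aggregate function, and a score of 0 for documents missing from a list; those choices, the function name `threshold_topk`, and the data layout are illustrative assumptions, not something the slide specifies.

```python
import heapq

def threshold_topk(lists, k):
    """lists: one dict per predicate, mapping doc -> score (hypothetical layout).
    Returns (top-k list of (total_score, doc), sorted-access depth reached)."""
    # Sorted access: each list is read in descending score order.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = {}  # doc -> aggregated score, filled in by random access
    max_depth = max(len(s) for s in sorted_lists)
    depth = 0
    while depth < max_depth:
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                doc, score = s[depth]
                threshold += score  # best score an unseen doc could still have here
                if doc not in seen:
                    # Random access: look the doc up in every list.
                    seen[doc] = sum(l.get(doc, 0.0) for l in lists)
            # An exhausted list contributes 0: unseen docs are absent from it.
        depth += 1
        top = heapq.nlargest(k, ((sc, d) for d, sc in seen.items()))
        # Stop once k seen docs score at least as high as any unseen doc could.
        if len(top) == k and top[-1][0] >= threshold:
            return top, depth
    return heapq.nlargest(k, ((sc, d) for d, sc in seen.items())), depth
```

With two lists of three documents each, the top-1 answer can be returned after reading only two entries per list, which is the pruning the slide asks about.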
XML as a Wire Format • RDBMS XML export • SilkRoute and Xperanto, outer unions • Interaction with RDBMS optimization techniques • Updates [Tatarinov+01] • Cascading updates are already possible in RDBMSs • Updating XML views • Streaming XML • SAX-based XPath-matching engines [Ives+01][Altinel&Franklin00][Green+02] [Diao&Franklin][Chen+] … • Push-down of XPath matching as early as possible • Query decomposition (still in need of a standard means of pushing XQuery to a source) • Subsets of XQuery that are amenable to streaming
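To make "SAX-based XPath matching" concrete, the following is a minimal sketch that matches an absolute, child-axis-only path (such as `/db/book/title`) against a SAX event stream without building a tree. It is far simpler than the cited engines: no `//`, no wildcards, no predicates. The class name `PathMatcher` and the helper `match_path` are hypothetical.

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Collects the text of every element whose root-to-element tag path
    equals the given absolute path (child steps only)."""
    def __init__(self, path):
        super().__init__()
        self.steps = path.strip("/").split("/")
        self.stack = []      # tags of the currently open elements
        self.text = None     # text buffer while inside a match, else None
        self.matches = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.steps:
            self.text = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so buffer it.
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        if self.stack == self.steps:
            self.matches.append("".join(self.text))
            self.text = None
        self.stack.pop()

def match_path(xml_bytes, path):
    handler = PathMatcher(path)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matches
```

The only state is the stack of open tags, so memory is bounded by document depth rather than document size; that property is what makes pushing path matching close to the source attractive.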
XML in a Database • Use a legacy RDBMS • Shredding [Shanmugasundaram+99] and many others • Path-based encodings [Cooper+01] • Region-based encodings [Bruno+02][Chen+04] • Order preservation in updates [Tatarinov+02], … • What’s novel here? How does this relate to materialized views and warehousing? • Native XML databases • Hierarchical storage (Natix, TIMBER, BerkeleyDB, Tamino, …) • Updates and locking • Query optimization (e.g., that on Galax)
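A minimal sketch of a region-based (interval) encoding, in the spirit of the region-based work cited above but not taken from any of those papers: each element gets a `(start, end, level)` triple from a depth-first numbering, and ancestor/descendant tests reduce to interval containment. The function names and the row layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def region_encode(root):
    """Return rows (tag, start, end, level) in document order."""
    rows = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        row = [elem.tag, counter, None, level]  # end is filled in on the way out
        rows.append(row)
        for child in elem:
            visit(child, level + 1)
        counter += 1
        row[2] = counter

    visit(root, 0)
    return [tuple(r) for r in rows]

def is_ancestor(a, d):
    # a contains d exactly when d's interval nests strictly inside a's.
    return a[1] < d[1] and d[2] < a[2]

def is_parent(a, d):
    return is_ancestor(a, d) and d[3] == a[3] + 1
```

Stored as one relational table, a descendant step becomes a range join on `start` and `end`, which is why this encoding fits a legacy RDBMS. Inserting a node shifts the numbers after it, which is where the order-preservation work on updates comes in.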
Query Processing for XML • Why is optimization harder? • Hierarchy means many more joins (conceptually) • “traverse”, “tree-match”, “x-scan”, “unnest”, “path”, … operators • Though typically parent-child relationships • Often don’t have a good measure of “fan-out” • More ways of optimizing this • Order preservation limits processing in many ways • Nested content ~ left outer join • Except that we need to cluster a collection with the parent • Relationship with the NF² (non-first-normal-form) approach • Tags (don’t really add much complexity except in trying to encode efficiently) • Complex functions and recursion • Few real DB systems implement these fully • Why is storage harder? • That’s the focus of Natix, really
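The "nested content ~ left outer join" point can be sketched as follows: a left outer join keeps parents that have no children, and a grouping step then clusters each collection with its parent. The relation layouts `(pid, name)` and `(pid, value)` and both function names are hypothetical.

```python
from itertools import groupby

def left_outer_join(parents, children):
    """parents: list of (pid, name); children: list of (pid, value).
    Emits (pid, name, value) rows, padding childless parents with None."""
    by_pid = {}
    for pid, value in children:
        by_pid.setdefault(pid, []).append(value)
    rows = []
    for pid, name in parents:  # parent order is preserved
        for value in by_pid.get(pid, [None]):
            rows.append((pid, name, value))
    return rows

def nest(rows):
    """Cluster each parent's rows into one nested collection.
    groupby only merges adjacent rows, so the input must already be
    clustered by parent; this is the extra requirement beyond the join."""
    out = []
    for (pid, name), group in groupby(rows, key=lambda r: (r[0], r[1])):
        out.append((name, [v for _, _, v in group if v is not None]))
    return out
```

A plain inner join would silently drop the childless parent, and a join that does not keep rows clustered by parent would force a sort before nesting.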
The Natix System • In contrast to many pieces of work on XML, focuses on the bottom layers, equivalent to System R’s RSS • Physical layout • Indexing • Locking/concurrency control • Logging/recovery
Physical Layout • What are our options in storing XML trees? • At some level, it’s all smoke-and-mirrors • Need to map to “flat” byte sequences on disk • But several options: • Shred completely, as in many RDBMS mappings • Each path may get its own contiguous set of pages • e.g., vectorized XML [Buneman et al.] • An element may get its 1:1 children • e.g., shared inlining [Shanmugasundaram+] and [Chen+] • All content may be in one table • e.g., [Florescu/Kossmann] and most interval encoded XML • We may embed a few items on the same page and “overflow” the rest • How collections are often stored in ORDBMS • We may try to cluster XML trees on the same page, as “interpreted BLOBs” • This is Natix’s approach (and also IBM’s DB2) • Pros and cons of these approaches?
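For comparison with the clustered approach, here is a minimal sketch of the "all content in one table" option: every element becomes one row of an edge table keyed by its parent. The column layout `(node_id, parent_id, ordinal, tag, text)` is an assumption for illustration and is not claimed to be the exact schema of [Florescu/Kossmann].

```python
import xml.etree.ElementTree as ET

def shred_to_edges(root):
    """Return one row per element: (node_id, parent_id, ordinal, tag, text)."""
    rows = []
    next_id = 0

    def visit(elem, parent_id, ordinal):
        nonlocal next_id
        node_id = next_id
        next_id += 1
        rows.append((node_id, parent_id, ordinal, elem.tag, (elem.text or "").strip()))
        for i, child in enumerate(elem):  # ordinal records sibling order
            visit(child, node_id, i)

    visit(root, None, 0)
    return rows

def child_step(rows, parent_tag, child_tag):
    """One parent/child path step, written as a self-join on the edge table."""
    parent_ids = {r[0] for r in rows if r[3] == parent_tag}
    return [r for r in rows if r[3] == child_tag and r[1] in parent_ids]
```

Each path step costs one self-join, so a long path means many joins over a table with no locality between a parent and its children. That is the main cost the clustered, page-per-tree layout tries to avoid.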
Challenges of the Page-per-Tree Approach • How big of a tree? • What happens if the XML overflows the tree? • Natix claims an adaptive approach to choosing the tree’s granularity • Primarily based on balancing the tree, constraints on children that must appear with a parent • What other possibilities make sense? • Natix uses a B+ Tree-like scheme for achieving balance and splitting a tree across pages
Example • [Figure not reproduced: a tree split across pages, showing the split point in the parent page; note the “proxy” nodes]
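The idea of splitting a tree across pages and leaving "proxy" nodes behind can be sketched with a simple greedy rule. This is not Natix's actual split algorithm (which the slides describe as B+ Tree-like and balance-aware); it only illustrates proxies. The assumptions: every node has a size, a proxy costs `PROXY_SIZE`, and a node plus one proxy per child always fits on a page. All names are hypothetical.

```python
PROXY_SIZE = 1

class Node:
    def __init__(self, tag, size=1, children=None):
        self.tag = tag
        self.size = size
        self.children = children or []

def subtree_size(node):
    return node.size + sum(subtree_size(c) for c in node.children)

def store(node, capacity, pages):
    """Place node's subtree starting on a fresh page; return the page number.
    A page is (root_tag, entries); each entry is ("subtree", Node) for a child
    kept inline or ("proxy", page_no) for a child moved to another page."""
    page_no = len(pages)
    pages.append(None)  # reserve the slot before recursing
    free = capacity - node.size
    remaining = len(node.children)
    entries = []
    for child in node.children:
        remaining -= 1
        need = subtree_size(child)
        # Keep room for a proxy for every later sibling.
        if need <= free - remaining * PROXY_SIZE:
            entries.append(("subtree", child))
            free -= need
        else:
            entries.append(("proxy", store(child, capacity, pages)))
            free -= PROXY_SIZE
    pages[page_no] = (node.tag, entries)
    return page_no
```

Small subtrees stay clustered with their parent, and only the subtree that does not fit is moved out and replaced by a proxy, so navigation crosses a page boundary only at proxies.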
That Was Simple – But What about Updates? • Clearly, insertions and deletions can affect things • Deletion may ultimately require us to rebalance • Ditto with insertion • But insertion also may make us run out of space – what to do? • Their approach: add another page; ultimately may need to split at multiple levels, as in B+ Tree • Others have studied this problem and used integer encoding schemes (plus B+ Trees) for the order
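The integer-encoding alternative mentioned in the last bullet can be sketched as gapped order keys: siblings get sparse integers, an insertion takes the midpoint of its neighbors, and a renumbering happens only when a gap is exhausted. The gap size, the function names, and the renumber-everything policy are illustrative assumptions; a real system would keep the keys in a B+ Tree and renumber locally.

```python
GAP = 100

def initial_keys(n):
    """Sparse order keys for n siblings: 100, 200, ..."""
    return [GAP * (i + 1) for i in range(n)]

def insert_at(keys, pos):
    """Insert a new order key at index pos of the sorted list `keys`.
    Returns (new_keys, renumbered)."""
    lo = keys[pos - 1] if pos > 0 else 0
    hi = keys[pos] if pos < len(keys) else lo + 2 * GAP
    if hi - lo >= 2:
        # There is an unused integer between the neighbors: no other key moves.
        return keys[:pos] + [(lo + hi) // 2] + keys[pos:], False
    # Gap exhausted: fall back to renumbering all siblings.
    return initial_keys(len(keys) + 1), True
```

Most insertions touch a single key, and the cost of preserving order is paid only on the occasional renumbering.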
Does this Help? • According to general lore, yes • The Natix experiments in this paper were limited in their query and adaptivity loads • But the IBM guys say their approach, which is similar, works significantly better than Oracle’s shredded approach
There’s More to Updates than the Pages • What about concurrency control and recovery? • We already have a notion of hierarchical locks, but they claim: • If we want to support IDREF traversal, and indexing directly to nodes, we need more • What’s the idea behind SPP locking?
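As background for the "hierarchical locks" bullet, here is a minimal sketch of classical multi-granularity locking applied to tree paths: a transaction takes intention locks (IS or IX) on every ancestor before taking S or X on the target node. It does not implement SPP locking, which the slide leaves as a question. The `LockManager` class and the tuple-of-tags node addressing are hypothetical.

```python
# Standard multi-granularity compatibility: COMPAT[requested] is the set of
# modes another transaction may already hold on the same node.
COMPAT = {
    "IS": {"IS", "IX", "S", "SIX"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "SIX": {"IS"},
    "X": set(),
}

class LockManager:
    def __init__(self):
        self.held = {}  # node path (tuple of tags) -> list of (txn, mode)

    def _grantable(self, path, txn, mode):
        return all(m in COMPAT[mode]
                   for t, m in self.held.get(path, []) if t != txn)

    def lock(self, txn, path, mode):
        """Request S or X on the node at `path`; returns True if granted."""
        intent = "IS" if mode == "S" else "IX"
        # Intention locks on every proper ancestor, then the real lock.
        requests = [(path[:i], intent) for i in range(1, len(path))]
        requests.append((path, mode))
        if not all(self._grantable(p, txn, m) for p, m in requests):
            return False
        for p, m in requests:
            self.held.setdefault(p, []).append((txn, m))
        return True
```

The protocol works because every access walks down from the root and so meets the intention locks on the way. An IDREF traversal or an index lookup lands directly on a node without walking that path, which is one reading of why the authors claim more is needed.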
Logging • They claim ARIES needs some modifications – why? • Their changes: • Need to make subtree updates more efficient – don’t want to write a log entry for each subtree insertion • Use (a copy of) the page itself as a means of tracking what was inserted, then batch-apply to WAL • “Annihilators”: if we undo a tree creation, then we probably don’t need to worry about undoing later changes to that tree • A few minor tweaks to minimize undo/redo when only one transaction touches a page
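The "annihilator" bullet can be illustrated with a small undo planner: when a transaction that created a tree is rolled back, dropping the tree makes individual undo of its later changes to that tree unnecessary. This is one reading of the slide, not the paper's actual mechanism; the log record shape `(op, tree_id)` and the function name are hypothetical.

```python
def plan_undo(log):
    """log: (op, tree_id) records of ONE aborting transaction, oldest first.
    Returns the undo actions to perform, newest first."""
    created = {tree for op, tree in log if op == "create"}
    actions = []
    for op, tree in reversed(log):
        if tree in created:
            # Undoing the creation annihilates every other change to this tree.
            if op == "create":
                actions.append(("drop", tree))
        else:
            actions.append(("undo_" + op, tree))
    return actions
```

In the test case below, two inserts into a tree the transaction itself created produce no undo work; only the drop and the changes to the pre-existing tree remain.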
Assessment • Native XML storage isn’t really all that different from other means of storage • There are probably some good reasons to make a few tweaks in locking • Optimization stays harder • A real solution to materialized view creation would probably make RDBMSs come close to delivering the same performance, modulo locking
Questions • Where are the main challenges of XML processing at this point? • Impact of BinaryXML? • Are we working on the right problems? What’s XML going to be used for, anyway?