
Storing XML

Based on a tutorial by Sihem Amer-Yahia, given at ICDE 2002.



  1. Storing XML
     Based on a tutorial by Sihem Amer-Yahia, given at ICDE 2002.

  2. Storing XML
     • Effective storage is key to efficient XML processing
     • XML demands its own storage techniques
     • Characteristics of XML data: optional elements & values, repetition, choice, inherent order, large text fragments, mixed content
     • Characteristics of XML queries: document order & structure, full-text search, transformation

  3. Outline
     • Introduction
       • XML Documents
       • XML Queries
     • Existing Storage Techniques
       • Non-native
       • Native
     • Physical Storage Features for XML

  4. I. Introduction

  5. Classes of XML Documents
     • Structured
       • “Un-normalized” relational data
       • Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
     • Mixed
       • Structured data embedded in large text fragments
       • Ex: on-line manuals, transcripts, tax forms
     • An application may process XML in both classes
       • Ex: SOAP messages, where the header is structured and the payload is mixed

  6. Structured Data: HL7 Lab Report
     Health-care industry data-exchange format
     <HL7>
       <PATIENT>
         <PID IDNum="PATID1234">
           <PaNa><FaNa>Jones</FaNa><GiNa>William</GiNa></PaNa>
           <DTofBi><date>1961-06-13</date></DTofBi>
           <Sex>M</Sex>
         </PID>
         <OBX SetID="1">
           <ObsVa>150</ObsVa>
           <ObsId>Na</ObsId>
           <AbnFl>Above high</AbnFl>
         </OBX>
         ...

  7. Queries on Structured Data
     • Essentially XQuery (already discussed in detail)
     • Select-Project-Join, sort by value
       • Ex: Return admission records of patients discharged on 8/30/01, sorted by family and given names
     • Grouping & schema transformation
       • Ex: Return per-patient record of admission, lab reports, doctors’ observations
     • And so forth.

  8. Mixed Data: Library of Congress
     Documents of U.S. Legislation
     <bill bill-stage="Introduction">
       <congress>110th CONGRESS</congress>
       <session>1st Session</session>
       <legis-num>H.R. 133</legis-num>
       <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
       <action date="June 5, 2008">
         <action-desc>
           <sponsor>Mr. English</sponsor> (for himself and <cosponsor>Mr. Coyne</cosponsor>)
           introduced the following bill; which was referred to the
           <committee-name>Committee on Financial Services</committee-name> ...
         </action-desc>

  9. Queries on Mixed Data
     • Full-text search operators
       • Ex: Find all <bill>s where "striking" & "amended" are within 6 intervening words
     • Queries on structure & text
       • Ex: Return <text> element containing both "exemption" & "social security", together with the preceding & following <text> elements
     • Queries that span (ignore) structure
       • Ex: Return <bill> that contains “referred to the Committee on Financial Services”
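The first and third query styles on this slide can be sketched in a few lines of Python. This is only an illustration using the standard library, not how any of the systems in the deck implement full-text search; the function names `words`, `within`, and `contains_phrase`, the tokenizer, and the sample `BILL` document are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

def words(element):
    """Lower-cased word tokens of all text under element, ignoring markup."""
    text = " ".join(element.itertext())
    return re.findall(r"[a-z0-9']+", text.lower())

def within(element, first, second, max_gap):
    """True if first and second occur with at most max_gap words between them."""
    tokens = words(element)
    pos_a = [i for i, w in enumerate(tokens) if w == first]
    pos_b = [i for i, w in enumerate(tokens) if w == second]
    return any(i != j and abs(i - j) - 1 <= max_gap
               for i in pos_a for j in pos_b)

def contains_phrase(element, phrase):
    """True if the phrase occurs in the text, even when tags interrupt it."""
    needle = re.findall(r"[a-z0-9']+", phrase.lower())
    tokens = words(element)
    n = len(needle)
    return any(tokens[i:i + n] == needle for i in range(len(tokens) - n + 1))

# Hypothetical sample document, loosely modelled on the bill on slide 8.
BILL = ET.fromstring(
    "<bill><action-desc>which was referred to the "
    "<committee-name>Committee on Financial Services</committee-name>"
    "</action-desc><text>Section 2 is <quote>amended</quote> by striking "
    "the first sentence.</text></bill>"
)

print(within(BILL, "striking", "amended", 6))
print(contains_phrase(BILL, "referred to the Committee on Financial Services"))
```

Both functions work on the concatenated text, which is why the phrase query succeeds even though `<committee-name>` splits the phrase. A store that only keeps per-element values would have to reassemble that text first.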

  10. Properties of XML Data
      • Variance in structured content
        • Elements of same type have different structure
        • Nested sub-element might depend on parent
        • Direct access to sub-element not required
      • Order significant in sequence & mixed content
      • Structured data embedded in text
      • Schema known a priori or “open content model”
      • These properties require explicit support in the storage system

  11. Properties of Queries
      • Query expressions depend on data properties
      • Variance
        • /PATIENT/(SURGERY | CHECK-UP)
      • Document order: XPath axes
        • /bill/co-sponsor[./text() = "Mrs. Clinton" and following-sibling::co-sponsor/text() = "Mr. Torricelli"]
      • Node identity: equality, union/intersect/except
      • If not supported in the storage system, then operators are semantically incorrect or incomplete. (Why?)
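One way to approach the "(Why?)" is to look at what goes wrong when a store keeps only values. The sketch below is an assumption-laden illustration in Python: the sample documents and the helper `has_following_sibling` are invented for the example, and the helper hand-codes the sibling test rather than relying on an XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical bill with a repeated co-sponsor value.
DOC = ET.fromstring(
    "<bill>"
    "<co-sponsor>Mrs. Clinton</co-sponsor>"
    "<co-sponsor>Mr. Torricelli</co-sponsor>"
    "<co-sponsor>Mrs. Clinton</co-sponsor>"
    "</bill>"
)
# The same two names in the opposite order.
REORDERED = ET.fromstring(
    "<bill>"
    "<co-sponsor>Mr. Torricelli</co-sponsor>"
    "<co-sponsor>Mrs. Clinton</co-sponsor>"
    "</bill>"
)

def has_following_sibling(parent, tag, first_text, later_text):
    """Emulate: tag[text() = first_text and following-sibling::tag/text() = later_text]."""
    siblings = [child for child in parent if child.tag == tag]
    for i, node in enumerate(siblings):
        if node.text == first_text and any(
            later.text == later_text for later in siblings[i + 1:]
        ):
            return True
    return False

nodes = DOC.findall("co-sponsor")
distinct_nodes = len({id(n) for n in nodes})     # node identity: 3 nodes
distinct_values = len({n.text for n in nodes})   # value equality: 2 values

print(distinct_nodes, distinct_values)
print(has_following_sibling(DOC, "co-sponsor", "Mrs. Clinton", "Mr. Torricelli"))
print(has_following_sibling(REORDERED, "co-sponsor", "Mrs. Clinton", "Mr. Torricelli"))
```

A union computed by value would return two co-sponsors where the document has three nodes, and a store that drops sibling order cannot tell `DOC` from `REORDERED` for the sibling query.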

  12. II. Known Storage Techniques

  13. Storage Techniques
      • Non-native
        • (Object) Relational, OO, LDAP directories
        • Indexing, recovery, transactions, updates, optimizers
        • Mapping from XML to target data model necessary
        • Captures variance in structured content
        • No support for mixed content
        • Recovering XML documents is expensive!
      • Native
        • Logical data model is XML
        • Physical storage features designed for XML

  14. Non-native Techniques
      • Generic
        • Mapping from XML data to relational tables
        • Models XML as tree: semi-structured approach
        • Does not use DTD or XML Schema
      • Schema-driven
        • Mapping from schema constructs to relational
        • Fixed mapping from DTD to relational schema
        • Flexible mapping from XML Schema to relational
      • User-defined
        • Labor-intensive

  15. Generic Mappings
      • Edge relation
        • Store all edges in one table
        • Scalar values stored in separate table
      • Attribute relations
        • Horizontal partition of Edge relation
        • Scalar values inlined in same table
      • Universal relation
        • Full outer-join, redundancy
      • Captures node identity & document order
      • Element reconstruction requires multiple joins

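  16. Edge Relation Example
      [Figure: the HL7 lab report as an edge-labeled tree with object ids &0 to &8, shown next to its Edge Table and Value Table. Edge labels from the root down: HL7, PATIENT, PID and OBX, then @IDNum, PaNa, and DTofBi under PID, and date under DTofBi. Leaf values: PATID1234, “Jones Wm”, 1961-06-13.]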
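The Edge and Value tables can be approximated with a short shredding routine. This is a minimal sketch in Python, assuming a column layout of (source, ordinal, label, target) for edges and (oid, text) for values; the function name `shred`, the preorder oid assignment, and the sample `DOC` are assumptions, so the ids need not match the figure. Mixed content is ignored: only leaf elements and attributes contribute values.

```python
import xml.etree.ElementTree as ET
from itertools import count

def shred(xml_text):
    """Shred a document into an edge table and a value table.

    edges:  (source_oid, ordinal, label, target_oid)
    values: (oid, text)
    """
    root = ET.fromstring(xml_text)
    oids = count(1)          # oid 0 is reserved for the document node
    edges, values = [], []

    def visit(parent_oid, ordinal, label, node):
        oid = next(oids)
        edges.append((parent_oid, ordinal, label, oid))
        if isinstance(node, str):        # attribute value: always a leaf
            values.append((oid, node))
            return
        position = 0
        for name, val in node.attrib.items():
            position += 1
            visit(oid, position, "@" + name, val)
        for child in node:
            position += 1
            visit(oid, position, child.tag, child)
        text = (node.text or "").strip()
        if text and len(node) == 0:      # leaf element with scalar content
            values.append((oid, text))

    visit(0, 1, root.tag, root)
    return edges, values

# Hypothetical cut-down version of the HL7 report from slide 6.
DOC = (
    '<HL7><PATIENT><PID IDNum="PATID1234">'
    "<PaNa>Jones Wm</PaNa>"
    "<DTofBi><date>1961-06-13</date></DTofBi>"
    "</PID></PATIENT></HL7>"
)

edges, values = shred(DOC)
for row in edges:
    print(row)
print(values)
```

The ordinal column is what preserves document order, and the oid columns are what preserve node identity. Rebuilding `<PID>` from these rows takes one self-join of the edge table per level of nesting, which is the reconstruction cost the previous slide warns about.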

  17. Generic Mappings: LDAP Directories
      • Flexible schema; easy schema evolution
      • Supports heterogeneous elements with optional values
      • Captures node identity & document order
      • Query language captures subset of XPath

  18. LDAP Example
      XMLElement OC {
        SUBCLASS OF {XMLNode}
        MUST CONTAIN {order}
        MAY CONTAIN {value}
        TYPE order INTEGER
        TYPE value STRING
      }
      XMLAttribute OC {
        SUBCLASS OF {XMLNode}
        MUST CONTAIN {value}
        TYPE value STRING
      }
      oc: XMLElement    oid: 1    name: PID    order: 1
      oc: XMLAttribute  oid: 1.1  name: IDNum  value: PATID1234
      oc: XMLElement    oid: 1.2  name: PaNa   order: 1  value: Jones Wm
      [Figure: tree for PID with children @IDNum (“PATID1234”), PaNa (“Jones Wm”), DTofBi (date 1961-06-13), and Sex (M).]

  19. Schema-driven Mappings
      • Repetition: separate tables
      • Non-repeated sub-elements may be “inlined”
      • Optionality: nullable fields
      • Choice: multiple tables or universal table
      • Order: explicit ordinal value
      • Mixed content ignored
      • Element reconstruction may require multi-table joins because of normalization

  20. Fixed Mapping: Hybrid Inlining
      <!ELEMENT PATIENT (Name, (OBX)*)>
      <!ELEMENT OBX (Name, Value)>
      <!ELEMENT Name (#PCDATA)>
      <!ELEMENT Value (#PCDATA)>
      [Figure: DTD graph in which PATIENT reaches OBX through *, PATIENT and OBX both reach Name, and OBX reaches Value. Resulting relations: PATIENT, OBX.]
      • Element with in-degree = 0 or > 1 in DTD graph → relation
      • Elements with in-degree = 1 inlined, except those reached by *
      • Non-* & non-recursive elements with in-degree > 1 inlined
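Read together, the three rules say that an element gets its own relation when it has in-degree 0, when it is reached through *, or when it has in-degree > 1 and is recursive; everything else is inlined. The Python sketch below encodes that reading. It is an interpretation of the slide rather than the published algorithm, and the dictionary encoding of the DTD graph and the name `hybrid_relations` are assumptions.

```python
def hybrid_relations(dtd):
    """Return the set of elements that get their own relation.

    dtd maps each element name to a list of (child, starred) pairs,
    where starred is True when the child is reached through *.
    Every child must also appear as a key.
    """
    in_degree = {name: 0 for name in dtd}
    starred = set()
    for children in dtd.values():
        for child, is_starred in children:
            in_degree[child] += 1
            if is_starred:
                starred.add(child)

    def reaches_itself(start):
        # Depth-first search for a cycle through start.
        stack, seen = [c for c, _ in dtd[start]], set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(c for c, _ in dtd[node])
        return False

    relations = set()
    for name in dtd:
        if in_degree[name] == 0 or name in starred:
            relations.add(name)
        elif in_degree[name] > 1 and reaches_itself(name):
            relations.add(name)
    return relations

# The DTD from the slide: PATIENT (Name, OBX*), OBX (Name, Value).
DTD = {
    "PATIENT": [("Name", False), ("OBX", True)],
    "OBX": [("Name", False), ("Value", False)],
    "Name": [],
    "Value": [],
}

print(sorted(hybrid_relations(DTD)))
```

For the slide's DTD this yields PATIENT and OBX: Name has in-degree 2 but is neither starred nor recursive, so it is inlined into both parents, and Value is inlined into OBX.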

  21. Flexible Mapping: LegoDB
      • Canonical mapping from XML Schema to relational
        • Every complex type → relation
      • Semantic-preserving XML Schema to XML Schema transformations
        • Ex: inlining/outlining, union factorization/distribution, repetition split
      • Greedy algorithm for choosing mapping
        • Mapping cost determined by query mix
        • Use relational optimizer to estimate cost of mapping

  22. LegoDB Example
      • Inline type in parent vs. outline type in its own relation
      Outlined
        XML:
          type OBX = element value { Integer }, type Description
          type Description = element description { String }
        Relational:
          TABLE OBX (OBX_id INT, value STRING, parent_PATIENT INT)
          TABLE Description (Description_id INT, description STRING, parent_OBX INT)
      Inlined
        XML:
          type OBX = element value { Integer }, element description { String }
        Relational:
          TABLE OBX (OBX_id INT, value STRING, description STRING, parent_PATIENT INT)
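The practical difference between the two layouts shows up at query time. The following Python sketch loads both into an in-memory SQLite database and fetches the same (value, description) pair from each. It is only an illustration: the table names `OBX_out` and `OBX_in` are invented so both variants can coexist, the sample row is made up, and SQLite's TEXT and INTEGER types stand in for the slide's STRING and INT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Outlined: Description lives in its own relation.
    CREATE TABLE OBX_out (OBX_id INTEGER PRIMARY KEY, value TEXT,
                          parent_PATIENT INTEGER);
    CREATE TABLE Description (Description_id INTEGER PRIMARY KEY,
                              description TEXT, parent_OBX INTEGER);

    -- Inlined: description is a column of OBX.
    CREATE TABLE OBX_in (OBX_id INTEGER PRIMARY KEY, value TEXT,
                         description TEXT, parent_PATIENT INTEGER);

    INSERT INTO OBX_out VALUES (1, '150', 10);
    INSERT INTO Description VALUES (1, 'Above high', 1);
    INSERT INTO OBX_in VALUES (1, '150', 'Above high', 10);
""")

# Outlined layout: rebuilding an OBX element needs a join.
outlined = conn.execute("""
    SELECT o.value, d.description
    FROM OBX_out AS o
    JOIN Description AS d ON d.parent_OBX = o.OBX_id
""").fetchall()

# Inlined layout: a single-table scan is enough.
inlined = conn.execute("SELECT value, description FROM OBX_in").fetchall()

print(outlined)
print(inlined)
```

Which layout is cheaper depends on the query mix, which is why the previous slide has the mapping chosen by cost: a workload that rarely touches descriptions may prefer the narrower outlined OBX table.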

  23. User-Defined Mappings
      • No automatic translation from DTD or XML Schema
      • Value-based semantics only
      • Document structure represented by keys/foreign keys
      • No explicit representation of document order or node identity
      • Some support for mixed content

  24. Oracle 9i
      • Canonical mapping into user-defined object-relational tables
      • Arbitrary XML input
        • XSLT preprocessing into multiple XML documents, loaded individually
      • Stores XML documents in CLOBs (character large objects)
        • Permits full-text search
      • Hybrid of canonical mapping & CLOB
      <row>
        <Person>
          <Name><FN>…</FN><LN>…</LN></Name>
          <Addr><City>…</City></Addr>*
        </Person>
      </row>
      table PERSON(Name NAME, Alist ALIST)
      object NAME(FN STR, LN STR)
      table ALIST of ADDR
      object ADDR(City CITY)

  25. MS SQL Server
      • Generic Edge technique with inlined scalar values
      • User-defined decomposition of XML into multiple tables
        • XML data mapped into DOM
        • XPath expressions specify XML values to map into tables
          • Rows in table, Ex: /Customer/Orders → row in table ORDER
          • Columns in row, Ex: ./OrderDate → OrderDateColumn
      • Text content modeled in CLOBs

  26. Native Techniques
      • Built from scratch
        • NatiX (University of Mannheim, Germany)
        • Xyleme (France)
        • Xindice (Apache, open source)
        • TIMBER (U. of Michigan; uses Shore)
      • Re-tool existing systems to handle XML
        • Tamino: hierarchical database (ADABAS)
        • eXcelon: OODB
      • Design efficient data structures for compact storage and fast access; data partitioning; indexing on both values and structure

  27. NatiX
      • Unit of storage = element
      • Elements clustered to minimize page hits
      • Inter-element pointers capture document structure
      • Low-level algorithmic support for read/write/insert/delete operations
      • No use of DTDs or XML Schema

  28. Xyleme
      • Data layout: based on NatiX
      • Indexing: sophisticated indexing of text and elements
      • Query support: XPath, XQuery, updates
      • A data warehouse for XML content: store, classify, index, integrate, query, and monitor massive volumes of XML content
      • Semantic services: extensible thesauri and schema mappers that enable the system to go beyond simple indexing

  29. Software AG Tamino
      • Extends Adabas: nested relations
      • Indexing: value and structure
      • Query support:
        • Full-text search operators
        • Queries return entire document or some projection of document
        • No construction of new XML values (unlike XQuery)
      • Access control at the node level; transactions; triggers; backup/restore; compression; support for multimedia documents, e.g., video

  30. TIMBER
      • Underlying storage manager: SHORE
      • Store XML documents as trees in preorder
      • Use node pedigrees ([StartPos, EndPos, Level]) in an essential way to capture order and identity
      • Support index on tag, value, and pedigree
      • Underlying algebra: TAX (tree algebra for XML)
      • Underlying physical algebra: key operator is the structural join, with its semi-, outer-, and anti- variants
      • Optimal join ordering: problem similar to RDBs
      • Cost of constructing output dominated by that of finding valid bindings from DB for query variables
      • Tree pattern and generalized tree pattern match
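The role of the [StartPos, EndPos, Level] pedigree can be shown with a small Python sketch. It is not TIMBER's code: the numbering scheme (one counter incremented on entering and on leaving each element), the function names, and the sample document are assumptions, and the join is a plain nested loop where a real system would use a merge- or stack-based algorithm over sorted inputs.

```python
import xml.etree.ElementTree as ET

def label(root):
    """Return (tag, start, end, level) for every element, in preorder."""
    labels = []
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1
        start = counter
        index = len(labels)
        labels.append(None)              # reserve this node's preorder slot
        for child in node:
            visit(child, level + 1)
        counter += 1
        labels[index] = (node.tag, start, counter, level)

    visit(root, 0)
    return labels

def is_ancestor(a, d):
    """a contains d exactly when d's interval nests inside a's."""
    return a[1] < d[1] and d[2] < a[2]

def is_parent(a, d):
    """Parent-child is containment plus a level difference of one."""
    return is_ancestor(a, d) and d[3] == a[3] + 1

def structural_join(ancestors, descendants):
    """All (ancestor, descendant) pairs; nested loop kept for clarity."""
    return [(a, d) for a in ancestors for d in descendants if is_ancestor(a, d)]

# Hypothetical document shaped like the HL7 example.
ROOT = ET.fromstring(
    "<PATIENT><PID><PaNa/></PID>"
    "<OBX><ObsVa/></OBX><OBX><ObsVa/></OBX></PATIENT>"
)
LABELS = label(ROOT)
OBX = [n for n in LABELS if n[0] == "OBX"]
OBSVA = [n for n in LABELS if n[0] == "ObsVa"]
PAIRS = structural_join(OBX, OBSVA)

print(LABELS[0])
print(PAIRS)
```

Containment, parenthood, and document order all reduce to integer comparisons on the pedigree, so a pattern such as OBX//ObsVa can be answered from a tag index without walking the tree.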

  31. Other Native Systems
      • Xindice http://xml.apache.org/xindice/
        • Query support: XPath for its query language and XML:DB XUpdate for its update language
        • APIs: XML:DB API for Java development; other languages using an available XML-RPC plugin
      • GoXML
        • XQuery, full-text searching
        • Tree insert, replace, and delete

  32. Update Support
      • XQuery does not support updates (yet…)
      • How to update?
        • Flat streams: overwrite document
        • Non-native: SQL
        • Native: DOM, proprietary APIs
      • But how do you know you have not violated the schema for which the mapping was defined?
        • Flat streams: re-parse document (how do we check ICs?)
        • Non-native: need to understand the mapping and maintain integrity constraints
        • Native: supported in some systems (e.g., eXcelon)

  33. Summary
      • Non-native
        • Treats target system as black box
        • Mismatch between data models requires mapping
        • Supporting order-sensitive queries can be expensive
        • May require changes to schema to support new tags
        • Introduces redundancies & necessity of joins
        • No control of physical layout of data
      • Native
        • No mismatch between logical data models
        • Focus on physical layout (clustering, indices, …)
        • Extensible: no schema or DTD needed

  34. Conclusion
      • XML data requires new storage features
        • Real applications depend upon XML data properties
        • Normalization is not always appropriate
      • Schema of XML data should drive storage
        • Real-world data comes with its own schema
        • Schema as a basis for querying
      • Handling mixed content is an important research problem

  35. More Resources
      • W3C Documents: http://www.w3.org/TR/
      • W3C XML Query page: http://www.w3.org/XML/Query.html
      • XML Query Implementations & Demos
        • Galax (AT&T, Lucent, and Avaya): http://www-db.research.bell-labs.com/galax/
        • Quip (Software AG): http://www.softwareag.com/developer/quip/
        • XQuery demo (Microsoft): http://131.107.228.20/xquerydemo/
        • Fraunhofer IPSI XQuery Prototype: http://xml.ipsi.fhg.de/xquerydemo/
        • XQengine (Fatdog): http://www.fatdog.com/
        • X-Hive: http://217.77.130.189/xquery/index.html
        • OpenLink: http://demo.openlinksw.com:8391/xquery/demo.vsp

  36. References (Research)
      • Serge Abiteboul, Sophie Cluet, Tova Milo: Querying and Updating the File. VLDB 1993
      • D. Barbosa, A. Barta, A. Mendelzon, G. Mihaila, F. Rizzolo, P. Rodriguez-Gianolli: ToX – The Toronto XML Engine. International Workshop on Information Integration on the Web, Rio de Janeiro, 2001
      • Phil Bohannon, Juliana Freire, Prasan Roy, Jérôme Siméon: From XML Schema to Relations: A Cost-Based Approach to XML Storage. ICDE 2002
      • Michael J. Carey, Jerry Kiernan, Jayavel Shanmugasundaram, Eugene J. Shekita, Subbu N. Subramanian: XPERANTO: Middleware for Publishing Object-Relational Data as XML Documents. VLDB 2000
      • Qiming Chen, Yahiko Kambayashi: Nested Relation Based Database Knowledge Representation. SIGMOD Conference 1991
      • Vassilis Christophides, Sophie Cluet, Jérôme Siméon: On Wrapping Query Languages and Efficient XML Integration. SIGMOD Conference 2000: 141-152
      • Alin Deutsch, Mary F. Fernandez, Dan Suciu: Storing Semistructured Data with STORED. SIGMOD Conference 1999
      • Daniela Florescu, Donald Kossmann: A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. IEEE Data Eng. Bulletin 1999
      • Minos N. Garofalakis, Aristides Gionis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim: XTRACT: A System for Extracting Document Type Descriptors from XML Documents. SIGMOD Conference 2000
      • Roy Goldman, Jennifer Widom: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB 1997

  37. References (Research)
      • P.J. Marron, G. Lausen: On Processing XML in LDAP. VLDB 2001
      • Carl-Christian Kanne, Guido Moerkotte: Efficient Storage of XML Data. Technical Report 8/99, University of Mannheim, 1999
      • Feng Tian, David J. DeWitt, Jianjun Chen, Chun Zhang: The Design and Performance Evaluation of Various XML Storage Strategies. Technical report, University of Wisconsin
      • Masatoshi Yoshikawa, Takeyuki Shimura, Shunsuke Uemura: XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases
      • Chun Zhang, Jeffrey F. Naughton, David J. DeWitt, Qiong Luo, Guy M. Lohman: On Supporting Containment Queries in Relational Database Management Systems. SIGMOD 2001
      • Justin Zobel, James A. Thom, Ron Sacks-Davis: Efficiency of Nested Relational Document Database Systems. VLDB 1991

  38. References (W3C)
      • W3C Recommendation: Extensible Markup Language (XML) 1.0 (Second Edition). http://www.w3.org/TR/REC-xml, 2000
      • W3C Recommendation: Namespaces in XML. http://www.w3.org/TR/REC-xml-names, 1999
      • W3C Working Draft: XML Path Language (XPath) 2.0. http://www.w3.org/TR/xpath20, 2001
      • W3C: XML representation of a relational database. http://www.w3.org/XML/RDB.html
      • W3C Recommendation: XML Schema Part 0: Primer. http://www.w3.org/TR/xmlschema-0, 2001
      • W3C Recommendation: XML Schema Part 1: Structures. http://www.w3.org/TR/xmlschema-1, 2001
      • W3C Recommendation: XML Schema Part 2: Datatypes. http://www.w3.org/TR/xmlschema-2, 2001
      • W3C Recommendation: XSL Transformations (XSLT) 1.0. http://www.w3.org/TR/xslt, 1999
      • W3C Working Draft: XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery, 2001

  39. References (Products)
      • Ronald Bourret: XML Database Products. http://www.rpbourret.com/xml/XMLDatabaseProds.htm, July 2001
      • Sandeepan Banerjee, Vishu Krishnamurthy, Muralidhar Krishnaprasad, Ravi Murthy: Oracle8i - The XML Enabled Data Management System. ICDE 2000
      • Oracle9i Application Developer's Guide – XML Release 1 (9.0.1)
      • eXcelon: Extensible Information Server White Paper. eXcelon Corporation, 2001
      • Josephine M. Cheng, Jane Xu: XML and DB2. ICDE 2000: 569-573
      • IBM DB2 Universal Database XML Extender Administration and Programming Version 7. 2001
      • Microsoft SQL Server Books Online
      • Michael Rys: Bringing the Internet to Your Database: Using SQL Server 2000 and XML to Build Loosely-Coupled Systems. ICDE 2001: 465-472
