Storing XML
Based on a tutorial by Sihem Amer-Yahia, given at ICDE 2002.
Storing XML
• Effective storage is the key to efficient XML processing
• XML demands its own storage techniques
• Characteristics of XML data: optional elements & values, repetition, choice, inherent order, large text fragments, mixed content
• Characteristics of XML queries: document order & structure, full-text search, transformation
Outline
• Introduction
  • XML Documents
  • XML Queries
• Existing Storage Techniques
  • Non-native
  • Native
• Physical Storage Features for XML
I. Introduction
Classes of XML Documents
• Structured
  • “Un-normalized” relational data
    Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
• Mixed
  • Structured data embedded in large text fragments
    Ex: on-line manuals, transcripts, tax forms
• An application may process XML in both classes
    Ex: SOAP messages, where the header is structured and the payload is mixed
Structured Data: HL7 Lab Report
Health-care industry data-exchange format
<HL7>
  <PATIENT>
    <PID IDNum="PATID1234">
      <PaNa><FaNa>Jones</FaNa><GiNa>William</GiNa></PaNa>
      <DTofBi><date>1961-06-13</date></DTofBi>
      <Sex>M</Sex>
    </PID>
    <OBX SetID="1">
      <ObsVa>150</ObsVa>
      <ObsId>Na</ObsId>
      <AbnFl>Above high</AbnFl>
    </OBX>
    ...
Queries on Structured Data
• Essentially XQuery (already discussed in detail)
• Select-Project-Join, sort by value
    Ex: Return admission records of patients discharged on 8/30/01, sorted by family and given names
• Grouping & schema transformation
    Ex: Return per-patient record of admission, lab reports, doctors’ observations
• And so forth.
Mixed Data: Library of Congress
Documents of U.S. Legislation
<bill bill-stage="Introduction">
  <congress>110th CONGRESS</congress>
  <session>1st Session</session>
  <legis-num>H.R. 133</legis-num>
  <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
  <action date="June 5, 2008">
    <action-desc>
      <sponsor>Mr. English</sponsor> (for himself and <cosponsor>Mr.Coyne</cosponsor>)
      introduced the following bill; which was referred to the
      <committee-name>Committee on Financial Services</committee-name> ...
    </action-desc>
Queries on Mixed Data
• Full-text search operators
    Ex: Find all <bill>s where "striking" & "amended" are within 6 intervening words
• Queries on structure & text
    Ex: Return <text> element containing both "exemption" & "social security" and preceding & following <text> elements
• Queries that span (ignore) structure
    Ex: Return <bill> that contains “referred to the Committee on Financial Services”
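The first and third query classes above can be illustrated with a small sketch. It is not taken from the tutorial: the function name `within_distance` and the sample `<bill>` fragment are assumptions made for illustration. The text of the element is flattened first, so inline markup is ignored, and the proximity test then runs over word positions.

```python
import re
import xml.etree.ElementTree as ET


def within_distance(text, word_a, word_b, max_gap):
    """Return True if word_a and word_b occur with at most max_gap words between them."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, tok in enumerate(tokens) if tok == word_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == word_b]
    # Number of intervening words between positions i and j is |i - j| - 1.
    return any(abs(i - j) - 1 <= max_gap for i in pos_a for j in pos_b)


# Hypothetical fragment: the <em> tag splits the sentence structurally.
bill = ET.fromstring(
    "<bill><text>Section 2 is <em>amended</em> by striking "
    "the first sentence.</text></bill>"
)

# itertext() walks all text nodes in document order, so markup is spanned.
flat = " ".join("".join(bill.itertext()).split())

print(within_distance(flat, "striking", "amended", 6))
```

A storage scheme that keeps only element values in separate rows would have to reassemble `flat` before it could answer either query, which is one reason mixed content is hard for non-native mappings.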
Properties of XML Data
• Variance in structured content
  • Elements of same type have different structure
  • Nested sub-element might depend on parent
  • Direct access to sub-element not required
• Order significant in sequence & mixed content
• Structured data embedded in text
• Schema known a priori or “open content model”
• These properties require explicit support in the storage system
Properties of Queries
• Query expressions depend on data properties
  • Variance
      /PATIENT/(SURGERY | CHECK-UP)
  • Document order: XPath axes
      /bill/co-sponsor[./text() = "Mrs.Clinton" and following-sibling::co-sponsor/text() = "Mr. Torricelli"]
  • Node identity: equality, union/intersect/except
• If these are not supported in the storage system, then operators are semantically incorrect or incomplete. (Why?)
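One way to approach the "(Why?)" is the sketch below. It is an illustration, not part of the tutorial; the element names and values are made up. Two co-sponsor elements carry the same text, so a store that compares nodes by value merges them, and a store that drops sibling order cannot evaluate a following-sibling step.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the first and third elements have equal values.
doc = ET.fromstring(
    "<bill>"
    "<cosponsor>Mr. Smith</cosponsor>"
    "<cosponsor>Mrs. Jones</cosponsor>"
    "<cosponsor>Mr. Smith</cosponsor>"
    "</bill>"
)
nodes = list(doc)

# Value-based semantics: duplicates collapse, so a union over nodes is wrong.
by_value = {n.text for n in nodes}

# Identity-based semantics: every element is a distinct node.
by_identity = {id(n) for n in nodes}

print(len(by_value), len(by_identity))


def following_siblings(parent, node):
    """Siblings after `node` in document order; needs the stored order."""
    children = list(parent)
    return children[children.index(node) + 1:]


print([n.text for n in following_siblings(doc, nodes[1])])
```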
II. Known Storage Techniques
Storage Techniques
• Non-native
  • (Object) Relational, OO, LDAP directories
  • Indexing, recovery, transactions, updates, optimizers
  • Mapping from XML to target data model necessary
  • Captures variance in structured content
  • No support for mixed content
  • Recovering XML documents is expensive!
• Native
  • Logical data model is XML
  • Physical storage features designed for XML
Non-native Techniques
• Generic
  • Mapping from XML data to relational tables
  • Models XML as tree: semi-structured approach
  • Does not use DTD or XML Schema
• Schema-driven
  • Mapping from schema constructs to relational
  • Fixed mapping from DTD to relational schema
  • Flexible mapping from XML Schema to relational
• User-defined
  • Labor-intensive
Generic Mappings
• Edge relation
  • Store all edges in one table
  • Scalar values stored in separate table
• Attribute relations
  • Horizontal partition of Edge relation
  • Scalar values inlined in same table
• Universal relation
  • Full outer-join, redundancy
• Captures node identity & document order
• Element reconstruction requires multiple joins
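Edge Relation Example
[Figure: the HL7 lab report as an edge-labeled tree, shown with its Edge table and Value table; the table rows did not survive extraction]
&0 -HL7-> &1 -PATIENT-> &2
&2 -PID-> &3, &2 -OBX-> &4 …
&3 -@IDNum-> &5 = PATID1234
&3 -PaNa-> &6 = “Jones Wm”
&3 -DTofBi-> &7 -date-> &8 = 1961-06-13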
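A minimal sketch of how a document might be shredded into an Edge table and a Value table follows. The row layouts `(source, ordinal, label, target)` and `(node, text)` and the function name `shred` are assumptions for illustration; the tutorial's exact table schema is not shown in the surviving text. Node ids are assigned depth-first here, and text in mixed content is ignored.

```python
import xml.etree.ElementTree as ET
from itertools import count


def shred(xml_text):
    """Return (edges, values) for one XML document.

    edges:  (source_id, ordinal, label, target_id), one row per tree edge
    values: (node_id, text), one row per attribute or text-only element
    """
    root = ET.fromstring(xml_text)
    ids = count(0)
    edges, values = [], []

    def visit(elem, node_id):
        ordinal = 0
        # Attributes become edges labeled "@name" that point at value nodes.
        for name, val in elem.attrib.items():
            ordinal += 1
            child = next(ids)
            edges.append((node_id, ordinal, "@" + name, child))
            values.append((child, val))
        # The ordinal column preserves document order among siblings.
        for sub in elem:
            ordinal += 1
            child = next(ids)
            edges.append((node_id, ordinal, sub.tag, child))
            visit(sub, child)
        # Only leaf text is kept; mixed content is out of scope for this sketch.
        text = (elem.text or "").strip()
        if text and len(elem) == 0:
            values.append((node_id, text))

    doc_id = next(ids)   # &0, the document node
    root_id = next(ids)  # &1, the root element
    edges.append((doc_id, 1, root.tag, root_id))
    visit(root, root_id)
    return edges, values


edges, values = shred(
    '<HL7><PATIENT><PID IDNum="PATID1234"><Sex>M</Sex></PID></PATIENT></HL7>'
)
for row in edges:
    print(row)
```

Rebuilding the `PID` element from these rows takes one self-join of the Edge table per level, plus a join with the Value table, which is the reconstruction cost the previous slide refers to.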
Generic Mappings: LDAP Directories
• Flexible schema; easy schema evolution
• Supports heterogeneous elements with optional values
• Captures node identity & document order
• Query language captures subset of XPath
LDAP Example
XMLElement OC {
  SUBCLASS OF {XMLNode}
  MUST CONTAIN {order}
  MAY CONTAIN {value}
  TYPE order INTEGER
  TYPE value STRING
}
XMLAttribute OC {
  SUBCLASS OF {XMLNode}
  MUST CONTAIN {value}
  TYPE value STRING
}
oc: XMLElement    oid: 1    name: PID    order: 1
oc: XMLElement    oid: 1.2  name: PaNa   order: 1  value: Jones Wm
oc: XMLAttribute  oid: 1.1  name: IDNum  value: PATID1234
[Figure: tree rooted at PID with children @IDNum (“PATID1234”), PaNa (“Jones Wm”), DTofBi (date: 1961-06-13), Sex (M)]
Schema-driven Mappings
• Repetition: separate tables
• Non-repeated sub-elements may be “inlined”
• Optionality: nullable fields
• Choice: multiple tables or universal table
• Order: explicit ordinal value
• Mixed content ignored
• Element reconstruction may require multi-table joins because of normalization
Fixed Mapping: Hybrid Inlining
<!ELEMENT PATIENT (Name, (OBX)*)>
<!ELEMENT OBX (Name, Value)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Value (#PCDATA)>
[Figure: DTD graph over PATIENT, *, OBX, Name, Value; resulting relations PATIENT and OBX]
• An element with in-degree = 0 or > 1 in the DTD graph → relation
• Elements with in-degree = 1 are inlined, except those reached by *
• Hybrid inlining additionally inlines non-* & non-recursive elements with in-degree > 1
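The three rules can be read as a decision over the DTD graph. The sketch below is one such reading, under stated assumptions: the graph is given as hypothetical `(parent, child, starred)` triples, and every recursive element is simply given its own relation, which is a simplification of how recursion is handled in the published algorithm.

```python
from collections import defaultdict


def hybrid_relations(dtd_edges):
    """Return the set of elements that get their own relation.

    dtd_edges: iterable of (parent, child, starred), where starred is True
    when the child is reached through a * (repetition) in the DTD graph.
    """
    elements = {e for parent, child, _ in dtd_edges for e in (parent, child)}
    in_degree = defaultdict(int)
    children = defaultdict(set)
    starred = set()
    for parent, child, star in dtd_edges:
        in_degree[child] += 1
        children[parent].add(child)
        if star:
            starred.add(child)

    def is_recursive(elem):
        # An element is recursive if it can reach itself in the DTD graph.
        stack, seen = list(children[elem]), set()
        while stack:
            cur = stack.pop()
            if cur == elem:
                return True
            if cur not in seen:
                seen.add(cur)
                stack.extend(children[cur])
        return False

    relations = set()
    for elem in elements:
        # Roots, *-reached elements and recursive elements are not inlined;
        # everything else, including shared non-* elements, is inlined.
        if in_degree[elem] == 0 or elem in starred or is_recursive(elem):
            relations.add(elem)
    return relations


# DTD graph for the PATIENT / OBX example above.
dtd_edges = [
    ("PATIENT", "Name", False),
    ("PATIENT", "OBX", True),
    ("OBX", "Name", False),
    ("OBX", "Value", False),
]
print(sorted(hybrid_relations(dtd_edges)))
```

Under this reading `Name` has in-degree 2 but is neither starred nor recursive, so it is inlined into both PATIENT and OBX, which agrees with the two relations named in the figure.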
Flexible Mapping: LegoDB
• Canonical mapping from XML Schema to relational
  • Every complex type → relation
• Semantics-preserving XML Schema to XML Schema transformations
    Ex: Inlining/outlining, union factorization/distribution, repetition split
• Greedy algorithm for choosing mapping
  • Mapping cost determined by query mix
  • Use relational optimizer to estimate cost of mapping
LegoDB Example
• Inline type in parent vs. outline type in its own relation
Outlined
  XML:
    type OBX = element value { Integer }, type Description
    type Description = element description { String }
  Relational:
    TABLE OBX (OBX_id INT, value STRING, parent_PATIENT INT)
    TABLE Description (Description_id INT, description STRING, parent_OBX INT)
Inlined
  XML:
    type OBX = element value { Integer }, element description { String }
  Relational:
    TABLE OBX (OBX_id INT, value STRING, description STRING, parent_PATIENT INT)
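To see what the two choices mean for a query, the sketch below loads both layouts into an in-memory SQLite database and asks for each observation's value and description. This is an illustration only: the table names `OBX_out` and `OBX_in` and the sample row are assumptions, `TEXT` stands in for the slide's `STRING`, and SQLite stands in for whatever relational system is used.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Outlined: Description lives in its own relation.
    CREATE TABLE OBX_out (OBX_id INT, value TEXT, parent_PATIENT INT);
    CREATE TABLE Description (Description_id INT, description TEXT, parent_OBX INT);
    -- Inlined: description is a column of OBX.
    CREATE TABLE OBX_in (OBX_id INT, value TEXT, description TEXT, parent_PATIENT INT);

    INSERT INTO OBX_out VALUES (10, '150', 1);
    INSERT INTO Description VALUES (20, 'Above high', 10);
    INSERT INTO OBX_in VALUES (10, '150', 'Above high', 1);
""")

# The outlined layout needs a join to pair a value with its description.
outlined = con.execute(
    "SELECT o.value, d.description FROM OBX_out AS o "
    "JOIN Description AS d ON d.parent_OBX = o.OBX_id "
    "WHERE o.parent_PATIENT = 1"
).fetchall()

# The inlined layout answers the same question from a single table.
inlined = con.execute(
    "SELECT value, description FROM OBX_in WHERE parent_PATIENT = 1"
).fetchall()

print(outlined == inlined)
```

Which layout is cheaper depends on the query mix: a workload that rarely touches descriptions may prefer the narrower outlined OBX table. That trade-off is what a cost-based choice of mapping is meant to capture.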
User-Defined Mappings
• No automatic translation from DTD or XML Schema
• Value-based semantics only
  • Document structure represented by keys/foreign keys
  • No explicit representation of document order or node identity
• Some support for mixed content
Oracle 9i
• Canonical mapping into user-defined object-relational tables
• Arbitrary XML input
  • XSLT preprocessing into multiple XML documents, load individually
• Stores XML documents in CLOBs (character large objects)
  • Permits full-text search
• Hybrid of canonical mapping & CLOB
<row>
  <Person>
    <Name><FN>…</FN><LN>…</LN></Name>
    <Addr><City>…</City></Addr>*
  </Person>
</row>
table PERSON(Name NAME, Alist ALIST)
object NAME(FN STR, LN STR)
table ALIST of ADDR
object ADDR(City CITY)
MS SQL Server
• Generic Edge technique with inlined scalar values
• User-defined decomposition of XML into multiple tables
  • XML data mapped into DOM
  • XPath expressions specify XML values to map into tables
    • Rows in table. Ex: /Customer/Orders → row in table ORDER
    • Columns in row. Ex: ./OrderDate → OrderDateColumn
• Text content modeled in CLOBs
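The row and column patterns can be imitated in a few lines. The sketch below is not SQL Server's API; it uses Python's ElementTree to show the idea, and the function name `shred_rows` and the sample document are assumptions. ElementTree evaluates paths relative to the root element, so the slide's /Customer/Orders is written here as `./Orders` against a `<Customer>` root.

```python
import xml.etree.ElementTree as ET


def shred_rows(xml_text, row_path, column_paths):
    """One dict per node matched by row_path; column paths are relative to it."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(row_path):
        rows.append({col: node.findtext(path) for col, path in column_paths.items()})
    return rows


# Hypothetical input document.
customer_xml = (
    "<Customer>"
    "<Orders><OrderDate>2001-08-30</OrderDate></Orders>"
    "<Orders><OrderDate>2001-09-02</OrderDate></Orders>"
    "</Customer>"
)

order_rows = shred_rows(
    customer_xml,
    row_path="./Orders",                              # one row per <Orders>
    column_paths={"OrderDateColumn": "./OrderDate"},  # one column per path
)
print(order_rows)
```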
Native Techniques
• Built from scratch
  • NatiX (University of Mannheim, Germany)
  • Xyleme (France)
  • Xindice (Apache, open source)
  • TIMBER (U. of Michigan; uses Shore)
• Re-tool existing systems to handle XML
  • Tamino: hierarchical database (ADABAS)
  • eXcelon: OODB
• Design efficient data structures for compact storage and fast access; data partitioning; indexing on both values and structure
NatiX
• Unit of storage = element
• Elements clustered to minimize page hits
• Inter-element pointers capture document structure
• Low-level algorithmic support for read/write/insert/delete operations
• No use of DTDs or XML Schema
Xyleme
• Data layout: based on NatiX
• Indexing: sophisticated indexing of text and elements
• Query support: XPath, XQuery, updates
• A data warehouse for XML content: store, classify, index, integrate, query and monitor massive volumes of XML content
• Semantic services: extensible thesauri and schema mappers that enable the system to go beyond simple indexing
Software AG Tamino
• Extends Adabas (nested relations)
• Indexing: value and structure
• Query support:
  • Full-text search operators
  • Queries return entire document or some projection of document
  • No construction of new XML values (unlike XQuery)
• Access control at the node level; transactions; triggers; backup/restore; compression; support for multi-media documents, e.g., video
TIMBER
• Underlying storage manager: SHORE
• Stores XML documents as trees in preorder
• Uses node pedigrees ([StartPos, EndPos, Level]) in an essential way to capture order and identity
• Supports indexes on tag, value, and pedigree
• Underlying algebra: TAX (tree algebra for XML)
• Underlying physical algebra: the key operator is the structural join, with its semi-, outer-, and anti- variants
• Optimal join ordering: a problem similar to that in RDBs
• Cost of constructing output is dominated by the cost of finding valid bindings from the DB for query variables
• Tree pattern and generalized tree pattern match
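A small sketch of how [StartPos, EndPos, Level] labels make structural relationships testable without walking the tree. It is an illustration under assumptions, not TIMBER's implementation: the function names are made up, positions count element start and end events only, and the join is a plain nested loop rather than the merge-based algorithms a real system would use.

```python
import xml.etree.ElementTree as ET


def pedigrees(xml_text):
    """Return (tag, start, end, level) tuples in document (preorder) order."""
    root = ET.fromstring(xml_text)
    out = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        slot = len(out)
        out.append(None)            # reserve the preorder position
        for child in elem:
            visit(child, level + 1)
        counter += 1
        out[slot] = (elem.tag, start, counter, level)

    visit(root, 0)
    return out


def structural_join(ancestors, descendants, parent_child=False):
    """Pairs (a, d) where a's interval strictly contains d's interval."""
    pairs = []
    for a in ancestors:
        for d in descendants:
            contains = a[1] < d[1] and d[2] < a[2]
            if contains and (not parent_child or d[3] == a[3] + 1):
                pairs.append((a, d))
    return pairs


labels = pedigrees("<PATIENT><PID><PaNa/></PID><OBX><ObsVa/></OBX></PATIENT>")
by_tag = {}
for label in labels:
    by_tag.setdefault(label[0], []).append(label)

# All (PATIENT, PaNa) ancestor-descendant pairs, then PATIENT's direct children.
print(structural_join(by_tag["PATIENT"], by_tag["PaNa"]))
print(structural_join(by_tag["PATIENT"], labels, parent_child=True))
```

Sorting by StartPos gives document order, a (StartPos, EndPos) pair identifies a node, and interval containment decides ancestry, so one label serves all three purposes.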
Other Native Systems
• Xindice http://xml.apache.org/xindice/
  • Query support: XPath as its query language and XML:DB XUpdate as its update language
  • APIs: XML:DB API for Java development; other languages through an available XML-RPC plugin
• GoXML
  • XQuery, full-text searching
  • Tree insert, replace and delete
Update Support
• XQuery does not support updates (yet…)
• How to update?
  • Flat streams: overwrite document
  • Non-native: SQL
  • Native: DOM, proprietary APIs
• But how do you know you have not violated the schema for which the mapping was defined?
  • Flat streams: re-parse document (how do we check integrity constraints?)
  • Non-native: need to understand the mapping and maintain integrity constraints
  • Native: supported in some systems (e.g., eXcelon)
Summary
• Non-native
  • Treats target system as black box
  • Mismatch between data models requires mapping
  • Supporting order-sensitive queries can be expensive
  • May require changes to schema to support new tags
  • Introduces redundancies & necessity of joins
  • No control of physical layout of data
• Native
  • No mismatch between logical data models
  • Focus on physical layout (clustering, indices, …)
  • Extensible: no schema or DTD needed
Conclusion
• XML data requires new storage features
  • Real applications depend upon XML data properties
  • Normalization is not always appropriate
• Schema of XML data should drive storage
  • Real-world data comes with its own schema
  • Schema as a basis for querying
• Handling mixed content is an important research problem
More Resources
• W3C Documents: http://www.w3.org/TR/
• W3C XML Query page: http://www.w3.org/XML/Query.html
• XML Query Implementations & Demos
  • Galax (AT&T, Lucent, and Avaya): http://www-db.research.bell-labs.com/galax/
  • Quip (Software AG): http://www.softwareag.com/developer/quip/
  • XQuery demo (Microsoft): http://131.107.228.20/xquerydemo/
  • Fraunhofer IPSI XQuery Prototype: http://xml.ipsi.fhg.de/xquerydemo/
  • XQengine (Fatdog): http://www.fatdog.com/
  • X-Hive: http://217.77.130.189/xquery/index.html
  • OpenLink: http://demo.openlinksw.com:8391/xquery/demo.vsp