Storing XML Sihem Amer-Yahia AT&T Labs - Research
What’s XML?
• W3C standard since 1998
• Subset of SGML (ISO Standard Generalized Markup Language)
• A data-description markup language, whereas HTML is a text-rendering markup language
• De facto format for data exchange on the Internet
  • Electronic commerce
  • Business-to-business (B2B) communication
XML: A Wire Protocol
• XML = a minimal wire representation for data and storage exchange
• A low-level wire transfer format, like IP in networking
• Minimal level of standardization for distributed components to interoperate
• Platform-, language-, and vendor-agnostic
• Easy to understand and extensible
• Data exchange enabled via XML transformations
Core XML Technologies
• XML validation: contract for data exchange
  • DTD, RELAX NG, XML Schema
• XML API: programmatic access to XML
  • DOM, SAX
• Transformation languages for data exchange and display
  • XSL, XSLT, XPath, XQuery
XML Data Model Highlights
• Tagged elements describe semantics of data
  • Easier to parse for a machine and for a human
• Elements may have attributes
• Elements can contain nested sub-elements
  • Sub-elements may themselves be tagged elements or character data
• Tree structure
  • Can capture any data model
  • Easier to navigate
An XML Document
    <?xml version="1.0"?>
    <!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
    <sigmodRecord>
      <issue>
        <volume>1</volume>
        <number>1</number>
        <articles>
          <article>
            <title>XML Research Issues</title>
            <initPage>1</initPage>
            <endPage>5</endPage>
            <authors>
              <author AuthorPosition="00">Tom Hanks</author>
            </authors>
          </article>
        </articles>
      </issue>
    </sigmodRecord>
Document Type Definition (DTD)
• An XML document may have a DTD
  • Grammar for describing document structure
• Terminology
  • well-formed: all tags are properly nested and closed
  • valid: the document has a DTD and conforms to it
• Validation is useful for data exchange
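The well-formed half of this terminology can be checked with Python's standard library alone. The sketch below is an illustration, not part of the original slides: `is_well_formed` is a hypothetical helper name, and it only tests well-formedness. Validity against a DTD needs a validating parser, which the standard library's `xml.etree.ElementTree` does not provide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML (tags properly nested and closed)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical inputs based on the <article> element used in the slides.
good = "<article><title>XML Research Issues</title></article>"
bad = "<article><title>XML Research Issues</article>"  # <title> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```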
W3C XML Schema
• Rich set of scalar types
  • User-defined simple types
• Complex types factor common structure
  • Sequences, choice, repetition, recursion of elements
• Sub-typing supports schema reuse
• Integrity constraints
DTD vs XML Schema
• DTD
    <!ELEMENT article (title, initPage, endPage, author)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT initPage (#PCDATA)>
    <!ELEMENT endPage (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
• XML Schema
    <xsd:element name="article" minOccurs="0" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="initPage" type="xsd:string"/>
          <xsd:element name="endPage" type="xsd:string"/>
          <xsd:element name="author" type="xsd:string"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
XML API: DOM
• Hierarchical (tree) object model for XML documents
• Associates a list of children (or a text value) with every node
• Preserves the sequence of elements in the XML document
• May be expensive to materialize for a large XML collection
DOM Features
• DOM API supports:
  • Navigation: access all attribute nodes, children, first/last child, next/previous sibling, parent, …
  • Creation: create new node
  • Modification: append, insert, remove, replace node
• DOM parser support for validation
  • Most support DTD
  • Some support XML Schema
  • See: http://www.w3.org/XML/Schema
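A minimal sketch of the three DOM capabilities listed above (navigation, creation, modification), using Python's `xml.dom.minidom` as one possible DOM implementation. The choice of `minidom` and the abbreviated `DOC` string are assumptions for illustration. The document is written without whitespace between tags so that `firstChild` is an element rather than a whitespace text node.

```python
from xml.dom.minidom import parseString

# Abbreviated article from the sample document (assumption: whitespace removed).
DOC = (
    "<article><title>XML Research Issues</title>"
    "<initPage>1</initPage><endPage>5</endPage></article>"
)

dom = parseString(DOC)          # materializes the whole tree in memory
article = dom.documentElement

# Navigation: first/last child and next sibling.
first = article.firstChild      # <title>
last = article.lastChild        # <endPage>
sibling = first.nextSibling     # <initPage>

# Creation: build a new element with an attribute and a text child.
author = dom.createElement("author")
author.setAttribute("AuthorPosition", "00")
author.appendChild(dom.createTextNode("Tom Hanks"))

# Modification: append the new node, remove an existing one.
article.appendChild(author)
article.removeChild(sibling)

names = [node.tagName for node in article.childNodes]
print(names)
```

Because the full tree is held in memory, this style is convenient for edits but matches the slide's warning about materialization cost for large collections.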
XML API: SAX
• Event-driven: fires an event for every open tag/end tag
• Does not require full parsing: reads the XML document in streaming fashion
• Read-only interface
• Consumes less memory than DOM
• Could be significantly faster than DOM
SAX Features
• Stack-oriented (LIFO) access
• Read-once processing of very large documents
  • E.g., load XML document into a storage system
• SAX parser support for validation
  • Most support DTD
  • Microsoft XML Parser (MSXML) supports XML Schema
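The event-driven, stack-oriented style can be sketched with Python's `xml.sax`. The handler below is an illustration under stated assumptions: `PathCollector` is a hypothetical class name, and the document is an abbreviated copy of the sample. It keeps only a stack of open tags and the text of `<title>` elements, so memory use is bounded by document depth rather than document size.

```python
import xml.sax

# Abbreviated sample document (assumption: trimmed for brevity).
DOC = """<sigmodRecord><issue><articles><article>
<title>XML Research Issues</title>
<initPage>1</initPage><endPage>5</endPage>
</article></articles></issue></sigmodRecord>"""


class PathCollector(xml.sax.ContentHandler):
    """Collect (path, text) for every <title>, using a LIFO stack of open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []     # names of currently open elements
        self.titles = []    # (path, text) pairs found so far
        self._buf = []      # text pieces of the <title> being read

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name == "title":
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.stack and self.stack[-1] == "title":
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            path = "/".join(self.stack)
            self.titles.append((path, "".join(self._buf).strip()))
        self.stack.pop()


handler = PathCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
print(handler.titles)
```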
XSL
• Styling is rendering information for consumption
• XSL = a language to express styling (“stylesheet language”)
• Two components of a stylesheet
  • Transform: maps a source tree to a target tree using template rules expressed in XSLT
  • Format: controls appearance
XSLT
• XPath acts as the pattern language
• Primary goal is to transform XML vocabularies into XSL formatting vocabularies
• But it is often adequate for many other transformation needs
XPATH
• [www.w3.org/TR/xpath]
• Common sub-language of
  • XSLT: a loosely typed "scripting" language
  • XQuery: a strongly typed query language
• Syntax for tree navigation and node selection
• Navigation is described using location paths
XPATH
• .   : current node
• ..  : parent of the current node
• /   : root node, or a separator between steps in a path
• //  : descendants of the current node
• @   : attributes of the current node
• *   : “any” (node with unrestricted name)
• []  : a predicate for a given step
• [n] : the element with the given ordinal number from a list of elements
XPATH 2.0
• Arithmetic: Expr (+ | - | * | div | mod) Expr
• Logical: Expr (or | and) Expr, not(Expr)
• Comparison: Expr (= | != | <= | >=) Expr
• Conditional: if Expr then Expr else Expr
• Iteration: for Var in Expr return Expr
• Quantified: (some | every) Var in Expr satisfies Expr
XPATH Example
• List the titles of articles that have an author named “Tom Hanks”
    //article[.//author="Tom Hanks"]/title
• Find the titles of articles authored by “Tom Hanks” in volume 1
    //issue[volume="1"]/articles/article[.//author="Tom Hanks"]/title
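The two queries above can be approximated with Python's `xml.etree.ElementTree`, which implements only a limited XPath subset. Treat this as a sketch, not a full XPath engine. The nested `author` test is done in Python, relative to each article, because the subset has no path-valued predicates. The second article ("Another Paper" by "Someone Else") is invented data added so that the filter has something to exclude, and `titles_by_author` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DOC = """<sigmodRecord>
  <issue><volume>1</volume><number>1</number>
    <articles>
      <article><title>XML Research Issues</title>
        <authors><author AuthorPosition="00">Tom Hanks</author></authors>
      </article>
      <article><title>Another Paper</title>
        <authors><author AuthorPosition="00">Someone Else</author></authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>"""

root = ET.fromstring(DOC)


def titles_by_author(articles, name):
    """Keep articles with a descendant <author> equal to name; return their titles."""
    return [
        article.findtext("title")
        for article in articles
        # ".//author" is evaluated relative to this article, not the document root.
        if any(author.text == name for author in article.iterfind(".//author"))
    ]


# //article[.//author="Tom Hanks"]/title
all_titles = titles_by_author(root.iterfind(".//article"), "Tom Hanks")

# //issue[volume="1"]/articles/article[.//author="Tom Hanks"]/title
vol1_titles = titles_by_author(
    root.iterfind(".//issue[volume='1']/articles/article"), "Tom Hanks"
)

print(all_titles, vol1_titles)
```

The relative step matters: a predicate that starts from the document root would be true for every article as soon as any article in the document has the author.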
Beyond XPATH
• Joining, aggregating XML from multiple documents
• Constructing new XML
• Recursive processing of recursive XML data
• Supported by XSLT & XQuery
• Differences between XSLT & XQuery
  • Safety: XQuery enforces input & output types
  • Compositionality: XQuery maps XML to XML; XSLT maps XML to anything
XQuery
• Functional language
  • Query is an expression
  • Expressions are recursively constructed
• Includes XPath as a sub-language
• SQL-like FLWR expression
• Borrows features from many other languages: XQL, XML-QL, ML, ...
XQuery: FLWR expression
• FOR/LET clauses
  • Ordered list of tuples of bound variables
• WHERE clause
  • Pruned list of tuples of bound variables
• RETURN clause
  • Instance of XML Query data model
XQuery: Example
List the titles of the articles authored by “Tom Hanks”
Query Expression
    for $b in document("sigmodRecord.xml")//article
    where $b//author = "Tom Hanks"
    return <title>{ $b/title/text() }</title>
Query Result
    <title>XML Research Issues</title>
XQuery: Example
List the articles authored by “Tom Hanks”.
Query Expression
    <articles>
    {
      for $b in document("sigmodRecord.xml")//article
      where $b//author = "Tom Hanks"
      return $b
    }
    </articles>
Query Result
    <articles>
      <article>
        <title>XML:Where are we heading for?</title>
        <initPage>6</initPage>
        <endPage>10</endPage>
        <authors><author AuthorPosition="00">Tom Hanks</author></authors>
      </article>
    </articles>
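The FOR/LET, WHERE, and RETURN stages of a FLWR expression can be imitated in Python to make the tuple stream explicit. This is an analogy rather than an XQuery implementation: `flwr_articles` is a hypothetical function, the input is an in-memory string instead of `sigmodRecord.xml`, and the second article is invented so the WHERE clause prunes something.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<sigmodRecord><issue><articles>"
    "<article><title>XML Research Issues</title>"
    "<authors><author>Tom Hanks</author></authors></article>"
    "<article><title>Another Paper</title>"
    "<authors><author>Someone Else</author></authors></article>"
    "</articles></issue></sigmodRecord>"
)


def flwr_articles(doc_root, author_name):
    """Mimic: <articles>{ for $b in //article where $b//author = name return $b }</articles>"""
    result = ET.Element("articles")          # constructed wrapper element
    # FOR: one tuple per article, in document order.
    # LET: bind the list of author names alongside each article.
    tuples = [(b, [a.text for a in b.iter("author")]) for b in doc_root.iter("article")]
    # WHERE: prune tuples that fail the condition; order is preserved.
    kept = [(b, authors) for b, authors in tuples if author_name in authors]
    # RETURN: build the output from each surviving tuple.
    for b, _ in kept:
        result.append(b)
    return result


root = ET.fromstring(DOC)
answer = flwr_articles(root, "Tom Hanks")
print(ET.tostring(answer, encoding="unicode"))
```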
Where’s the XML Data?
[Figure: diagram with the labels "Business Application Logic" (Wrap: SOAP/CORBA/Java RMI), "Legacy databases" (Export), "Warehouse XML data" (Import), and "Minimal result" (View)]
XML and Databases
• Data stored in SQL databases needs to be published in XML for data exchange
  • Specification schemes for publishing needed
  • Efficient publishing algorithms needed
• Storage and retrieval of XML documents
  • Need to support mapping schemes
  • Need to support data-manipulation XML APIs
Storing XML
• Storage is the foundation of efficient XML processing
• XML demands its own storage techniques
  • Characteristics of XML data: optional elements & values, repetition, choice, inherent order, large text fragments, mixed content
  • Characteristics of XML queries: document order & structure, full-text search, transformation
• Goals of tutorial
  • Existing storage features for XML
  • New storage features for XML
Outline
• Introduction
  • XML Documents
  • XML Queries
• Existing Storage Techniques
  • Non-native
  • Native
• Physical Storage Features for XML
I. Introduction
Classes of XML Documents
• Structured
  • “Un-normalized” relational data
  • Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
• Mixed
  • Structured data embedded in large text fragments
  • Ex: on-line manuals, transcripts, tax forms
• An application may process XML in both classes
  • Ex: SOAP messages, where the header is structured and the payload is mixed
Structured Data: HL7 Lab Report
Health-care industry data-exchange format
    <HL7>
      <PATIENT>
        <PID IDNum="PATID1234">
          <PaNa><FaNa>Jones</FaNa><GiNa>William</GiNa></PaNa>
          <DTofBi><date>1961-06-13</date></DTofBi>
          <Sex>M</Sex>
        </PID>
        <OBX SetID="1">
          <ObsVa>150</ObsVa>
          <ObsId>Na</ObsId>
          <AbnFl>Above high</AbnFl>
        </OBX>
        ...
Queries on Structured Data
• Analogs of SQL
• Select-Project-Join, sort by value
  • Ex: Return admission records of patients discharged on 8/30/01, sorted by family and given names
• Grouping & schema transformation
  • Ex: Return per-patient record of admission, lab reports, doctors’ observations
Mixed Data: Library of Congress
Documents of U.S. Legislation
    <bill bill-stage="Introduction">
      <congress>110th CONGRESS</congress>
      <session>1st Session</session>
      <legis-num>H.R. 133</legis-num>
      <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
      <action date="June 5, 2008">
        <action-desc>
          <sponsor>Mr. English</sponsor> (for himself and
          <cosponsor>Mr. Coyne</cosponsor>) introduced the following bill;
          which was referred to the
          <committee-name>Committee on Financial Services</committee-name> ...
        </action-desc>
Queries on Mixed Data
• Full-text search operators
  • Ex: Find all <bill>s where "striking" & "amended" occur with at most 6 intervening words
• Queries on structure & text
  • Ex: Return the <text> element containing both "exemption" & "social security", plus the preceding & following <text> elements
• Queries that span (ignore) structure
  • Ex: Return the <bill> that contains “referred to the Committee on Financial Services”
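A proximity operator of the kind used in the first example can be sketched as follows. The code is illustrative only: `within` is a hypothetical function, the `<bill>` fragment is invented, and tokenization is a plain `\w+` split. Text is gathered with `itertext()` so the search spans (ignores) element boundaries, as the third query class requires.

```python
import re
import xml.etree.ElementTree as ET


def within(text, word1, word2, max_gap):
    """True if word1 and word2 occur with at most max_gap words between them."""
    words = re.findall(r"\w+", text.lower())
    pos1 = [i for i, w in enumerate(words) if w == word1]
    pos2 = [i for i, w in enumerate(words) if w == word2]
    # abs(i - j) - 1 is the number of intervening words between two positions.
    return any(abs(i - j) - 1 <= max_gap for i in pos1 for j in pos2 if i != j)


# Invented fragment: markup splits the phrase, but the text search ignores it.
BILL = "<bill><text>by <quote>striking</quote> the second sentence as amended</text></bill>"
bill = ET.fromstring(BILL)
flat = " ".join(bill.itertext())   # concatenated character data, tags dropped

print(within(flat, "striking", "amended", 6))
print(within(flat, "striking", "amended", 3))
```

Here "striking" and "amended" have four intervening words, so the first call succeeds and the second does not.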
Properties of XML Data
• Variance in structured content
  • Elements of the same type have different structure
  • Nested sub-element might depend on parent
  • Direct access to sub-element not required
• Order significant in sequence & mixed content
• Structured data embedded in text
• Schema known a priori or “open content model”
• Desirable: explicit support in storage system
Properties of Queries
• Query expressions depend on data properties
• Variance
  • /PATIENT/(SURGERY | CHECK-UP)
• Document order: XPath axes
  • /bill/co-sponsor[./text() = "Mrs. Clinton" and following-sibling::co-sponsor/text() = "Mr. Torricelli"]
• Node identity: equality, union/intersect/except
• If these are not supported in the storage system, operators are semantically incorrect or incomplete
II. Existing Storage Techniques
Storage Techniques
• Non-native
  • (Object-)relational, OO, LDAP directories
  • Indexing, recovery, transactions, updates, optimizers are already available
  • Mapping from XML to target data model necessary
  • Captures variance in structured content
  • No support for mixed content
  • Recovering XML documents is expensive!
• Native
  • Logical data model is XML
  • Physical storage features designed for XML
Non-native Techniques
• Generic
  • Mapping from XML data to relational tables
  • Models XML as a tree: semi-structured approach
  • Does not use DTD or XML Schema
• Schema-driven
  • Mapping from schema constructs to relational
  • Fixed mapping from DTD to relational schema
  • Flexible mapping from XML Schema to relational
• User-defined
  • Labor-intensive
Generic Mappings
• Edge relation
  • Store all edges in one table
  • Scalar values stored in a separate table
• Attribute relations
  • Horizontal partition of the Edge relation
  • Scalar values inlined in the same table
• Universal relation
  • Full outer join, redundancy
• Captures node identity & document order
• Element reconstruction requires multiple joins
Edge Relation Example
[Figure: edge-labeled tree, stored as an Edge table and a Value table]
    &0 --HL7--> &1 --PATIENT--> &2
    &2 --PID--> &3
    &2 --OBX--> &4 …
    &3 --@IDNum--> &5 = PATID1234
    &3 --PaNa--> &6 = “Jones Wm”
    &3 --DTofBi--> &7 --date--> &8 = 1961-06-13
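A minimal sketch of the edge-relation idea, assuming a simple schema that the slides do not spell out: `edge_tab(source, ordinal, label, target)` and `value_tab(node, val)` are hypothetical table layouts, `shred` is a hypothetical helper, and node ids are assigned in visit order, so they differ from the &-numbers in the figure. SQLite stands in for the relational store.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = (
    '<HL7><PATIENT><PID IDNum="PATID1234"><PaNa>Jones Wm</PaNa>'
    "<DTofBi><date>1961-06-13</date></DTofBi></PID></PATIENT></HL7>"
)


def shred(root):
    """Turn an element tree into edge rows and value rows (node 0 is a virtual root)."""
    edges, values = [], []
    counter = 0

    def new_id():
        nonlocal counter
        counter += 1
        return counter

    def visit(elem, parent, ordinal):
        node = new_id()
        edges.append((parent, ordinal, elem.tag, node))   # ordinal keeps document order
        k = 0
        for name, val in elem.attrib.items():
            k += 1
            attr = new_id()
            edges.append((node, k, "@" + name, attr))
            values.append((attr, val))
        if elem.text and elem.text.strip():
            values.append((node, elem.text.strip()))
        for child in elem:
            k += 1
            visit(child, node, k)

    visit(root, 0, 1)
    return edges, values


edges, values = shred(ET.fromstring(DOC))
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge_tab (source INT, ordinal INT, label TEXT, target INT)")
con.execute("CREATE TABLE value_tab (node INT, val TEXT)")
con.executemany("INSERT INTO edge_tab VALUES (?, ?, ?, ?)", edges)
con.executemany("INSERT INTO value_tab VALUES (?, ?)", values)

# /HL7/PATIENT/PID/PaNa needs one self-join of the edge table per path step.
rows = con.execute(
    """
    SELECT v.val
    FROM edge_tab e1
    JOIN edge_tab e2 ON e2.source = e1.target
    JOIN edge_tab e3 ON e3.source = e2.target
    JOIN edge_tab e4 ON e4.source = e3.target
    JOIN value_tab v ON v.node = e4.target
    WHERE e1.source = 0 AND e1.label = 'HL7' AND e2.label = 'PATIENT'
      AND e3.label = 'PID' AND e4.label = 'PaNa'
    """
).fetchall()
print(rows)
```

The four-way self-join for a four-step path illustrates why element reconstruction over a generic mapping requires multiple joins.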
Generic Mappings: LDAP Directories
• Flexible schema; easy schema evolution
• Supports heterogeneous elements with optional values
• Captures node identity & document order
• Query language captures a subset of XPath
LDAP Example
    XMLElement OC {
      SUBCLASS OF {XMLNode}
      MUST CONTAIN {order}
      MAY CONTAIN {value}
      TYPE order INTEGER
      TYPE value STRING
    }
    XMLAttribute OC {
      SUBCLASS OF {XMLNode}
      MUST CONTAIN {value}
      TYPE value STRING
    }
Directory entries:
    oc: XMLElement    oid: 1    name: PID    order: 1
    oc: XMLAttribute  oid: 1.1  name: IDNum  value: PATID1234
    oc: XMLElement    oid: 1.2  name: PaNa   order: 1  value: Jones Wm
[Figure: tree for PID with children @IDNum (“PATID1234”), PaNa (“Jones Wm”), DTofBi (date: 1961-06-13), Sex (M)]
Schema-driven Mappings
• Repetition: separate tables
• Non-repeated sub-elements may be “inlined”
• Optionality: nullable fields
• Choice: multiple tables or universal table
• Order: explicit ordinal value
• Mixed content ignored
• Element reconstruction may require multi-table joins because of normalization
Fixed Mapping: Hybrid Inlining
    <!ELEMENT PATIENT (Name, (OBX)*)>
    <!ELEMENT OBX (Name, Value)>
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT Value (#PCDATA)>
[Figure: DTD graph over PATIENT, OBX (reached through *), Name, and Value, with the resulting relations PATIENT and OBX]
• Elements with in-degree = 0 or > 1 in the DTD graph → relation
• Elements with in-degree = 1 are inlined, except those reached by *
• Exception: non-* & non-recursive elements with in-degree > 1 are inlined
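The three rules can be applied mechanically to the DTD graph. The sketch below is a simplified reading of them, not the published algorithm: the edge-list encoding and the name `hybrid_relations` are assumptions, and every element on a cycle is given its own relation, which may create more relations than a careful treatment of mutual recursion would.

```python
from collections import defaultdict

# DTD graph as (parent, child, reached_through_star) edges, from the DTD above.
DTD_EDGES = [
    ("PATIENT", "Name", False),
    ("PATIENT", "OBX", True),     # (OBX)* : repeated child
    ("OBX", "Name", False),
    ("OBX", "Value", False),
]


def hybrid_relations(edges):
    """Return the sorted element names that get their own relation; the rest are inlined."""
    elements = {e for parent, child, _ in edges for e in (parent, child)}
    in_degree = defaultdict(int)
    children = defaultdict(set)
    starred = set()
    for parent, child, star in edges:
        in_degree[child] += 1
        children[parent].add(child)
        if star:
            starred.add(child)

    def is_recursive(elem):
        # An element is recursive if it can reach itself in the DTD graph.
        seen, stack = set(), list(children[elem])
        while stack:
            node = stack.pop()
            if node == elem:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(children[node])
        return False

    relations = set()
    for elem in elements:
        # Roots, *-reached elements, and recursive elements become relations.
        # Everything else, including shared non-recursive elements such as
        # Name (in-degree 2), is inlined into its parents.
        if in_degree[elem] == 0 or elem in starred or is_recursive(elem):
            relations.add(elem)
    return sorted(relations)


print(hybrid_relations(DTD_EDGES))
```

For the sample DTD this yields relations for PATIENT and OBX, with Name and Value inlined as columns.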
Flexible Mapping: LegoDB
• Canonical mapping from XML Schema to relational
  • Every complex type → relation
• Semantics-preserving XML Schema to XML Schema transformations
  • Ex: inlining/outlining, union factorization/distribution, repetition split
• Greedy algorithm for choosing mapping
  • Mapping cost determined by query mix
  • Use relational optimizer to estimate cost of mapping
LegoDB Example
• Inline type in parent vs. outline type in own relation
Outlined
  XML:
    type OBX = element value { Integer }, type Description
    type Description = element description { String }
  Relational:
    TABLE OBX (OBX_id INT, value STRING, parent_PATIENT INT)
    TABLE Description (Description_id INT, description STRING, parent_OBX INT)
Inlined
  XML:
    type OBX = element value { Integer }, element description { String }
  Relational:
    TABLE OBX (OBX_id INT, value STRING, description STRING, parent_PATIENT INT)
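The trade-off between the two mappings shows up in the queries needed to reassemble an OBX element. The sketch below loads the tables from the slide into SQLite. The table-name suffixes `_out` and `_in`, the row values, and the parent id 10 are invented for illustration, and `TEXT` stands in for the slide's `STRING`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Outlined mapping: Description lives in its own relation.
    CREATE TABLE OBX_out (OBX_id INT, value TEXT, parent_PATIENT INT);
    CREATE TABLE Description (Description_id INT, description TEXT, parent_OBX INT);
    -- Inlined mapping: description is a column of OBX.
    CREATE TABLE OBX_in (OBX_id INT, value TEXT, description TEXT, parent_PATIENT INT);

    INSERT INTO OBX_out VALUES (1, '150', 10);
    INSERT INTO Description VALUES (1, 'Above high', 1);
    INSERT INTO OBX_in VALUES (1, '150', 'Above high', 10);
    """
)

# Outlined: reconstructing value + description needs a join.
outlined = con.execute(
    "SELECT o.value, d.description "
    "FROM OBX_out o JOIN Description d ON d.parent_OBX = o.OBX_id"
).fetchall()

# Inlined: the same information comes from a single-table scan.
inlined = con.execute("SELECT value, description FROM OBX_in").fetchall()

print(outlined, inlined)
```

Which layout is cheaper depends on the query mix: a workload that rarely touches descriptions may favor the outlined form, which is the kind of decision the cost-based search is meant to make.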
User-Defined Mappings
• No automatic translation from DTD or XML Schema
• Annotated schemas or special-purpose queries
• Value-based semantics only
  • Document structure represented by keys/foreign keys
  • No explicit representation of document order or node identity
• Some support for mixed content
Oracle 9i
• Canonical mapping into user-defined object-relational tables
• Arbitrary XML input
  • XSLT preprocessing into multiple XML documents, loaded individually
• Stores XML documents in CLOBs (character large objects)
  • Permits full-text search
• Hybrid of canonical mapping & CLOB
    <row>
      <Person>
        <Name><FN>…</FN><LN>…</LN></Name>
        <Addr><City>…</City></Addr>*
      </Person>
    </row>
    table PERSON(Name NAME, Alist ALIST)
    object NAME(FN STR, LN STR)
    table ALIST of ADDR
    object ADDR(City CITY)
IBM DB2 XML Extender
• Declarative decomposition of arbitrary XML
• Pure relational mapping (no object features used)
    <element_node name="Order">
      <table name="order_tab"/>
      <table name="part_tab"/>
      <condition>order_tab.order_key = part_tab.order_key</condition>
      <attribute_node name="key">
        <table name="order_tab"/>
        <column name="order_key"/>
      </attribute_node>
    </element_node>
• Mixed content: CLOBs + side tables for indexing structured data embedded in text