Explore the fundamentals and advanced concepts of XML, including technologies like XML Schema, DOM, XSL, and XQuery. Learn how XML enables data exchange and storage, validation, and transformation, making it a vital tool for modern business communication.
Storing XML Sihem Amer-Yahia AT&T Labs - Research
What’s XML? • W3C standard since 1998 • Subset of SGML (ISO Standard Generalized Markup Language) • Data-description markup language, whereas HTML is a text-rendering markup language • De facto format for data exchange on the Internet • Electronic commerce • Business-to-business (B2B) communication
XML: A Wire Protocol • XML = a minimal wire representation for data storage and exchange • A low-level wire transfer format, like IP in networking • Minimal level of standardization for distributed components to interoperate • Platform-, language- and vendor-agnostic • Easy to understand and extensible • Data exchange enabled via XML transformations
Core XML Technologies • XML Validation: Contract for Data Exchange • DTD, RELAX NG, XML Schema • XML API: Programmatic Access to XML • DOM, SAX • Transformation Languages for Data Exchange and Display • XSL, XSLT, XPATH, XQuery
XML Data Model Highlights • Tagged elements describe semantics of data • Easier to parse for a machine and for a human • Element may have attributes • Element can contain nested sub-elements • Sub-elements may themselves be tagged elements or character data • Tree structure • Can capture any data-model • Easier to navigate
An XML Document
<?xml version="1.0"?>
<!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
<sigmodRecord>
  <issue>
    <volume>1</volume>
    <number>1</number>
    <articles>
      <article>
        <title>XML Research Issues</title>
        <initPage>1</initPage>
        <endPage>5</endPage>
        <authors>
          <author AuthorPosition="00">Tom Hanks</author>
        </authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>
Document Type Definition (DTD) • An XML document may have a DTD • Grammar for describing document structure • Terminology • well-formed: if tags are correctly closed • valid: if it has a DTD and conforms to it • Validation useful for data exchange
W3C XML Schema • Rich set of scalar types • user-defined simple types • Complex types factor common structure • Sequences, choice, repetition, recursion of elements • Sub-typing supports schema reuse • Integrity constraints
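The well-formed/valid distinction can be illustrated with a minimal Python sketch. It only checks well-formedness: the standard-library `xml.etree.ElementTree` parser does not validate against a DTD, so validity would need a validating parser. The helper name `is_well_formed` and the two sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, i.e. tags are properly closed and nested."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples: the second one never closes <title>.
good = "<article><title>XML Research Issues</title></article>"
bad = "<article><title>XML Research Issues</article>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document can pass this check and still be invalid, for example if it omits an element that its DTD requires.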
DTD vs XML Schema
• DTD
<!ELEMENT article (title, initPage, endPage, author)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT initPage (#PCDATA)>
<!ELEMENT endPage (#PCDATA)>
<!ELEMENT author (#PCDATA)>
• XML Schema
<xsd:element name="article" minOccurs="0" maxOccurs="unbounded">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="initPage" type="xsd:string"/>
      <xsd:element name="endPage" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
XML API: DOM • Hierarchical (tree) object model for XML documents • Associate a list of children with every node (or text value) • Preserves sequence of elements in XML document • May be expensive to materialize for a large XML collection
DOM Features • DOM API supports: • Navigation: access all attribute nodes, children, first/last child, next/previous sibling, parent,… • Creation: create new node • Modification: append, insert, remove, replace node • DOM parser support for validation • Most support DTD • Some support XML Schema • See : http://www.w3.org/XML/Schema
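The navigation, creation and modification operations listed above can be sketched with Python's standard-library `xml.dom.minidom`, one of many DOM implementations. The small input document is a cut-down version of the sigmodRecord article; nothing here is specific to the parsers the slides had in mind.

```python
from xml.dom import minidom

# The whole tree is materialized in memory before any access.
doc = minidom.parseString(
    "<article><title>XML Research Issues</title>"
    "<initPage>1</initPage><endPage>5</endPage></article>"
)
article = doc.documentElement

# Navigation: children, first child, next sibling.
child_names = [node.tagName for node in article.childNodes]
first = article.firstChild
second = first.nextSibling

# Creation: build a new element with an attribute and a text child.
author = doc.createElement("author")
author.setAttribute("AuthorPosition", "00")
author.appendChild(doc.createTextNode("Tom Hanks"))

# Modification: append the new node, remove an existing one.
article.appendChild(author)
article.removeChild(article.getElementsByTagName("endPage")[0])

result = article.toxml()
print(result)
```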
XML API: SAX • Event-driven: fire an event for every open tag/end tag • Does not require full parsing: reads XML document in streaming fashion • Read-only interface • Consumes less memory than DOM • Could be significantly faster than DOM
SAX Features • Stack-oriented (LIFO) access • Read-once processing of very large documents • E.g., load XML document into a storage system • SAX parser support for validation • Most support DTD • Microsoft XML Parser (MSXML) supports XML Schema
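The event-driven, stack-oriented style can be sketched with Python's standard-library `xml.sax`. The handler below keeps a LIFO stack of open tags and emits one (path, value) row per text value, roughly what a loader feeding a storage system might do. The class name `PathHandler` and the row layout are assumptions made for this example.

```python
import xml.sax

class PathHandler(xml.sax.ContentHandler):
    """Collects (path, text) rows in one streaming pass, using a stack of open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, innermost last
        self.rows = []    # (root-to-element path, text value)
        self._text = []   # character data seen since the last tag event

    def startElement(self, name, attrs):
        self.stack.append(name)
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        value = "".join(self._text).strip()
        if value:
            self.rows.append(("/".join(self.stack), value))
        self._text = []
        self.stack.pop()

handler = PathHandler()
xml.sax.parseString(b"<issue><volume>1</volume><number>1</number></issue>", handler)
print(handler.rows)
```

Only the stack and the current text buffer are held in memory, which is why this style suits read-once processing of large documents.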
XSL • Styling is rendering information for consumption • XSL = A language to express styling (“Stylesheet language”) • Two components of a stylesheet • Transform: Source to a target tree using template rules expressed in XSLT • Format: Controls appearance
XSLT • XPATH acts as the pattern language • Primary goal is to transform XML vocabularies to XSL formatting vocabularies • But, often adequate for many transformation needs
XPATH • [www.w3.org/TR/xpath] • Common sub-language of • XSLT: a loosely typed "scripting" language • XQuery: a strongly typed query language • Syntax for tree navigation and node selection • Navigation is described using location paths
XPATH • . : current node • .. : parent of the current node • / : root node, or a separator between steps in a path • // : descendants of the current node • @ : attributes of the current node • * : "any" (node with unrestricted name) • [] : a predicate for a given step • [n] : the element with the given ordinal number from a list of elements
XPATH 2.0 • Arithmetic: Expr (+ | - | * | div | mod) Expr • Logical: Expr (or | and) Expr, not(Expr) • Comparison: Expr (= | != | <= | >=) Expr • Conditional: if Expr then Expr else Expr • Iteration: for Var in Expr return Expr • Quantified: (some | every) Var in Expr satisfies Expr
XPATH Example • List the titles of articles that have “Tom Hanks” as an author • //article[.//author="Tom Hanks"]/title • Find the titles of articles authored by “Tom Hanks” in volume 1 • //issue[volume="1"]/articles/article[.//author="Tom Hanks"]/title
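The second query can be approximated in Python as a rough sketch. The standard-library `xml.etree.ElementTree` implements only a limited subset of XPath, so the volume test is written as a path predicate while the author test is done in plain Python; a full XPath engine would evaluate the whole expression directly. The function name `titles_by` is made up for this example.

```python
import xml.etree.ElementTree as ET

SIGMOD = """<sigmodRecord><issue><volume>1</volume><number>1</number><articles>
<article><title>XML Research Issues</title><initPage>1</initPage><endPage>5</endPage>
<authors><author AuthorPosition="00">Tom Hanks</author></authors></article>
</articles></issue></sigmodRecord>"""

root = ET.fromstring(SIGMOD)

def titles_by(author: str, volume: str) -> list:
    """Titles of articles by `author` in the issue whose <volume> equals `volume`."""
    # [volume='...'] is a child-text predicate, part of ElementTree's XPath subset.
    articles = root.findall(f".//issue[volume='{volume}']/articles/article")
    # Relative descendant test, the counterpart of [.//author="..."] in full XPath.
    return [a.findtext("title") for a in articles
            if any(x.text == author for x in a.iter("author"))]

print(titles_by("Tom Hanks", "1"))
```

Both tests are relative to the node being filtered; a predicate that starts with `/` or `//` would instead be evaluated from the document root and would not restrict the result to the current article.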
Beyond XPATH • Joining, aggregating XML from multiple documents • Constructing new XML • Recursive processing of recursive XML data • Supported by XSLT & XQuery • Differences between XSLT & XQuery • Safety: XQuery enforces input & output types • Compositionality : XQuery maps XML to XML; XSLT maps XML to anything
XQuery • Functional language • Query is an expression • Expressions are recursively constructed • Includes XPATH as a sub-language • SQL-like FLWR expression • Borrows features from many other languages: XQL, XML-QL, ML,..
XQuery: FLWR expression • FOR/LET Clauses • Ordered list of tuples of bound variables • WHERE Clause • Pruned list of tuples of bound variables • RETURN Clause • Instance of XML Query data model
XQuery: Example • List the titles of the articles authored by “Tom Hanks”
Query Expression
for $b in doc("sigmodRecord.xml")//article
where $b//author = "Tom Hanks"
return <title>{ $b/title/text() }</title>
Query Result
<title>XML Research Issues</title>
XQuery: Example • List the articles authored by “Tom Hanks”
Query Expression
<articles> {
  for $b in doc("sigmodRecord.xml")//article
  where $b//author = "Tom Hanks"
  return $b
} </articles>
Query Result
<articles>
  <article>
    <title>XML:Where are we heading for?</title>
    <initPage>6</initPage>
    <endPage>10</endPage>
    <authors><author AuthorPosition="00">Tom Hanks</author></authors>
  </article>
</articles>
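For readers without an XQuery processor at hand, the FOR / WHERE / RETURN pipeline can be mimicked in Python. This is only an analogy under stated assumptions: the second article and its author "Ann Author" are invented test data, and the standard-library `xml.etree.ElementTree` stands in for the XML Query data model.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<sigmodRecord>"
    "<article><title>XML Research Issues</title>"
    "<authors><author>Tom Hanks</author></authors></article>"
    "<article><title>A Second Article</title>"      # invented entry
    "<authors><author>Ann Author</author></authors></article>"
    "</sigmodRecord>"
)

# FOR: bind $b to every article, in document order.
bindings = list(doc.iter("article"))

# WHERE: prune the bindings that fail the condition.
kept = [b for b in bindings
        if any(a.text == "Tom Hanks" for a in b.iter("author"))]

# RETURN: construct new XML around the surviving bindings.
result = ET.Element("articles")
result.extend(kept)
xml_out = ET.tostring(result, encoding="unicode")
print(xml_out)
```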
Where’s the XML Data? [Diagram; surviving labels: Business Application Logic • Wrap • SOAP/CORBA/Java RMI • Export • Legacy databases • Import • Warehouse XML data • View • Minimal result]
XML and Databases • Data stored in SQL databases needs to be published in XML for data exchange • Specification schemes for publishing needed • Efficient publishing algorithms needed • Storage and retrieval of XML documents • Need to support mapping schemes • Need to support XML data-manipulation APIs
Storing XML • Storage is the foundation of efficient XML processing • XML demands its own storage techniques • Characteristics of XML data: Optional elements & values, repetition, choice, inherent order, large text fragments, mixed content • Characteristics of XML queries: Document order & structure, full-text search, transformation • Goals of tutorial • Existing storage features for XML • New storage features for XML
Outline • Introduction • XML Documents • XML Queries • Existing Storage Techniques • Non-native • Native • Physical Storage Features for XML
I. Introduction
Classes of XML Documents • Structured • “Un-normalized” relational data Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes • Mixed • Structured data embedded in large text fragments Ex: On-line manuals, transcripts, tax forms • Application may process XML in both classes Ex: SOAP messages (header is structured; payload is mixed)
Structured Data: HL7 Lab Report Health-care industry data-exchange format <HL7> <PATIENT> <PID IDNum="PATID1234"> <PaNa><FaNa>Jones</FaNa><GiNa>William</GiNa></PaNa> <DTofBi><date>1961-06-13</date></DTofBi> <Sex>M</Sex> </PID> <OBX SetID="1"> <ObsVa>150</ObsVa> <ObsId>Na</ObsId> <AbnFl>Above high</AbnFl> </OBX> ...
Queries on Structured Data • Analogs of SQL • Select-Project-Join, Sort by value Ex: Return admission records of patients discharged on 8/30/01 sorted by family and given names • Grouping & schema transformation Ex: Return per-patient record of admission, lab reports, doctors’ observations
Mixed Data: Library of Congress Documents of U.S. Legislation <bill bill-stage="Introduction"> <congress>110th CONGRESS</congress> <session>1st Session</session> <legis-num>H.R. 133</legis-num> <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber> <action date="June 5, 2008"> <action-desc> <sponsor>Mr. English</sponsor> (for himself and <cosponsor>Mr. Coyne</cosponsor>) introduced the following bill; which was referred to the <committee-name>Committee on Financial Services</committee-name> ... </action-desc>
Queries on Mixed Data • Full-text search operators Ex: Find all <bill>s where "striking" & "amended" are within 6 intervening words • Queries on structure & text Ex: Return <text> element containing both "exemption" & "social security" and preceding & following <text> elements • Queries that span (ignore) structure Ex: Return <bill> that contains “referred to the Committee on Financial Services”
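A proximity operator such as "within 6 intervening words" can be sketched in a few lines of Python. This is a naive illustration, not how a full-text index would evaluate it: the function `within`, the sample sentence and the `<quote>` tag are all invented, and the text is flattened first so that the search spans (ignores) element boundaries.

```python
import re
import xml.etree.ElementTree as ET

def within(text: str, first: str, second: str, max_gap: int = 6) -> bool:
    """True if `first` and `second` occur with at most `max_gap` words between them."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) - 1 <= max_gap for i in pos_a for j in pos_b)

# Invented mixed-content fragment: markup sits in the middle of running text.
fragment = ET.fromstring(
    "<text>Section 5 is <quote>amended</quote> by striking the second sentence</text>"
)
flat = "".join(fragment.itertext())   # drop the tags, keep the words in order

print(within(flat, "striking", "amended"))
```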
Properties of XML Data • Variance in structured content • Elements of same type have different structure • Nested sub-element might depend on parent • Direct access to sub-element not required • Order significant in sequence & mixed content • Structured data embedded in text • Schema known a priori or “open content model” • Desirable: explicit support in storage system
Properties of Queries • Query expressions depend on data properties • Variance • /PATIENT/(SURGERY | CHECK-UP) • Document order: XPath axes • /bill/co-sponsor[./text() = "Mrs. Clinton" and following-sibling::co-sponsor/text() = "Mr. Torricelli"] • Node identity: equality, union/intersect/except • If not supported in storage system, then operators are semantically incorrect or incomplete.
II. Existing Storage Techniques
Storage Techniques • Non-native • (Object) Relational, OO, LDAP directories • Indexing, recovery, transactions, updates, optimizers • Mapping from XML to target data model necessary • Captures variance in structured content • No support for mixed content • Recovering XML documents is expensive! • Native • Logical data model is XML • Physical storage features designed for XML
Non-native Techniques • Generic • Mapping from XML data to relational tables • Models XML as tree: semi-structured approach • Does not use DTD or XML Schema • Schema-driven • Mapping from schema constructs to relational • Fixed mapping from DTD to relational schema • Flexible mapping from XML Schema to relational • User-defined • Labor-intensive
Generic Mappings • Edge relation • store all edges in one table • Scalar values stored in separate table • Attribute relations • horizontal partition of Edge relation • Scalar values inlined in same table • Universal relation • full outer-join, redundancy • Captures node identity & document order • Element reconstruction requires multiple joins
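Edge Relation Example
[Figure: the HL7 document drawn as an edge-labeled tree of node ids, stored as an Edge table and a Value table]
&0 -HL7-> &1 -PATIENT-> &2
&2 -PID-> &3, &2 -OBX-> &4 …
&3 -@IDNum-> &5 (PATID1234), &3 -PaNa-> &6 (“Jones Wm”), &3 -DTofBi-> &7
&7 -date-> &8 (1961-06-13)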
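A minimal Python sketch of the Edge/Value shredding follows, assuming one particular layout: edge rows are (source, ordinal, label, target) and value rows are (node, value). Node ids are handed out in document order, so they will not match the numbering in the figure, and the names `shred` and `follow` are invented for this example.

```python
import xml.etree.ElementTree as ET
from itertools import count

def shred(root):
    """Return (edges, values): one edge row per element/attribute, scalars kept apart."""
    ids = count(1)            # id 0 is reserved for the document root
    edges, values = [], []

    def visit(elem, parent_id, ordinal):
        node_id = next(ids)
        edges.append((parent_id, ordinal, elem.tag, node_id))
        pos = 0
        for name, val in elem.attrib.items():
            pos += 1
            attr_id = next(ids)
            edges.append((node_id, pos, "@" + name, attr_id))
            values.append((attr_id, val))
        for child in elem:
            pos += 1
            visit(child, node_id, pos)   # the ordinal preserves document order
        if len(elem) == 0 and elem.text and elem.text.strip():
            values.append((node_id, elem.text.strip()))

    visit(root, 0, 1)
    return edges, values

def follow(edges, start, labels):
    """Walk a label path; each step is one lookup (one self-join) on the edge rows."""
    frontier = [start]
    for label in labels:
        frontier = [t for (s, _, l, t) in edges if s in frontier and l == label]
    return frontier

hl7 = ET.fromstring(
    '<HL7><PATIENT><PID IDNum="PATID1234"><PaNa>Jones Wm</PaNa>'
    "<DTofBi><date>1961-06-13</date></DTofBi></PID><OBX/></PATIENT></HL7>"
)
edges, values = shred(hl7)
name_nodes = follow(edges, 0, ["HL7", "PATIENT", "PID", "PaNa"])
print(dict(values)[name_nodes[0]])
```

Reaching `PaNa` takes four lookups on the same table, which is the "multiple joins" cost of element reconstruction noted for generic mappings.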
Generic Mappings: LDAP Directories • Flexible schema; easy schema evolution • Supports heterogeneous elements with optional values • Captures node identity & document order • Query language captures subset of XPath
LDAP Example
XMLElement OC { SUBCLASS OF {XMLNode} MUST CONTAIN {order} MAY CONTAIN {value} TYPE order INTEGER TYPE value STRING }
XMLAttribute OC { SUBCLASS OF {XMLNode} MUST CONTAIN {value} TYPE value STRING }
oc:XMLElement oid:1 name:PID order: 1
oc:XMLAttribute oid:1.1 name: IDNum value: PATID1234
oc:XMLElement oid:1.2 name: PaNa order: 1 value: Jones Wm
[Figure: PID subtree with children @IDNum (“PATID1234”), PaNa (“Jones Wm”), DTofBi (date: 1961-06-13), Sex (M)]
Schema-driven Mappings • Repetition : separate tables • Non-repeated sub-elements may be “inlined” • Optionality : nullable fields • Choice : multiple tables or universal table • Order : explicit ordinal value • Mixed content ignored • Element reconstruction may require multi-table joins because of normalization
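The mapping rules above can be made concrete with a small SQLite sketch driven from Python. The table and column names (`patient`, `obx`, `ord`, `parent_patient`) and the inserted values are invented for illustration; they are one possible outcome of such a mapping, not the output of any particular tool.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,   -- non-repeated sub-element, inlined
    sex  TEXT             -- optional sub-element, nullable
);
CREATE TABLE obx (        -- repeated sub-element, separate table
    obx_id INTEGER PRIMARY KEY,
    parent_patient INTEGER REFERENCES patient(patient_id),
    ord INTEGER NOT NULL, -- explicit ordinal keeps document order
    value TEXT
);
INSERT INTO patient VALUES (1, 'Jones', NULL);
INSERT INTO obx VALUES (10, 1, 2, '98'), (11, 1, 1, '150');
""")

# Rebuilding one PATIENT element needs a join, ordered by the stored ordinal.
rows = db.execute("""
    SELECT p.name, o.value
    FROM patient p JOIN obx o ON o.parent_patient = p.patient_id
    ORDER BY o.ord
""").fetchall()
print(rows)
```

The rows were inserted out of order on purpose: without the `ord` column the original sequence of the repeated elements could not be recovered.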
Fixed Mapping: Hybrid Inlining
<!ELEMENT PATIENT (Name, (OBX)*)>
<!ELEMENT OBX (Name, Value)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Value (#PCDATA)>
[Figure: DTD graph PATIENT -*-> OBX, PATIENT -> Name, OBX -> Name, OBX -> Value; resulting relations: PATIENT, OBX]
• Element with in-degree = 0 or > 1 in DTD graph → relation • Elements with in-degree = 1 inlined, except those reached by * • Exception: non-* & non-recursive elements with in-degree > 1 are inlined as well
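The inlining rules can be tried out on the example DTD with a short Python sketch. It is a simplification under stated assumptions: the DTD graph is written by hand as a dictionary, an element gets its own relation if its in-degree is 0, if it is reached through `*`, or if it is recursive with in-degree > 1, and groups of mutually recursive elements that all have in-degree 1 are not handled. All function names are made up.

```python
# DTD graph: parent -> list of (child, reached_through_star)
DTD = {
    "PATIENT": [("Name", False), ("OBX", True)],
    "OBX": [("Name", False), ("Value", False)],
    "Name": [],
    "Value": [],
}

def in_degree(graph):
    """Number of incoming edges for every element of the DTD graph."""
    deg = {elem: 0 for elem in graph}
    for children in graph.values():
        for child, _ in children:
            deg[child] += 1
    return deg

def is_recursive(graph, elem):
    """True if `elem` can reach itself by following child edges."""
    stack, seen = [c for c, _ in graph[elem]], set()
    while stack:
        cur = stack.pop()
        if cur == elem:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(c for c, _ in graph[cur])
    return False

def relations(graph):
    """Elements that get their own relation; everything else is inlined."""
    deg = in_degree(graph)
    starred = {c for children in graph.values() for c, star in children if star}
    return {e for e in graph
            if deg[e] == 0 or e in starred or (deg[e] > 1 and is_recursive(graph, e))}

print(sorted(relations(DTD)))
```

`Name` has in-degree 2 but is neither starred nor recursive, so under these rules it is inlined into both PATIENT and OBX instead of getting a table of its own.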
Flexible Mapping: LegoDB • Canonical mapping from XML Schema to relational • Every complex type → relation • Semantics-preserving XML Schema to XML Schema transformations Ex: Inlining/outlining, Union factorization/distribution, Repetition split • Greedy algorithm for choosing mapping • Mapping cost determined by query mix • Use relational optimizer to estimate cost of mapping
LegoDB Example • Inline type in parent vs. outline type in own relation
Outlined
XML: type OBX = element value { Integer }, type Description
     type Description = element description { String }
Relational: TABLE OBX (OBX_id INT, value INT, parent_PATIENT INT)
            TABLE Description (Description_id INT, description STRING, parent_OBX INT)
Inlined
XML: type OBX = element value { Integer }, element description { String }
Relational: TABLE OBX (OBX_id INT, value INT, description STRING, parent_PATIENT INT)
User-Defined Mappings • No automatic translation from DTD or XML Schema • Annotated schemas or special-purpose queries • Value-based semantics only • Document structure represented by keys/foreign keys • No explicit representation of document order or node identity • Some support for mixed content
Oracle 9i • Canonical mapping into user-defined object-relational tables • Arbitrary XML input • XSLT preprocessing into multiple XML documents, load individually • Stores XML documents in CLOBs (character large objects) • Permits full-text search • Hybrid of canonical mapping & CLOB
<row> <Person> <Name><FN>…</FN><LN>…</LN></Name> <Addr><City>…</City></Addr>* </Person> </row>
table PERSON(Name NAME, Alist ALIST)
object NAME(FN STR, LN STR)
table ALIST of ADDR
object ADDR(City CITY)
IBM DB2 XML Extender • Declarative decomposition of arbitrary XML • Pure relational mapping (no object features used)
<element_node name="Order">
  <table name="order_tab"/>
  <table name="part_tab"/>
  <condition> order_tab.order_key = part_tab.order_key </condition>
  <attribute_node name="key">
    <table name="order_tab"/>
    <column name="order_key"/>
  </attribute_node>
</element_node>
• Mixed content: CLOBs + side tables for indexing structured data embedded in text