XML and Databases

XML and Databases 198:541

XML Motivation

XML Motivation • Huge amounts of unstructured data on the web: HTML documents • No structure information • Only format instructions (presentation) • Integration of data from different sources • Structural differences • Closely related to semistructured data

Semistructured Data • Integration of heterogeneous sources • Data sources with non rigid structures • Biological data • Web data • Need for more structural information than plain text, but less constraints on structure than in relational data

Characteristics of Semistructured Data • Missing or additional tuples • Multiple attributes • Different types in different objects • Heterogeneous collections • Self-describing, irregular data with no a priori structure

HTML Document Example • <h1> Bibliography </h1> • Foundations of Databases, Abiteboul, Hull, Vianu, Addison Wesley, 1995 • Data on the Web, Abiteboul, Buneman, Suciu, Morgan Kaufmann, 1999 • Slide annotations mark the type of information each fragment carries: book, title, authors, year

The Idea Behind XML • Easily support information exchange between applications / computers • Reuse what worked in HTML • Human readable • Standard • Easy to generate and read • But allow arbitrary markup • Uniform language for semistructured data • Data Management

XML

XML • eXtensible Markup Language • Universal standard for documents and data • Defined by W3C • Set of emerging technologies • XLink, XPointer, XSchema, DOM, SAX, XPath, XQuery,…

XML • XML gives a syntax, not a semantics • XML defines the structure of a document, not how it is processed • Separate structural information from format instructions

XML Example <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography>

XML Terminology • Tags: book, title, author,… • Start tag: <book> • End Tag: </book> • Elements are nested • Empty Element • <reviews></reviews> => <reviews/> • XML Document: single root element • XML Document is well formed: matching tags

XML Attributes • Attributes are <name, value> pairs that characterize an element. <book price="55" currency="USD"> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> • Can define object identifiers (oid), but they are just syntax

More XML • Text can be CDATA or PCDATA • Entity References: &amp; for &, &gt; for >,… • Processing Instructions: <?blink?> • Comments: <!-- … -->
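A minimal Python sketch of how entity references and CDATA sections behave in practice, using only the standard library. The sample string and the <note> element are made up for illustration and do not come from the slides.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "AT&T > IBM"  # hypothetical text containing markup characters

# Writing side: replace & and > with entity references.
escaped = escape(raw)

# Reading side: the parser resolves the entity references again.
parsed_text = ET.fromstring("<note>" + escaped + "</note>").text

# A CDATA section carries the same text without any escaping.
cdata_text = ET.fromstring("<note><![CDATA[AT&T > IBM]]></note>").text

print(escaped)
print(parsed_text == raw, cdata_text == raw)
```

Both routes hand the application the same character data; the difference is only in how the text is written in the document.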

Well Formed XML Documents • Elements must be properly nested • <book><title> Foundations of Databases </title></book> • But Not: • <book><title> Foundations of Databases</book></title> • There must be a unique root element • Elements can be of • ‘element content’ • or ‘mixed content’: • <title>This is MixedContent</title>
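The well-formedness rules above can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed is an assumption introduced here, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The two examples from the slide, plus a document with two root elements.
good = "<book><title> Foundations of Databases </title></book>"
bad_nesting = "<book><title> Foundations of Databases</book></title>"
two_roots = "<book/><book/>"

for sample in (good, bad_nesting, two_roots):
    print(is_well_formed(sample), sample)
```

Note that this only tests well-formedness; checking a document against a DTD or schema is a separate validation step.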

XML: Potential • Flexible enough to represent anything • Stock market, DNA, Music, Chemicals • Weather information • Wireless network configuration • Enables easy information exchange • Between companies • Within companies • Standard: everybody uses the same technology

XML: Limitations • XML is only a syntax for documents • We need tools! • Editors and parsers • Programming APIs (for Java, C++, etc.) • Languages to manipulate XML (how many books?) • Schemas (What is a book like?) • Storage (What if you have a lot of XML?) • Transfer protocols (How do you exchange it?) • What about XML in Chinese…? • How can XML fit into my phone…? • Query processing? • …

XML Schema Language

DTDs: Document Type Definitions • Similar to a schema • Grammar describing constraints on document structure and content • XML Documents can be validated against a DTD <!ELEMENT Book (title, author*)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (name, address, age?)> <!ATTLIST Book id ID #REQUIRED> <!ATTLIST Book pub IDREF #IMPLIED>

Shortcomings of DTDs • Useful for documents, but not so good for data: • No support for structural re-use • Object-oriented-like structures aren’t supported • No support for data types • Can’t do data validation • Can have a single key item (ID), but: • No support for multi-attribute keys • No support for foreign keys (references to other keys) • No constraints on IDREFs (e.g., cannot require that an IDREF reference only a Section)

XSchema • In XML format • Includes primitive data types (integers, strings, dates,…) • Supports value-based constraints (integers > 100) • Inheritance • Foreign keys • …

Example of XSchema <schema version="1.0" xmlns="http://www.w3.org/1999/XMLSchema"> <element name="author" type="string" /> <element name="date" type="date" /> <element name="abstract"> <type> … </type> </element> <element name="paper"> <type> <attribute name="keywords" type="string"/> <element ref="author" minOccurs="0" maxOccurs="*" /> <element ref="date" /> <element ref="abstract" minOccurs="0" maxOccurs="1" /> <element ref="body" /> </type> </element> </schema>

XML Storage

Storing XML Data • Different approaches: • Storing as text • Using RDBMS • Using a native system tailored for XML (NATIX, Tamino, Ipedo, etc.) • Performance of the various approaches depends on your application

Storing XML as Text • Simple • Easy to compress • No updates • Need to parse the document every time it is needed

Storing XML in RDBMS • Uses existing RDBMS techniques • Costly in space, takes time to reconstruct original document • Example techniques: • Schema with 2 relations: tag and value • Schema with n relations: 1 per element name
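One possible reading of the "2 relations: tag and value" technique, sketched with Python's built-in sqlite3. The table names xml_tag and xml_value, their columns, and the shred helper are assumptions made for this illustration; the slides do not specify the schema.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count

DOC = """<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>"""


def shred(xml_text, conn):
    """Store every element in xml_tag and every text value in xml_value."""
    conn.execute(
        "CREATE TABLE xml_tag (node_id INTEGER PRIMARY KEY, "
        "parent_id INTEGER, name TEXT)"
    )
    conn.execute("CREATE TABLE xml_value (node_id INTEGER, text TEXT)")
    ids = count(1)  # node ids are assigned in document (preorder) order

    def visit(elem, parent_id):
        node_id = next(ids)
        conn.execute(
            "INSERT INTO xml_tag VALUES (?, ?, ?)", (node_id, parent_id, elem.tag)
        )
        text = (elem.text or "").strip()
        if text:
            conn.execute("INSERT INTO xml_value VALUES (?, ?)", (node_id, text))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)


conn = sqlite3.connect(":memory:")
shred(DOC, conn)

# Querying requires a join between the two relations.
authors = [
    row[0]
    for row in conn.execute(
        "SELECT v.text FROM xml_tag t JOIN xml_value v ON v.node_id = t.node_id "
        "WHERE t.name = 'author' ORDER BY t.node_id"
    )
]
tag_count = conn.execute("SELECT COUNT(*) FROM xml_tag").fetchone()[0]
print(authors, tag_count)
```

Even this small document needs one row per element plus one per text value, and rebuilding the original text means walking the parent_id links, which illustrates the space and reconstruction costs mentioned above.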

Accessing and Querying XML Data

XML as a Tree: DOM • DOM = Document Object Model • Class hierarchy serving as an API to XML trees • Methods of those classes can be used to manipulate XML (e.g., Node::child, Node::name) • Can be used from Java, C++ to develop XML applications. • Each node has an identity (i.e., a unique identifier) in the whole document

XML as a DOM Tree • Class hierarchy (node, element, attribute) • [Figure: tree rooted at bibliography with two book children; the first book has children title (Foundations of Databases), author (Abiteboul), author (Hull), author (Vianu), publisher (Addison Wesley), year (1995)]
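A short sketch of DOM-style access using Python's standard xml.dom.minidom, where the slide's Node::child and Node::name roughly correspond to childNodes and nodeName. The document string is the bibliography example written without whitespace between tags, an assumption made so that the tree contains no whitespace-only text nodes.

```python
from xml.dom.minidom import parseString

DOC = (
    "<bibliography><book>"
    "<title>Foundations of Databases</title>"
    "<author>Abiteboul</author><author>Hull</author><author>Vianu</author>"
    "<publisher>Addison Wesley</publisher><year>1995</year>"
    "</book></bibliography>"
)

dom = parseString(DOC)
root = dom.documentElement           # the single root element
book = root.firstChild               # navigate from a node to its first child

# Every node exposes its name and its children.
child_names = [node.nodeName for node in book.childNodes]
authors = [a.firstChild.data for a in dom.getElementsByTagName("author")]

# The tree can also be modified in memory and serialized again.
book.appendChild(dom.createElement("reviews"))
new_child_xml = book.lastChild.toxml()

print(root.nodeName, child_names, authors, new_child_xml)
```

The whole document is held in memory as a tree, which makes navigation and updates convenient but can be costly for large documents.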

XML as a Stream: SAX • XML document = event stream. E.g., • Opening tag ‘book’ • Opening tag ‘title’ • Text “Foundations of databases” • Closing tag ‘title’ • Opening tag ‘author’ • Etc. • SAX allows you to associate actions with those events to build applications • Very efficient since it corresponds to events during parsing, but not always sufficient.
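The event stream described above can be observed with Python's standard xml.sax module. In this sketch the EventLogger class and its event names ("open", "text", "close") are assumptions chosen to mirror the slide's wording; text is buffered because a parser may deliver one text run in several pieces.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records parse events as (kind, value) pairs."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._text = []

    def _flush_text(self):
        # Emit buffered character data as a single text event, if any.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.events.append(("text", text))

    def startElement(self, name, attrs):
        self._flush_text()
        self.events.append(("open", name))

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.events.append(("close", name))


DOC = b"<book><title>Foundations of Databases</title><author>Abiteboul</author></book>"
handler = EventLogger()
xml.sax.parseString(DOC, handler)

for event in handler.events:
    print(event)
```

The handler never sees the document as a whole, only one event at a time, which is why SAX is memory-efficient but awkward for queries that need to look backward or across branches.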

XPath • Language for navigating in an XML document (seen as a tree) • One root node • Types of nodes: root, element, text, attribute, comment,… • An XPath expression defines navigation in the tree along axes: child, descendant, parent, ancestor,…

XPath: Examples • Find all the titles of all the books: • //book/title • Find the title of all books written by Charles Dickens • //book[author="Charles Dickens"]/title • Find the title of the first section in the second chapter in "Great Expectations" • //book[title="Great Expectations"]/chapter[2]/section[1]/title • Find the title of all sections that come after the second chapter in "Great Expectations": • //book[title="Great Expectations"]/chapter[2]/following::section/title
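The first three expressions can be tried with Python's standard xml.etree.ElementTree, which implements only a subset of XPath: paths are written relative to an element (".//" instead of "//"), and the following axis used in the last example is not supported. The sample document is hypothetical and was made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the slides.
DOC = """<library>
  <book>
    <title>Great Expectations</title>
    <author>Charles Dickens</author>
    <chapter><title>Chapter 1</title>
      <section><title>S1.1</title></section>
    </chapter>
    <chapter><title>Chapter 2</title>
      <section><title>S2.1</title></section>
      <section><title>S2.2</title></section>
    </chapter>
  </book>
  <book>
    <title>Data on the Web</title>
    <author>Abiteboul</author>
  </book>
</library>"""

root = ET.fromstring(DOC)

# //book/title
all_titles = [t.text for t in root.findall(".//book/title")]

# //book[author="Charles Dickens"]/title
dickens_titles = [
    t.text for t in root.findall(".//book[author='Charles Dickens']/title")
]

# //book[title="Great Expectations"]/chapter[2]/section[1]/title
first_section = [
    t.text
    for t in root.findall(
        ".//book[title='Great Expectations']/chapter[2]/section[1]/title"
    )
]

print(all_titles, dickens_titles, first_section)
```

A full XPath engine would be needed for the axis-based query; this sketch only covers child steps, value predicates, and positional predicates.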

Querying XML Data • Need for a language to query XML data • Should yield XML output • Should support standard query operations • No schema required • Several proposals for an XML query language: XML-QL, XQuery,…

XQuery • XPath included in XQuery • FLWR expressions: for, let, where, return • FOR $x IN document("bib.xml")/bib/book WHERE $x/year > 1995 RETURN $x/title • Result: <title> abc </title> <title> def </title> <title> ghi </title>
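To make the meaning of the FOR / WHERE / RETURN clauses concrete, the same selection can be written as a Python comprehension over an ElementTree. This is only an analogy, not XQuery itself, and the inline bibliography below is hypothetical data standing in for bib.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of bib.xml.
BIB = """<bib>
  <book><title>abc</title><year>1999</year></book>
  <book><title>old</title><year>1995</year></book>
  <book><title>def</title><year>2001</year></book>
</bib>"""

bib = ET.fromstring(BIB)

result = [
    ET.tostring(book.find("title"), encoding="unicode").strip()  # RETURN $x/title
    for book in bib.findall("book")                              # FOR $x IN /bib/book
    if int(book.findtext("year")) > 1995                         # WHERE $x/year > 1995
]

print(result)
```

As in the XQuery version, the output consists of XML fragments rather than plain values, so the result can itself be processed as XML.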

How to process XML Queries? • Use indexes • Need to identify nodes • Need to know relations between nodes • Labeling Schemes • Dewey encoding • Prefix-Postfix encoding • TwigStack
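A minimal sketch of two labeling schemes, showing how labels alone can answer "is node a an ancestor of node b?" without traversing the tree. The Dewey version labels each node with the path of child positions from the root. The second version assumes that "Prefix-Postfix encoding" refers to preorder and postorder numbering, which is an interpretation rather than something the slides state. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def dewey_labels(root):
    """Map each element to a tuple of 1-based child positions from the root."""
    labels = {}

    def visit(elem, label):
        labels[elem] = label
        for position, child in enumerate(elem, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labels


def dewey_is_ancestor(a, b):
    """a is a proper ancestor of b iff a's label is a proper prefix of b's."""
    return len(a) < len(b) and b[: len(a)] == a


def pre_post_labels(root):
    """Map each element to its (preorder, postorder) numbers."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        pre = counters["pre"]
        for child in elem:
            visit(child)
        counters["post"] += 1
        labels[elem] = (pre, counters["post"])

    visit(root)
    return labels


def pre_post_is_ancestor(a, b):
    """a is an ancestor of b iff a starts before b and finishes after b."""
    return a[0] < b[0] and a[1] > b[1]


root = ET.fromstring(
    "<bibliography><book><title/><author/></book><book><title/></book></bibliography>"
)
book1, book2 = root[0], root[1]
author = book1[1]

dewey = dewey_labels(root)
prepost = pre_post_labels(root)
print(dewey[author], prepost[author])
print(dewey_is_ancestor(dewey[book1], dewey[author]))
print(pre_post_is_ancestor(prepost[book2], prepost[author]))
```

With either scheme, structural relationships reduce to comparisons on stored labels, which is what lets an index answer them.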

Web Services

What are Web Services • Programming interfaces for application to application communication on the Web • platform-independent, • language-independent • object model-independent • Possibility to activate methods on remote web servers (RPC) • 2 main applications • E-commerce • Access to remote data

XML and Web Services • Exchange of information between applications is in XML • Input and Result • Use of SOAP to generate messages • Descriptions of the web service functionality given in XML, according to the WSDL schema • Web Services standards use XML heavily

Conclusions • XML: a very active area • Many research directions • Many applications • Standards not finalized yet: XQuery, XML Schema, Web Services,…

Some Important XML Standards • XSL/XSLT: presentation and transformation standards • RDF: resource description framework (meta-info such as ratings, categorizations, etc.) • XPath/XPointer/XLink: standard for linking to documents and elements within • Namespaces: for resolving name clashes • DOM: Document Object Model for manipulating XML documents • SAX: Simple API for XML parsing • …

References • XML • http://www.w3.org/XML/ • Sudarshan S. Chawathe: Describing and Manipulating XML Data. IEEE Data Engineering Bulletin 22(3), 1999 • XML Standards • http://www.w3.org/ (XSL, XPath, XSchema, DOM…) • Storing XML Data • Daniela Florescu, Donald Kossmann: Storing and Querying XML Data using an RDBMS. IEEE Data Engineering Bulletin 22(3), 1999 • Hartmut Liefke, Dan Suciu: XMill: An Efficient Compressor for XML Data. SIGMOD Conference, 2000 • XQuery • http://www.w3.org/TR/xquery/ • Peter Fankhauser: XQuery Formal Semantics: State and Challenges. SIGMOD Record 30(3), 2001 • Web Services • http://www.w3.org/2002/ws/
