Store XML Data in a Relational Database

• XML shredding
• Selective mapping from XML to relations
• Propagating constraints from XML to relations
• Querying XML views
QSX (LN 4)

Storing XML data
• Flat streams: store XML data as is in text files
  • fast for storing and retrieving whole documents
  • query support: limited; concurrency control: none
• Native XML databases: designed specifically for XML
  • XML document stored as is
  • efficient support for XML queries
  • many techniques need to be re-developed
• Colonial strategies: re-use existing storage systems
  • leverage mature systems (DBMS)
  • simple integration with legacy data
  • map the XML document into underlying structures, e.g., shred the document into flat tables

Why transform XML data to relations?
• Native XML databases need to build new systems from the ground up for
  • storing XML data, indexing
  • query processing/optimization
  • concurrency control
  • updates
  • access control, ...
  This is nontrivial.
• Sophisticated query processing, storage and concurrency control techniques have been developed for relational DBMS
Thus, why not take advantage of available DBMS techniques?

From XML to relations
Store and query XML data using a traditional DBMS
• Derive a relational schema from an XML DTD (schema)
• Shred XML data into relational tuples to store the XML data
• Translate XML queries to SQL queries
• Convert query results back to XML
[Figure: XML is stored in an RDB managed by the DBMS; queries go through query translation and the answers come back as XML]

Architecture: XML shredding
[Figure: architecture with the components relational schema generator, XML document shredder, XML query parsing, query rewriting and XML tagger (tagging results), on top of a DBMS with an RDB store]

Nontrivial issues
• Mapping DTDs to relational schemas: data model mismatch
  • DTD: recursive, regular expressions
  • relational schema: tables
• Query translation: query language mismatch
  • XQuery, XSLT: Turing-complete
  • SQL: first-order
• Information preservation
  • lossless: there should be an effective method to reconstruct the original XML document from its relational storage, i.e., the inverse of the storage mapping should exist
  • propagation/preservation of integrity constraints (discussed shortly)

Derivation of relational schema from DTD
[Figure: the relational schema generator and XML document shredder components of the architecture]
• Lossless: the original document can be effectively reconstructed from its relational representation
• Query support: it should be possible to rewrite XML queries into efficient relational queries
• Normal form: BCNF and 3NF have proved extremely useful for maintaining integrity under updates (addressed later)

Running example: a book document
DTD:
  <!ELEMENT db (book*)>
  <!ELEMENT book (title, author*, chapter*, ref*)>
  <!ELEMENT chapter (text | section)*>
  <!ELEMENT ref (book)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT section (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
• Recursive
• Complex regular expressions

Graph representation of the DTD
[Figure: DTD graph with nodes db, book, title, author, chapter, ref, text, section; edges marked * lead to book, author, chapter, ref, text and section; ref points back to book]
• Each element type/attribute is represented by a unique node
• Edges represent the subelement (and attribute) relations
• *: 0 or more occurrences of subelements
• Cycles indicate recursion, e.g., book
• Simplification: e.g., (text | section)* → text* | section* -- ignore order
• XML documents conforming to the DTD are those trees that unfold the graph (special treatment: * → empty at a leaf)
Can this be avoided?

Canonical representation
Store an XML document as a graph (tree)
• Node relation: node(nodeId, tag, type), e.g., node(02, book, element), node(03, author, element)
• Edge relation: edge(sid, did), where sid and did are the source and destination nodes, e.g., edge(02, 03)
Pros and cons
• Lossless: the original document can be reconstructed; order preserving: one can add "order" information to the edge relation
• Ignores the topological structure
• Querying efficiency: requires multi-table joins or self joins for element reconstruction
A simple query /db/book[author="Bush"]/title requires 3 joins of the edge relation!
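The cost of the canonical representation can be made concrete with a small sketch. The Python below is only an illustration under assumptions: it uses the standard sqlite3 and xml.etree modules, adds a hypothetical value column to node for element text and an ord column to edge for sibling order, and the sample document is invented. The SQL it runs for /db/book[author="Bush"]/title uses the edge relation three times.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample document conforming to the running-example DTD.
DOC = """<db>
  <book><title>Memoirs</title><author>Bush</author></book>
  <book><title>XML Basics</title><author>Smith</author></book>
</db>"""


def shred_to_graph(xml_text):
    """Store every element as a node row and every parent/child pair as an edge row."""
    conn = sqlite3.connect(":memory:")
    # 'value' and 'ord' are assumptions added for this sketch; the slide lists
    # only node(nodeId, tag, type) and edge(sid, did).
    conn.execute("CREATE TABLE node (nodeId INTEGER PRIMARY KEY, tag TEXT, type TEXT, value TEXT)")
    conn.execute("CREATE TABLE edge (sid INTEGER, did INTEGER, ord INTEGER)")
    counter = 0

    def visit(elem, parent_id, position):
        nonlocal counter
        counter += 1
        node_id = counter
        text = (elem.text or "").strip() or None
        conn.execute("INSERT INTO node VALUES (?, ?, 'element', ?)", (node_id, elem.tag, text))
        if parent_id is not None:
            conn.execute("INSERT INTO edge VALUES (?, ?, ?)", (parent_id, node_id, position))
        for i, child in enumerate(elem):
            visit(child, node_id, i)

    visit(ET.fromstring(xml_text), None, 0)
    return conn


# /db/book[author="Bush"]/title: one edge join per step (db->book, book->author, book->title).
QUERY = """
SELECT t.value
FROM node d
JOIN edge e1 ON e1.sid = d.nodeId
JOIN node b  ON b.nodeId = e1.did AND b.tag = 'book'
JOIN edge e2 ON e2.sid = b.nodeId
JOIN node a  ON a.nodeId = e2.did AND a.tag = 'author' AND a.value = 'Bush'
JOIN edge e3 ON e3.sid = b.nodeId
JOIN node t  ON t.nodeId = e3.did AND t.tag = 'title'
WHERE d.tag = 'db'
"""

conn = shred_to_graph(DOC)
titles = [row[0] for row in conn.execute(QUERY)]
node_count = conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]
edge_count = conn.execute("SELECT COUNT(*) FROM edge").fetchone()[0]
```

Every additional path step adds another self join on edge, which is the inefficiency the schema-driven approach tries to avoid.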

Schema-driven: the shared inlining algorithm
[Figure: DTD graph of the book document]
Requires the availability of the document DTD
• Represent the DTD as a graph (simplifying regular expressions)
• Traverse the DTD graph depth-first and create relations for the following nodes
  • the root
  • each * node
  • each recursive node
  • each node of in-degree > 1
• Inlining: nodes with in-degree of 1 are inlined; no relation is created for them

Relational schema
[Figure: DTD graph of the book document]
To preserve the semantics
• ID: each relation has an artificial ID (key)
• parentID: foreign key coding the edge relation
• Column naming: path in the DTD graph
  db(dbID)
  book(bookID, parentID, code, title: string)
  author(authorID, bookID, author: string)
  chapter(chapterID, bookID)
  ref(refID, bookID)
  text(textID, chapterID, text: string)
  section(sectionID, chapterID, section: string)
Note: title is inlined

Keys and foreign keys
[Figure: DTD graph of the book document]
  db(dbID)
  book(bookID, parentID, code, title: string)
  author(authorID, bookID, author: string)
  chapter(chapterID, bookID)
  ref(refID, bookID)
  text(textID, chapterID, text: string)
  section(sectionID, chapterID, section: string)
• Keys: book.bookID, author.authorID, …
• Foreign keys: referring to the parent node
  • book.parentID ⊆ db.dbID if code = 1
  • book.parentID ⊆ ref.refID if code = 0
  • author.bookID ⊆ book.bookID, similarly for chapter and ref
  • text.chapterID ⊆ chapter.chapterID, similarly for section
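As a rough illustration of how a document could be shredded into this schema, here is a minimal Python sketch. It is not the lecture's implementation: the function names, the sample document, and the use of a single global id counter are assumptions made for the example, and sibling order is ignored just as in the simplified DTD graph.

```python
import itertools
import xml.etree.ElementTree as ET

# Invented sample document: book A cites book B.
DOC = """<db>
  <book>
    <title>A</title><author>Bush</author>
    <chapter><text>intro</text><section>s1</section></chapter>
    <ref><book><title>B</title><author>Smith</author></book></ref>
  </book>
</db>"""


def shred(xml_text):
    """Shred a book document into tuples of the inlined relational schema."""
    ids = itertools.count(1)  # assumption: one counter supplies all artificial IDs
    tables = {name: [] for name in
              ("db", "book", "author", "chapter", "ref", "text", "section")}

    def shred_book(book, parent_id, code):
        book_id = next(ids)
        # title is inlined into book: it has in-degree 1 and occurs exactly once.
        tables["book"].append((book_id, parent_id, code, book.findtext("title")))
        for a in book.findall("author"):
            tables["author"].append((next(ids), book_id, a.text))
        for ch in book.findall("chapter"):
            ch_id = next(ids)
            tables["chapter"].append((ch_id, book_id))
            for t in ch.findall("text"):
                tables["text"].append((next(ids), ch_id, t.text))
            for s in ch.findall("section"):
                tables["section"].append((next(ids), ch_id, s.text))
        for r in book.findall("ref"):
            ref_id = next(ids)
            tables["ref"].append((ref_id, book_id))
            for cited in r.findall("book"):
                shred_book(cited, ref_id, 0)  # code = 0: the parent is a ref

    root = ET.fromstring(xml_text)
    db_id = next(ids)
    tables["db"].append((db_id,))
    for b in root.findall("book"):
        shred_book(b, db_id, 1)  # code = 1: the parent is db
    return tables


tables = shred(DOC)
```

The code column is what lets book.parentID point either at db or at ref, matching the two conditional foreign keys.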

Schema-driven: summary
• Use DTD/XML Schema to decompose the document
• Shred inlining:
  • Rule of thumb: inline as much as possible to minimize the number of joins
  • Shared: do not inline if shared, set-valued, or recursive
  • Hybrid: also inline if shared but not set-valued or recursive
• Querying: supports a large class of commonly encountered XML queries
  • fast lookup and reconstruction of inlined elements
  • reconstruction may require multi-table joins and unions
• Twist of regular expressions: (text | section)* → text* | section*
How to improve this?

Schema-driven: summary (cont.)
Instance mapping can be easily derived from schema mapping.
• Is it lossless? The order information is lost (simplification of the regular expressions defining element types)
• Is there anything missing?
  • "core dumping" the entire document to a new database
  • in practice one often wants to
    • select relevant data from the document
    • store the selected data in an existing database of a predefined schema
  • XML Schema: type + constraints
What happens to XML constraints? Can we achieve normal forms (BCNF, 3NF) for the relational storage?

Example: selectively storing XML data
[Figure: DTD graph of the book document; SQL inserts flow from the book document (XML) into the existing database R]
Existing relational database R:
  book(id, title)
  ref(id1, id2)
Select data from XML and store it in R
• books with title containing "WMD", and
• books cited, directly or indirectly, by such books
Difference:
• select only part of the data from an input document
• store the data in an existing database with a fixed schema

Mapping specification: XML2DB mappings
XML2DB mapping:
• Input: an XML document T of a DTD D, and an existing database schema R
• Output: a list of SQL inserts ΔR, updating the database of R
An extension of attribute grammars:
• treat the DTD D as an ECFG (extended context-free grammar)
• associate semantic attributes and actions with each production of the grammar
  • attributes: passing data top-down
  • actions: generate SQL inserts ΔR
• Evaluation: generate SQL inserts in parallel with XML parsing

XML2DB mappings
• DTD: normalized; element type definitions e → α, where
    α ::= PCDATA | ε | e1, …, en | e1 + … + en | e*
• Relation variables: for each relation schema Ri, define a variable ΔRi, which holds tuples to be inserted into Ri
• Attributes: $e associated with each element type e
  • $e: tuple-valued, to pass data values top-down
• Rules: associated with each production e → α; conditional statements
  • for each e' in α, define $e' using the parent attribute $e
  • insert with relation variables: ΔRi := ΔRi ∪ {tuple}

Example: XML2DB mapping
[Figure: DTD graph, and a document tree with root db whose book children each receive the attribute $book]
• db → book*
    $book := top   /* indicating the children of the root */

Semantic actions
• book → title, author*, chapter*, ref*
    if text(title) contains "WMD" or ($book <> top and $book <> bot)
    then id := gen_id( );                              /* generate a new id */
         Δbook := Δbook ∪ { (id, text(title)) };       /* insert into book */
         if $book <> top and $book <> bot              /* cited by a selected book */
         then Δref := Δref ∪ { ($book, id) };          /* insert into ref */
         $ref := id;                                   /* passing information downward */
    else $ref := bot
  Recall the relation schema: book(id, title), ref(id1, id2)
• gen_id( ): a function generating a fresh unique id
• Conditional: the book either has "WMD" in its title or is referenced by a WMD book
[Figure: document tree in which $book is passed from db to each book, and $ref from a book with title "WMD" down to its ref children]

Implementing XML2DB mappings
[Figure: pipeline from XML to R: XML2DB parsing (SAX parsing with SAX actions, generating SQL inserts), followed by execution of the SQL inserts]
SAX parsing extended with corresponding semantic actions
• startDocument( ), endDocument( );
• startElement(A, eventNo), endElement(A);
• text(s)
SQL updates: insert into book select * from Δbook
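The following Python sketch shows one way the semantic actions for the WMD example could be attached to SAX events. It is an illustration under assumptions, not the XML2DB system: the class name, the frame dictionaries, the sample document and the sql_inserts helper are all hypothetical, and it relies on the DTD placing title before ref inside each book. A tuple is added to ref only when the inherited attribute is neither top nor bot, so that no placeholder value ends up in the relation.

```python
import itertools
import xml.sax

TOP, BOT = "top", "bot"  # sentinel attribute values, as in the mapping rules

DOC = """<db>
  <book><title>WMD report</title>
    <ref><book><title>Intelligence</title>
      <ref><book><title>Sources</title></book></ref>
    </book></ref>
  </book>
  <book><title>Gardening</title>
    <ref><book><title>Soil</title></book></ref>
  </book>
</db>"""


class SelectiveShredder(xml.sax.ContentHandler):
    """SAX handler that collects tuples for book(id, title) and ref(id1, id2)."""

    def __init__(self):
        super().__init__()
        self.book_rows, self.ref_rows = [], []
        self._ids = itertools.count(1)   # plays the role of gen_id()
        self._books = []                 # one frame per open <book>: $book and $ref
        self._text = None                # buffer for the current <title> text

    def startElement(self, name, attrs):
        if name == "book":
            # $book is inherited: top under db, otherwise the $ref of the citing book.
            inherited = self._books[-1]["ref"] if self._books else TOP
            self._books.append({"book": inherited, "ref": BOT})
        elif name == "title":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "title" and self._books:
            title, self._text = "".join(self._text), None
            frame = self._books[-1]
            cited = frame["book"] not in (TOP, BOT)
            if "WMD" in title or cited:
                new_id = next(self._ids)
                self.book_rows.append((new_id, title))
                if cited:
                    self.ref_rows.append((frame["book"], new_id))
                frame["ref"] = new_id    # pass the id down to the ref children
        elif name == "book":
            self._books.pop()


def sql_inserts(handler):
    """Pair each collected tuple with a parameterized INSERT statement."""
    stmts = [("INSERT INTO book(id, title) VALUES (?, ?)", r) for r in handler.book_rows]
    stmts += [("INSERT INTO ref(id1, id2) VALUES (?, ?)", r) for r in handler.ref_rows]
    return stmts


handler = SelectiveShredder()
xml.sax.parseString(DOC.encode("utf-8"), handler)
inserts = sql_inserts(handler)
```

The tuples are produced in a single pass while parsing, which is the point of evaluating the actions together with the SAX events.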

XML constraints
[Figure: sample document tree: db with four book children; books have isbn, title and chapter children; chapters have number, title, section and text children; sample values include the titles "XML", "XPath" and "DTD" and the numbers "1", "10" and "6"]
Recall: an XML schema consists of both types and constraints
• (//book, {isbn}) -- isbn is an (absolute) key of book
• (//book, (chapter, {number})) -- number is a key of chapter relative to book
• (//book, (title, {})) -- each book has a unique (at most one) title

Mapping from XML to a predefined relation
[Figure: sample book document tree]
One wants to store certain information from the XML document in:
  RDB: chapter(bookTitle, chapterNum, chapterTitle)
• Mapping: for each book, extract its title, and the numbers and titles of all its chapters
• Predefined relational key: (bookTitle, chapterNum)
Can the XML document be mapped to the relation without violating the key?

A safe mapping
[Figure: sample book document tree]
Now change the relational schema to
  RDB: chapter(isbn, chapterNum, chapterTitle)
The relation can be populated without any violation. Why?
The relational key (isbn, chapterNum) for chapter is implied (entailed) by the keys on the original XML data:
  (//book, {isbn})
  (//book, (chapter, {number}))
  (//book, (title, {}))

Why do we care about constraints?
• Constraints are a fundamental part of the semantics of the data; a mapping from XML to relations should not lose this information
• Relational constraints are important for query optimization, data cleaning, and consistency/integrity maintenance, ...
• Constraints help us determine whether a relational schema for storing XML data makes sense or not
Problem statement: constraint propagation
• Input: a set K of XML keys, a predefined relational schema S, a mapping f from XML to S, and a functional dependency FD over S
• Output: is FD implied by K via f? I.e., does FD hold over f(T) for any XML document T that satisfies K?
Note: an XML schema/DTD is not required; K is the only semantics

Constraints can do even better
[Figure: sample book document tree]
One wants to find a "good" relational schema to store:
  chapter(isbn, bookTitle, author, chapterNum, chapterTitle)
What is a good schema? One in normal form: BCNF, 3NF, …
• Prevents update anomalies (relational theory)
• Efficient storage, …
But how to find a normalized design?

Constraint propagation and normalization
From the given XML keys
  (//book, {isbn}), (//book, (chapter, {number})), (//book, (title, {}))
one can derive functional dependencies:
  isbn → bookTitle
  isbn, chapterNum → chapterTitle
Normalize the relation
  chapter(isbn, bookTitle, author, chapterNum, chapterTitle)
by using these functional dependencies:
  book(isbn, bookTitle), chapter(isbn, chapterNum, chapterTitle), author(isbn, author)
The new schema is in BCNF!
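The BCNF claim can be checked mechanically. The sketch below is a brute-force test written for this example only: it takes the two derived functional dependencies as given (it does not derive them from the XML keys), computes attribute closures, and enumerates every subset of a schema, which is exponential and only reasonable for tiny schemas. The function names are assumptions.

```python
from itertools import combinations

# The two functional dependencies propagated from the XML keys.
FDS = [
    (frozenset({"isbn"}), frozenset({"bookTitle"})),
    (frozenset({"isbn", "chapterNum"}), frozenset({"chapterTitle"})),
]


def closure(attrs, fds):
    """Attribute closure: everything functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_bcnf(schema, fds):
    """A schema is in BCNF if every subset X determines either nothing new
    inside the schema or the whole schema (i.e., X is a superkey)."""
    schema = frozenset(schema)
    for size in range(1, len(schema)):
        for lhs in combinations(sorted(schema), size):
            implied = closure(lhs, fds) & schema
            if implied != set(lhs) and implied != schema:
                return False
    return True


original = {"isbn", "bookTitle", "author", "chapterNum", "chapterTitle"}
decomposed = [
    {"isbn", "bookTitle"},
    {"isbn", "chapterNum", "chapterTitle"},
    {"isbn", "author"},
]

original_ok = is_bcnf(original, FDS)
decomposed_ok = [is_bcnf(s, FDS) for s in decomposed]
```

In the original relation isbn determines bookTitle without being a key, which is the violation the decomposition removes.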

Store XML Data in a Relational Database
• XML shredding
• Selective mapping from XML to relations
• Propagating constraints from XML to relations
• Querying XML views: answering XML queries directly using a relational DBMS puts XML query support within immediate reach of most commercial DBMS

XPath
XPath: essential to XQuery and XSLT
  Q ::= ε | A | * | Q/Q | Q ∪ Q | //Q | Q[q]
  q ::= Q | Q = 'c' | q ∧ q | q ∨ q | not q
• ε: empty path
• A: a tag (label)
• *: wildcard that matches any label
• /, ∪: concatenation (child), union
• //: descendants or self, "recursion"
• [q]: qualifier (filter, predicate)
• c: constant (integer)
• ∧, ∨, not( ): conjunction, disjunction, negation
Existential semantics: a qualifier holds if at least one matching node exists, e.g., class[//prereq]
QSX (LN 6)

Non-recursive XPath translation
[Figure: DTD graph of the book document]
Find the titles of all books authored by Bush: book[author = 'Bush']/title
  select title
  from book, author
  where author.bookID = book.bookID and author.author = 'Bush'
Schema:
  book(bookID, parentID, code, title: string)
  author(authorID, bookID, author: string)
  chapter(chapterID, bookID)
  ref(refID, bookID)
The translation can be done by enumerating paths in the DTD matching the XPath query
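To see the translated query run, the short Python sketch below loads a few invented tuples into an in-memory SQLite database (an assumption; any relational DBMS would do) and executes the SQL for book[author = 'Bush']/title. Only the book and author relations are needed, because title is inlined in book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (bookID INTEGER PRIMARY KEY, parentID INTEGER, code INTEGER, title TEXT);
CREATE TABLE author (authorID INTEGER PRIMARY KEY, bookID INTEGER REFERENCES book(bookID), author TEXT);
-- Invented sample data: books 1 and 2 have Bush as an author, book 3 does not.
INSERT INTO book VALUES (1, 0, 1, 'Memoirs'), (2, 0, 1, 'XML Basics'), (3, 0, 1, 'Gardening');
INSERT INTO author VALUES (10, 1, 'Bush'), (11, 2, 'Smith'), (12, 2, 'Bush'), (13, 3, 'Jones');
""")

# One join per path step that crosses a relation boundary (book -> author).
SQL = """
SELECT title
FROM book, author
WHERE author.bookID = book.bookID AND author.author = 'Bush'
ORDER BY title
"""

titles = [row[0] for row in conn.execute(SQL)]
```

The ORDER BY is added only to make the result order deterministic in this example.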

Recursive XPath over non-recursive DTD
[Figure: DTD graph of the book document]
Find the titles of all books referenced by Bush's book: book[author = 'Bush']//book/title
  select title
  from book, author
  where author.bookID = book.bookID and author.author = 'Bush'
Schema:
  book(bookID, parentID, code, title: string)
  author(authorID, bookID, author: string)
  chapter(chapterID, bookID)
  ref(refID, bookID)
• The translation can be done by enumerating paths in the DTD matching the XPath query, when
  • either the XPath query is non-recursive (no //)
  • or the DTD is non-recursive, although this is possibly expensive (exponential size)

Recursive XPath over recursive DTD
[Figure: DTD graph of the book document]
Find the titles of all books referenced by Bush's book: book[author = 'Bush']//book/title
Schema:
  book(bookID, parentID, code, title: string)
  author(authorID, bookID, author: string)
  chapter(chapterID, bookID)
  ref(refID, bookID)
• Impossible to enumerate all matching paths in the DTD: there are infinitely many
• The difficulty is the interaction between DTD recursion and XPath query recursion (//)
• Requires an extension of SQL to handle recursion

Regular XPath
[Figure: DTD graph of the book document; each edge corresponds to a relation]
Capture DTD recursion and XPath recursion in a uniform framework
• Regular XPath:
    Q ::= ε | A | Q/Q | Q ∪ Q | Q* | Q[q]
    q ::= Q | Q = 'c' | q ∧ q | q ∨ q | not q
• Operators: the child axis, Kleene closure, union
• The corresponding XPath fragment has Q//Q instead of Q*
Example:
  book[author = 'Bush']//book/title
  book[author = 'Bush']/ref/book/(ref/book)*/title

More on regular XPath
Compare regular XPath with
• Regular expressions: can we express regular XPath as a regular expression?
• XPath: is XPath properly contained in regular XPath?
• Regular XPath is more expressive than regular expressions, because of qualifiers
  • equivalence of regular expressions is in PSPACE
  • equivalence of regular XPath is EXPTIME-hard
• Regular XPath is more expressive than XPath
  • general Kleene star Q* in regular XPath
  • limited recursion // in XPath

Capture both DTD recursion and XPath recursion
[Figure: DTD graph of the book document]
Step 1: rewrite XPath queries over recursive DTDs into equivalent regular XPath queries
  book[author = 'Bush']//book/title  =>  book[author = 'Bush']/ref/book/(ref/book)*/title
• Always possible?
• Theorem. There is a computable function F that, given any XPath query Q over a (possibly) recursive DTD D, rewrites Q into an equivalent regular XPath query F(Q)
  • equivalent: for any XML tree T of D, Q(T) = F(Q)(T)
• Complexity: low polynomial if the regular XPath query F(Q) is represented as a graph (similar to finite state automata)

Translate regular XPath to SQL
Step 2: rewrite the regular XPath query Qr into an equivalent "SQL" query Qs
Equivalent: let M be the mapping from XML to relations. Then for any XML data T of the DTD D, Qr(T) = Qs(M(T))
Question: how to handle recursion in Qr?
Extension of SQL with recursion support
• SQL'99 linear recursion: supported by IBM DB2 (with recursion) but not by other commercial systems (Oracle, Microsoft)
• Fixpoint operator: supported by
  • Oracle (connect-by)
  • IBM DB2 (with recursion)
  • Microsoft SQL Server 2005 (common table expressions)

Linear recursion in SQL'99
Linear recursion operator $\Phi(R, R_1, \ldots, R_k)$:
  $S^{(0)} \leftarrow R$
  $S^{(i+1)} \leftarrow S^{(i)} \cup (S^{(i)} \bowtie_{C_1} R_1) \cup \cdots \cup (S^{(i)} \bowtie_{C_k} R_k)$
Regular XPath to SQL translation via linear recursion
• Construct the query graph G: matching paths in the DTD
• Partition G into strongly connected components
• Encode each component using a linear-recursion operator
  • R: initial part -- incoming edges to components
  • Ri: SQL query encoding an edge in a "connected component"

Translate regular XPath to SQL: SQL'99
[Figure: DTD graph of the book document]
book[author = 'Bush']/ref/book/(ref/book)*/title
  with recursive S(from, to, title) as
    (R0 union (S ⋈ temp))
  select title from S
temp:
  select ref.refID as from, book.bookID as to, book.title as title
  from ref, book
  where ref.bookID = book.bookID
Schema:
  book(bookID, parentID, code, title: string)
  author(authorID, bookID, author: string)
  chapter(chapterID, bookID)
  ref(refID, bookID)

Translate regular XPath to SQL: SQL'99 (cont.)
[Figure: DTD graph of the book document]
book[author = 'Bush']/ref/book/(ref/book)*/title
  with recursive S(from, to, title) as
    ((select B1.bookID as from, B2.bookID as to, B2.title as title
      from book B1, author A, ref, book B2
      where A.bookID = B1.bookID and A.author = 'Bush'
        and B1.bookID = ref.refID and ref.bookID = B2.bookID)
     union
     (select S.from, temp.to as to, temp.title as title
      from S, temp
      where S.to = temp.from))
  select title from S
Schema:
  book(bookID, parentID, code, title: string)
  author(authorID, bookID, author: string)
  chapter(chapterID, bookID)
  ref(refID, bookID)
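For readers who want to run a recursive translation, here is a hedged Python sketch using SQLite's WITH RECURSIVE (an assumption; the lecture targets DB2, Oracle and SQL Server). It differs from the slide's SQL in two labeled ways: the columns are named src and dst because FROM and TO are reserved words, and the join conditions follow the key and foreign-key reading of the inlined schema, where ref.bookID is the citing book and a cited book has code = 0 and parentID = ref.refID. The sample tuples are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (bookID INTEGER PRIMARY KEY, parentID INTEGER, code INTEGER, title TEXT);
CREATE TABLE author (authorID INTEGER PRIMARY KEY, bookID INTEGER, author TEXT);
CREATE TABLE ref (refID INTEGER PRIMARY KEY, bookID INTEGER);
-- Book 1 (by Bush) cites book 2, which cites book 3; book 4 is unrelated.
INSERT INTO book VALUES (1, 100, 1, 'A'), (2, 20, 0, 'B'), (3, 21, 0, 'C'), (4, 100, 1, 'D');
INSERT INTO author VALUES (10, 1, 'Bush'), (11, 4, 'Smith');
INSERT INTO ref VALUES (20, 1), (21, 2);
""")

QUERY = """
WITH RECURSIVE
  -- temp encodes one ref/book step: citing book -> cited book.
  temp(src, dst, title) AS (
    SELECT ref.bookID, book.bookID, book.title
    FROM ref JOIN book ON book.parentID = ref.refID AND book.code = 0
  ),
  -- S starts from books authored by Bush (the R0 part) and is extended
  -- one ref/book step at a time (linear recursion: S appears once).
  S(src, dst, title) AS (
    SELECT temp.src, temp.dst, temp.title
    FROM temp JOIN author ON author.bookID = temp.src
    WHERE author.author = 'Bush'
    UNION
    SELECT S.src, temp.dst, temp.title
    FROM S JOIN temp ON S.dst = temp.src
  )
SELECT title FROM S ORDER BY title
"""

titles = [row[0] for row in conn.execute(QUERY)]
```

UNION (rather than UNION ALL) removes duplicate tuples, so the recursion stops once no new (src, dst, title) triples are produced.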

Fixpoint operator $\Phi(R)$
[Figure: DTD graph of the book document]
Single input relation:
  $S^{(0)} \leftarrow R$
  $S^{(i+1)} \leftarrow S^{(i)} \cup (S^{(i)} \bowtie_{C} R)$
R:
  select ref.refID as from, ref.bookID as to, title as title
  from ref, book
  where book.bookID = ref.bookID
$\Phi(R)$:
  select from, to, title
  from R
  connect by from = prior to and prior from in R0
R0:
  select ref.refID as from, B2.bookID as to, B2.title as title
  from ref, book B1, author A, book B2
  where A.bookID = B1.bookID and A.author = 'Bush'
    and B1.bookID = ref.refID and ref.bookID = B2.bookID

Region index approach
• Each node v in an XML tree carries a pair (order, range), where
  • order: the position of v in the preorder traversal of the tree
  • range: the number of descendants of v
• Node v is an ancestor of w iff
  • order_v < order_w
  • order_w ≤ order_v + range_v
• An efficient method to evaluate //
Problem: updates are expensive, since the annotation must be maintained when inserts/deletes are carried out
Research: translating XPath to SQL by using the region index
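A small Python sketch of the labeling may help. It is an illustration only: the function names and the sample tree are assumptions. Because range counts all descendants of v, the descendants of v occupy exactly the preorder positions order_v + 1 through order_v + range_v, so the upper bound in the ancestor test is taken as inclusive here.

```python
import xml.etree.ElementTree as ET

DOC = "<db><book><title/><author/></book><book><title/></book></db>"


def annotate(root):
    """Return {element: (order, range)}: preorder position and number of descendants."""
    labels = {}
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        order = counter
        # Each recursive call returns the size of that child's subtree.
        descendants = sum(visit(child) for child in elem)
        labels[elem] = (order, descendants)
        return descendants + 1

    visit(root)
    return labels


def is_ancestor(v, w):
    """v, w are (order, range) pairs; True iff v's node is a proper ancestor of w's."""
    order_v, range_v = v
    order_w, _ = w
    return order_v < order_w <= order_v + range_v


root = ET.fromstring(DOC)
labels = annotate(root)
book1, book2 = list(root)
title1, author1 = list(book1)
```

With such labels stored as columns, a // step becomes a range comparison between two tuples instead of a recursive traversal.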

Store XML Data in a Relational Database
• Commercial systems

Commercial system: MS SQL Server 2005
• CLOB (character large objects), XML data type
• Annotated schema (XSD): fixed tree templates
  • nonrecursive schema
  • associate elements and attributes with table and column names
Core-dumping entire elements or documents
• XQuery: query(), value(), exist(), nodes(); binding relational data
  • combine INSERT with nodes(), value(), XPath
• OPENXML: access to XML data as a relational rowset
Selective shredding, limited recursion; can't store the entire document in a single pass

MS SQL Server 2005 (cont.)
• Query support:
  • SQL
  • SQL extensions for publishing results as XML (FOR XML clause)
• Summary:
  • supports both entire-document and selective shredding
  • able to incrementally update existing tables
  • nonrecursive schema
  • does not support context-dependent tuple construction: to construct (A, B), one cannot extract attribute B based on the extracted value of A (information passing)
  • does not support entire-document shredding and selective shredding in a uniform framework

Commercial system: IBM DB2 XML Extender
• XML Columns: CLOBs + side tables for indexing individual elements
• User-defined mapping through DAD (Document Access Definition): a fixed XML tree template (nonrecursive)
  • SQL mapping: template-based language to publish relational data as XML (to be discussed in the next lecture)
  • RDB node mapping: for both publishing and storing XML (associating elements and attributes with tables and columns)
• XML Collections: declarative decomposition of XML into multiple tables
• Data loading: follows the DAD mapping

Commercial system: Oracle 10g XML DB
• Store XML data in CLOB (character large objects) or tables
• Canonical mapping into object-relational tables
  • tag names are mapped to column names
  • elements with text only map to scalar columns
  • elements with sub-elements map to object types
  • lists of elements map to collections
• Indexing: standard relational
Core-dumping entire elements
• Annotated schema: recursive

Oracle 10g (cont.)
• Query support:
  • CLOBs: SQL + Oracle Text; XPath
  • Canonical: SQL
• Summary:
  • supports both entire-document and selective shredding
  • supports recursive schema
  • cannot insert into existing tables (DB2)
  • does not support context-dependent tuple construction

Summary and review
• Why store and query XML data via a relational DBMS?
• Why study constraint propagation from XML to relations?
• Understand the shared inlining algorithm
• How to selectively map XML data to relations? Tuple construction via information passing? Combining shredding and parsing
• Is it possible to translate all XQuery queries to SQL? If not, is there a general approach to querying XML data stored in relations?
• Can one update XML data by making use of a relational DBMS? How?
Exercise:
• Consider projects related to the topic, and start the project early!
• Survey projects: individual, commercial products (Microsoft SQL Server 2005, Oracle 10g, IBM DB2 XML Extender)
