750 likes | 907 Views
XML In An RDBMS World. Michael D. Thomas mdthomas@ibiblio.org. Data, data, data. In general, computing is never far from data XML, RDBMSes deal with data – play in the same very large playground Extensive overlap between RDBMSes and XML
E N D
XML In An RDBMS World Michael D. Thomas mdthomas@ibiblio.org
Data, data, data • In general, computing is never far from data • XML, RDBMSes deal with data – play in the same very large playground • Extensive overlap between RDBMSes and XML • Goal: Learn how to choose between RDBMS vs. XML, how to integrate the two technologies
First, some pragmatism… • “When all you have is a hammer, everything looks like a nail.” • What if all you have is a hammer? • “When all you have is a hammer, everything must look like a nail.” • Best to have XML and RDBMS in your toolset • Architectural purity is often impossible • But we should know when & what we are compromising
Tonight’s Agenda • Compare XML and SQL RDBMS Technologies • Understand Their Sweet Spots • Examine Their Use Through Studies of Anti-Patterns and Patterns Of Usage • Discuss XQuery and XML Storage In The RDBMS
Topics • Defining The Playground – XML vs. RDBMS • Anti-Pattern: XML As Transactional Data Store • Data Stores & The ACID Test • Anti-Pattern: Document Storage & The RDBMS • XQuery & RDBMS Doc Storage • Anti-Pattern: XML Declarative Languages & Gratuitous Encapsulation Of SQL • Understanding Declarative Languages, Such As SQL • Anti-Pattern: XML Meta-data & The Overly Abstract Database • Managing The Split Between XML-based & RDBMS-based Data Models In The Same Application
XML As Database Anti-Pattern • “I’ll use an XML file as a database” • Advantages: Simple, Cheap • An XML Schema defines the metadata of the database • Data is right on the filesystem, not “hidden” in a complicated SQL RDBMS
XML As Database Problems • Updates are hard • Have to re-invent concurrency handling – multiple users changing data at the same time • No constraint checking • No optimization, such as indexes, caching, etc.
Organization of Data • Been formalizing the storage information for a long time • Age of computing called for new approaches • Relational model highly successful • XML is relatively new • Good for loosely structured data – documents -- and data transmission
Transmission & Interoperability • Electronic transmission is newer than electronic storage • With the growth of the Internet, transmission of data is exploding • No Interoperability problems in 1950 – lots of interoperability problems now • XML is an important standard for transmission and interoperability
Vocabulary • Datastore – Anything that stores data • SQL Relational Database – A database that organizes info in tables and adheres to relational theory • XML – 1) eXtensible Markup Language 2) The XML standard 3) All or part of an XML document
Vocabulary • XML Database – A database specialized for the storage of XML documents • Object Database – A database that stores objects • Object-Relational Database – A relational database with an extensible type system
Data Centric vs. Document Centric • Data Centric approach – the datastore is focused on handling highly structured, fine grained data. Favors the relational model • Document centric approach – the datastore is focused on handling semi-structured data, such as web pages, books, etc. Favors XML
Data Centric vs. Document Centric • Best queries are largely a result of structuring the data well (Messy desk vs. organized file cabinet) • Documents are semi-structured data with ad hoc structures • The overhead of defining rigorous structure for each type of document increases the overall cost of management • (Would still give you the best queries)
Storage vs. Transmission • Data Storage and Data Transmission are two different concepts • XML is very strong for interoperable transmission • Can store by writing XML to a file • You can ‘transmit’ relational data by exporting a few tables and ftping, but isn’t a strong solution
Three Data Models • Persistent Model – how data is stored • Active Model (Object Model) – how data is arranged when it is being manipulated by a program, usually written in an imperative language • Presentation/Transmission Model – how data is transmitted, usually as XML
Datastore Basics • Any Datastore must tackle the following issues: • Concurrency • Transactions – the ACID test • Locking • Joins • Normalization • Administration Issues
Concurrency • Datastores support concurrency if multiple users can access the same datastore at the same time • Datastores must not allow the same data to be modified at the same time
Transactions • Atomicity – No matter how complex, a transaction is atomic and indivisible. Transactions are “all or nothing.” • Consistency – Transactions must leave the database in a consistent state, i.e., consistent with the rules • Isolation – Transaction is isolated from other transactions • Durability – The effects of a transaction persist
Isolation levels • TRANSACTION_READ_UNCOMMITTED – Dirty reads, non-repeatable reads, phantom reads • TRANSACTION_READ_COMMITTED – Non-repeatable reads, phantom reads • TRANSACTION_REPEATABLE_READ – phantom reads • TRANSACTION_SERIALIZABLE
Locking • Data must be locked for transactions to be isolated • Locking is both a datastore and an application concern • How much extraneous data is locked? (Page level locking, document level locking) • Pessimistic locking: prevents reading of locked data • Optimistic locking: generates an error when inappropriate data updates are attempted
Joins • A Join joins the data between two different data entities • E.g., SELECT * from emp, dept WHERE emp.deptno = dept.deptno • Joins are the cornerstone of SQL • XPath doesn’t do joins between XML documents! • XQuery, others can
Normalization • Normalization – organizing data to minimize redundancy • Normalized data is easier to maintain and easier to understand conceptually • In relational db design, normalized DBs need to be denormalized for performance reasons • XML docs can also be normalized, but not as much support for tying elements together • Normalization is important when designing, less crucial at implementation
Administrative Issues • Includes backup & recovery, installation, upgrades, optimization, maintenance, etc. • In general: • Bigger is better (economies of scale, 24x7 support) • More popular, more standard is better (law of increasing returns) • Whatever is already working in your organization is better (no need to hire, re-train administrators) • Existing DB vendors have a huge advantage
Datastores • Different types: • Relational Database – highly structured data such as account balances, inventory quantities, etc.) • Document Database – used to store documents, probably in XML format • The same DB can serve as both, e.g., Oracle
Overview Of Relational Databases • The Relational Model • SQL • Entity Relationship Diagrams
Relational Model -- structure • Data is grouped in tables • Tables have columns and rows • Columns are fairly fixed – the set of columns shouldn’t change much (if at all) over the life of a table • A table can have any number of rows • Rows change constantly
Relational model – primary key • In general, a table should have one or more columns defined as the primary key • The primary key is unique and non-null • Usually only one column • Can consist of more than one column (composite primary key)
Relational model – foreign key • A foreign key describes a relationship between two tables • The foreign key column of tableA points to a column in tableB • TableB is said to be the parent of tableA • Often, the foreign key points at a primary key
SQL Three types: • Query SELECT * FROM emp WHERE deptno=10 ORDER BY ename • Data Modification Language (DML): Update, Insert • Data Definition Language (DDL): Create Table
Joins • A Join is a Cartesian Cross-Product: SELECT count(*) FROM emp; SELECT count(*) FROM emp; SELECT count(*) FROM emp, dept; SELECT ename, emp.deptno, dept.deptno, dname FROM emp, dept; SELECT ename, emp.deptno, dept.deptno, dname FROM emp, dept WHERE emp.deptno = dept.deptno;
Joins SELECT ename, dname FROM emp, dept WHERE emp.deptno = dept.deptno
Entity Relationship Diagrams • Used to create a map of your data • Describes tables, attributes of tables and relationships between tables • Come at it from two directions: lay out the entities, assign the attributes; group the attributes into entities
XML vs. SQL Tables • XML – order matters! (Rows in a relational table are unordered) • XML is a tree structure • XML documents tend to be semi-structured, SQL tables are highly structured • SQL tables aren’t as flexible or inter-changable • XML joins aren’t straightforward • An XML document/element doesn’t serve multiple purposes as well as DB schema/table
Anti-Pattern: XML Document Storage As A SQL BLOB • Need to store XML in a database • Better than storing in a filesystem • Make a BLOB (Binary Large Object) and store a document as an element in a row of a table • Problem: can’t query contents of document without extracting it from the db! • For simple queries, might have to extract all of the documents into the application – very inefficient
Anti-Pattern: One-off SQL Schema For A Particular XML Document • BLOB storage is bad, so why not “shred” the document across multiple DB tables? • I.e., for XML elements named “Dept” make a “Dept” table, “Emp” elements make a “Emp” table, describe the hierarchy with foreign key constraints • Query performance is much, much better – probably better than XPath against XML as a file
Problems • A lot of work! Have to do this for every XML schema • Hard! XML schemas are inherently more flexible than SQL schemas. Some mappings can be difficult • Negates flexibility of XML • Not as learnable – Programmers have to learn to query your SQL schema, not just XPath and XML Schema
Anti-Pattern: Developing A General “Shredding” Solution For All Of XML • Your Application has several XML schemas, and it’s time-consuming to develop “shredding” for all of them • So, you try to do a more generalized shredding • Is possible, but is a very horizontal problem. You probably won’t get the time to solve it completely. • Still might present learnability problems
XML Datastore Architectures • XML Views Of Relational Data • Relational Wrappers Of XML • Independent Storage of XML Documents (Native XML Database) • Text Storage Of XML In RDBMS • XML Shredding Across Relational Tables • Storing XML as Objects in Object- Relational DBs (Oracle XMLType) • Everything is XML (XQuery approach)
Vocabulary: XML Collections • XML Collection is a collection of XML documents • “A row is to a table as an XML document is to an XML collection.”
XML Derived From RDBMS • Data exists naturally as relational data • Needs to be represented as an XML document for some reason • The derived XML is usually used for transmission
Independent XML Store • A specialized database is used to store XML documents • Typically, an application will either have two datastores – relational and XML – or relational data will be stored as XML • In the dual datastore case, the application code has to join the different data sets
Text Storage Of XML In Databases • XML is stored in a column of a relational table • Allows you to mix structured and semi-structured approaches • However, hard to query against the XML directly
XML Shredding Across Tables • An XML document is stored across many tables • Is possible to use SQL queries against the document’s parts • Vendors can implement an XPath-to-SQL translator • Can structure the tables based on a schema • A pain to handle yourself
Exercise • Shred an XML schema across relational tables
XMLType • Object-Relational databases allow you to define your own types • You could define an Address object, define functions for the object, and store instances of the object in a column in the database • Oracle defines an object-relational type, XMLType, for the storage of XML documents in the database • SELECT e.poDoc.getClobval() AS poXML FROM po_xml_tab e WHERE e.poDoc.existsNode('/PO[PNAME = "po_2"]') = 1; • Shredding is managed & encapsulated for you
Everything Is XML (XQuery) • With XMLType, Oracle says that everything fits in to the Object-Relational realm • XQuery says that everything can be represented and queried as XML • Relational data is derived from RDBMS using SQL/XML