520 likes | 536 Views
Learn about XML integration in database systems including storing, querying, and transforming data. Understand storage and publication methods, schema derivation, and relational representations. Explore the big picture applications and practical approaches in XML integration for databases.
E N D
Universal Database Systems Part 4: Databases and XML
Overview • Introduction to XML • DTDs and Schemas for XML Documents • Languages for XML, in particular XSL • Querying and Storing XML • Summary and Outlook UDBS Part 4 -Winter 2001/2
What Else Makes XML Relevant to Databases? • Document storage, retrieval, and indexing • Views and data warehouses • Transformations between relational data representations and XML • Mediators, data integration • Schemaaspects, integrity constraints, design, reverse engineering UDBS Part 4 -Winter 2001/2
Big Picture application application object-relational Integrate XML Data WEB (HTTP) Transform Warehouse application relational data legacy data UDBS Part 4 -Winter 2001/2
What Do We Look At? • Storage of XML documents • Publication of XML document • Sample vendor approaches UDBS Part 4 -Winter 2001/2
Storage vs Publication • Storage • Where and how to store native XML documents • Here: Storage in relational DBs • Publication • Wrap arbitrary content in XML documents • Here: Publication of relational content UDBS Part 4 -Winter 2001/2
Storing XML Data • Scenario: • receive a large XML data instance • want to store, manage it, query it • Solutions: • build an XML management system from scratch (e.g., Tamino) • preferably: use existing database systems • The Storage Problem:map XML data into relational UDBS Part 4 -Winter 2001/2
Approaches • Table-oriented: • As a column value • Across multiple tables • As a BLOB or CLOB • With appropriate (object) functionality, e.g., • Extender • Data Blade UDBS Part 4 -Winter 2001/2
Options • XML document as a text file (or CLOB) • Using ternary relations • Using a DTD or XML Schema for deriving adatabase schema • Alternatively: derivation of aschema throughmining from the data UDBS Part 4 -Winter 2001/2
TextFile/CLOB • Advantages • simple • less space than you think • reasonableclustering • Disadvantages • noupdates • needs specific queryprocessor UDBS Part 4 -Winter 2001/2
Ref Source Label Dest &o1 paper &o2 &o2 title &o3 &o2 author &o4 &o2 author &o5 &o2 year &o6 Val Node Value &o3 The Calculus &o4 … &o5 … &o6 1986 Ternary Relation(edge-oriented) &o1 paper &o2 year title author author &o3 &o4 &o5 &o6 "The Calculus" "…" "…" "1986" Order, attributes, comments? UDBS Part 4 -Winter 2001/2
Other Relational Representations • More detailed edge representation, e.g.,(Source, Dest, Name, Type, Ordinal)with additional tables for all types • Universal relation for storing edges(as full outer join of all ternary relations as above) • Separate value tables per type • Inline representation of the various types, e.g., (Source, Dest, Name, Int, String, Ordinal)with null values for the non-applicable types UDBS Part 4 -Winter 2001/2
Derive Schema from DTD (1) • DTD: <!ELEMENT bib (paper*)> <!ELEMENT paper (author*,title,year)> <!ELEMENT author (firstname, lastname)> • Relational Schema: • Paper(pid, title, year) • Author(aid, fn, ln) • PaperAuthor(pid, aid) • Sometimes this is poor. E.g. • 80% of papers have <= 2 authors • 18% have 3 authors • 2% have 4 or more… UDBS Part 4 -Winter 2001/2
Derive Schema from DTD (2) • DTD: • ODMG classes: • <!ELEMENT employee (name, address, project*)> • <!ELEMENT address (street, city, state, zip)> • class Employee publictypetuple • (name:string, address:Address, project:List(Project)) • class Address publictypetuple (street:string, …) UDBS Part 4 -Winter 2001/2
Storage vs Publication • Storage • Where and how to store native XML documents • Publication • Wrap arbitrary content in XML documents • Here: Publication of relational content • Current business data is relational data • Scalability, reliability, performance UDBS Part 4 -Winter 2001/2
XML Documents publish transform Web Server(XSL) HTML Publication Scenario RelationalDatabase UDBS Part 4 -Winter 2001/2
SUPPLIERS SNO SNAME PARTS CATALOG PNO DESCRIP SNO PNO PRICE Pragmatic ("Rowset") Approach • Tablesrepresented as simple XML trees: • table = root • each row becomes a nestedelement • each value becomes another nested element <suppliers> <s_tuple> <sno> <sname> <parts> <p_tuple> <pno> <descrip> <catalog> <c_tuple> <sno> <pno> <price> UDBS Part 4 -Winter 2001/2
Pragmatic ("Rowset") Approach (2) • No "natural" XML • No nesting, no hierarchies, no mapping for (foreign) keys • May use XSLT to obtain "real" XML • Better: Publish structured documents UDBS Part 4 -Winter 2001/2
Publish XML Documents Following presentation based on • Jayavel Shanmugasundaram et al:Efficiently Publishing Relational Data as XML Documents, VLDB 2000 Two issues • Language for conversion • Flat relational data to nested XML • Implementation of conversion • Efficient conversions UDBS Part 4 -Winter 2001/2
Sample Conversion Language • Relational data SQL extension • Natural extension UDFs • More specific • Nesting through subqueries • UDFs to construct XML elements/attributes from SQL data • Aggregate functions to group children UDBS Part 4 -Winter 2001/2
Department Employee DeptId DeptName DeptId EmpName Salary EmpId 10 John 50K 101 Purchasing 10 91 10 Mary 70K Project DeptId ProjName ProjId 10 Internet 888 795 10 Recycling Sample Relational Database Task: Publish single XML document with information on departments (including their employees and projects) UDBS Part 4 -Winter 2001/2
Publication Query – Structure Select DEPT(d.name, <subquery to produce emplist>, <subquery to produce projlist> )From Department d UDBS Part 4 -Winter 2001/2
XML Constructor Create FunctionDEPT(dname: varchar(20), emplist: xml, projlist: xml) As( <department name={dname}> <emplist> {emplist} </emplist> <projlist> {projlist} </projlist></department> ) UDBS Part 4 -Winter 2001/2
Publication Query Select DEPT(d.name, (Select XMLAGG(EMP(e.name)) From Employee e Where e.deptno = d.deptno), (Select XMLAGG(PROJ(p.name)) From Project p Where p.deptno = d.deptno))From Department d Select DEPT(d.name, <subquery to produce emplist>, <subquery to produce projlist> )From Department d UDBS Part 4 -Winter 2001/2
Query Result <department name="Purchasing"> <emplist> <employee> John </employee> <employee> Mary </employee> </emplist> <projlist> <project> Internet </project> <project> Recycling </project> </projlist> </department> UDBS Part 4 -Winter 2001/2
Implementation of Conversion • Two main differences: • Nesting (structuring) • Tagging • Space of alternatives: Early Tagging Late Tagging Outside Engine Outside Engine Early Structuring Inside Engine Inside Engine Outside Engine Late Structuring Not applicable Inside Engine UDBS Part 4 -Winter 2001/2
Options • Late vs early tagging • Late tagging: final step of query processing • Early tagging: otherwise • Late vs early structuring • Late structuring: final step of query processing • Early structuring: otherwise • Inside vs outside engine • Inside: completely inside db engine • Outside: otherwise(ignored in following) UDBS Part 4 -Winter 2001/2
Early Tagging, Early Structuring, Outside Engine:Stored Procedure Approach • Issue queries for sub-structures and tag them • Could be a Stored Procedure (10, Purchasing) DBMS Engine Department (Internet) (Recycling) (John) (Mary) Employee Project • Problem: Too many SQL queries! UDBS Part 4 -Winter 2001/2
Early Tagging, Early Structuring, Inside Engine:Correlated CLOB Approach • Query seen above (with UDFs to create XML) • Problem: Correlated execution of sub-queries UDBS Part 4 -Winter 2001/2
Early Tagging, Early Structuring, Inside Engine:De-Correlated CLOB Approach • Compute employee lists associated with all departments • Compute project lists associated with all departments • Join results above on department id • Problem:CLOBs during query processing UDBS Part 4 -Winter 2001/2
(10, John) (10, Mary) (Purchasing, John, Internet) (Purchasing, John, Recycling) (Purchasing, Mary, Internet) (Purchasing, Mary, Recycling) (10, Purchasing) (10, Internet) (10, Recycling) Late Tagging, Late Structuring: Redundant Relation Approach • How do we represent nested content as relations? • Problem: Large relation due to data redundancy! UDBS Part 4 -Winter 2001/2
Employee Department Department Employee Project Project Union Late Tagging, Late Structuring:Outer Union Approach • How do we represent nested content as relations? (Purchasing, null, Internet , 0) (Purchasing, null, Recycling, 0) (Purchasing, John, null , 1) (Purchasing, Mary, null , 1) (Purchasing, John) (Purchasing, Mary) (Purchasing, Internet) (Purchasing, Recycling) (10, Purchasing) • Problem: Wide tuples (having many columns) UDBS Part 4 -Winter 2001/2
Late Tagging, Late Structuring:Hash-based Tagger • Results not structured early • In arbitrary order • Tagger has to enforce order during tagging • Hash-based approach • Inside/Outside engine tagger • Problem: Requires memory for entire document UDBS Part 4 -Winter 2001/2
A Late Tagging, Early Structuring:Sorted Outer Union Approach A B n D nn n A B n n E n n B C A n C n n F n D E F G A n C n nn G Sort By: Aid, Bid, Cid • Problem: Only partial ordering required UDBS Part 4 -Winter 2001/2
Late Tagging, Early Structuring:Constant Space Tagger • Detects changes in XML document hierarchy • Adds appropriate opening/closing tags • Inside/outside engine UDBS Part 4 -Winter 2001/2
Performance of Alternatives • Constructing XML inside engine more efficient than outside • When processing can be done in main memory: Late tagging, late structuring with outer union • Otherwise: Late tagging, early structuring with sorted outer union UDBS Part 4 -Winter 2001/2
Sample DB Vendors and XML • Oracle 9i • IBM DB2 UDB V7 UDBS Part 4 -Winter 2001/2
Oracle – CLOB • Datatype XMLType (internally a CLOB) • Predefined functions • createXML: creates XMLType-instance from string (if well-formed) • extract: applies XPath expression to XMLType-instance and returns XMLType • existsNode: checks whether XMLType-instance has non-empty result for given XPath expression UDBS Part 4 -Winter 2001/2
Oracle – Generate XML • Functions • SYS_XMLGEN: takes single argument and converts it to an element • SYS_XMLAGG: concatenates XML fragments • Utility XSU (XML SQL Utility): Implements "Rowset" approach UDBS Part 4 -Winter 2001/2
DB2 – CLOB • Three datatypes • XMLCLOB (outside table) • XMLVARCHAR (inside table) • XMLFILE (external file) • Checks against DTD possible • PAGE tables? Via DADs UDBS Part 4 -Winter 2001/2
Basic Approach • XML document is stored • completely (in a column of type XML, or an "XML column") or as file reference, or • in multiple tables as result of a mapping (an as "XML collection") • A Document Access Definition (DAD) specifies how XML documents are stored and published, how XML maps to tables and vice versa UDBS Part 4 -Winter 2001/2
DB2 XML document <?xml?> <!DOCTYPE ...> <Order key = "1"> </Order> XMLCLOB/XMLVARCHAR XML Column UDTs: - XMLCLOB- XMLVARCHAR- XMLFILE • UDFs: • Import/Storage- Retrieval- Extract- Update UDBS Part 4 -Winter 2001/2
Legend • XMLFile for external file names XMLVarchar for internal short documents XMLCLOB for internal long documents • Extract • Extracts XML element/attribute values from documents • Converts values from XML documents into SQL data types • Provides scalar as well as tabular UDFs UDBS Part 4 -Winter 2001/2
Example Select db2xml.extractDouble(Order, ‘/order/part/extendedPrice‘)from OrderTablewhere InvoiceNumber = 356 UDBS Part 4 -Winter 2001/2
XML Collection DB2 XML document <?xml?> <!DOCTYPE ...> <Order key = "1"> </Order> • Stored Procedures • Composition- Decomposition DAD Collection UDBS Part 4 -Winter 2001/2
Sample DAD <DAD> <Xcollection> <SQL_stmt> SELECT book_id, price_date, price_text FROM book_table ORDER BY price_date </SQL_stmt> <doctype> <root_node> . . . . . </root_node> </doc_type> </Xcollection> </DAD> UDBS Part 4 -Winter 2001/2
Overview • Introduction to XML • DTDs and Schemas for XML Documents • Languages for XML, in particular XSL • Querying and Storing XML • Summary and Outlook UDBS Part 4 -Winter 2001/2
Discussion • Do XML documents have to be stored directly in the database? • XML documents are highly redundant (from a database perspective) • The efficiency of a relational system (partially) comes from normalization • Compromise: XML as an "intermediary" between the database and, say, a Web server UDBS Part 4 -Winter 2001/2
Summary • XML asks for query languages, database-style • Database vendors experiment with • XML extensions • architectures • languages • internal data models • Many open issues, e.g., • Graphical query languages • Updates ACM SIGMOD 2001 • Views defined in the query language • Referential integrity, triggers, rules • Distributed XML storage systems UDBS Part 4 -Winter 2001/2
Outlook for DBMS • XML is and important database topic(both for practitioneers and for theoreticians) • Declarative querying SQL-style is attractive • Will there be a renaissance of hierarchical DBMS? • Workshop: WebDB, annually,German counterpart as GI-Arbeitskreis "Web und Datenbanken" • Initiatives found on the Web:XML:DB and Xindice UDBS Part 4 -Winter 2001/2