Management of XML Documents without Schema in Relational Database Systems

Management of XML Documents without Schema in Relational Database Systems Workshop Objects, <XML> and Databases OOPSLA 2001, Tampa Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics

Overview • Introduction • Motivation • Main Issues • Structure-Oriented Approach • Storage / Data Model • Queries • Evaluation • Opaque Approach • Storage • Queries (vs. XPath) • Evaluation • Prototype Implementation • Interface • Experience • Outlook

Motivation • XML is used in: • data publishing (document-centric documents) • data exchange (data-centric documents) • Why XML Documents without Schema? • generated by programs • mostly data-centric documents, e.g., account statements • high update frequency of the document structure evolving schemas • Problems • How to deal with XML documents without DTD / XML Schema in databases? • Evaluate approaches • Use relational database systems • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Main Issues • Evaluate Storage and Retrieval Methods • no defined document schema • platform: Oracle 8i • XML-to-Relational Mapping Approaches • structure-oriented decomposition • opaque approach • Identify Parameters of a Unified XML-DB Interface • Implement a Testbed • qualitiative assessment • performance of both approaches • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Structure-Oriented Approach • Characteristics • decomposition of an XML document into smaller units (elements) • depends on the document structure only • target system: relational DBMs, generic schema • Variety of Mapping Methods • model XML document als directed ordered labeled graphs and map them to tables • proposed algorithms: • edge tables [Florescu, Kossmann] • universal table • inlining techniques [Shanmugasundaram et.al.] • model-based fragmentation • Monet XML-model • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Storing XML Data in Relations • XML-QL Data Model <tree> <person age=’55‘> <name>Peter</name> </person> <person age=’38‘> <name>Mary</name> <address>Fruitdale Ave. </address> </person></tree> • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Data Model 1 1 n n 1 1 1 1 • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook 0/1 0/1 0/1 0/1

Import Algorithm <tree> <person age=“36“> <name>Peter</name> <address> <street>Main Road 4</street> <zip>04236</zip> <city>Leipzig</city> </address> </person> </tree> • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Query Processing • Query Language • XML query language most appropriate • data model of our solution is based on XML-QL XML-QL preferred choice • XML-Relational Mismatch • relational DBMS “understands“ SQL only  requires translation from XML-QL to SQL • generate result document from the tuples retrieved • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Query Processing XML-QLQuery Parser ObjectStructure GenerateSQL Statement SQLStatement DB ExecuteSQL Statement • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook ConstructResult Document Row Set XMLDocument

XML-QL Query CONSTRUCT <result> { WHERE <person> <name>$n</name> <address>$a</address> </person> CONSTRUCT <person> <name>$n</name> <address>$a</address> </person> } </result> SQL Statement SELECT DISTINCT B.Type AS n_Type, B.TargetId AS n_TargetId, B.Depth AS n_Depth, C.Value AS n_Value, D.Type AS a_Type, D.TargetId AS a_TargetId, D.Depth AS a_Depth, E.Value AS a_Value FROM tblEdge A,tblEdge B,tblLeafs C, tblEdge D,tblLeafs E WHERE (A.EdgeName = ‘person’) AND (A.TargetId = B.SourceId) AND (B.EdgeName = ‘name’) AND (B.LeafId = C.LeafId(+)) AND (A.TargetId = D.SourceId) AND (D.EdgeName = ‘address’) AND (D.LeafId = E.LeafId(+)) Generate an SQL-Statement • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Construct Result Document • Result Tuple • Subtree Reconstruction SELECT A.EdgeName, A.Type, Al.Value AS A_LeafVal, Aa.Value AS A_AttrVal FROM tblEdge A, tblLeafs Al, tblAttrs Aa WHERE A.SourceId=5 AND A.leafId=Al.leafId(+) AND A.attrId=Aa.attrId(+) • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Query Result • XML Result Document <result> <person> <name>Peter</name> <address> <street>Main Road 4</street> <zip>04236</zip> <city>Leipzig</city> </address> </person> </result> • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Advantages • Vendor Independence • no specific DBMS features needed • Stability • High Flexibility of Queries • retrieve and update single values • full SQL functionality can be used • Well-Suited for Structure-Oriented Queries • structures are represented in tables • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Drawbacks • Information Loss • Comments • Processing Instructions • Prolog • CDATA Sections • Entities • Restrictions • only one text (content) per element <element> Text1 <subelement/> Text2  lost </element> • element text as VARCHAR(n); n <= 4000 • Increased Load Time • sample document: 3.3. MB, 130.000 tuples, 13 minutes • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Opaque Approach • Characteristics • XML document stored as Large Object (LOB) • document completely preserved • Storage Insert into tblXMLClob values (1,‘person.xml‘,‘ <person> <name>Mary</name> </person>‘ ); • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Oracle interMedia Text • Query Facilities of interMedia Text • full-text retrieval (word matching only) • path expression only together with content search • no range queries • Example in interMedia Text: SELECT DocId FROM tblXMLClob WHERE CONTAINS(content,‘(Mary WITHIN name) WITHIN person‘)>0 • XML Full-Text Index • Autosectioner Index • XML Sectioner Index • WITHIN operator • text_subquery WITHIN elementname • searches the entire text content of the named tag • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

return document IDs word matching (default) no existence test for elements or attributes restricted set of path expressions using WITHINe.g.: (xml WITHIN title) WITHIN book provides limited attribute value searches, no nesting of attribute searches numeric and data values not type-converted no range searches on attribute values return document fragments substring matching search for existing elements or attributes path expressions  structure-oriented queries//Book/Title/[contains(..‘xml‘)] searches for attribute values and element text can be combined considers also decimal values range searches possible using filters Comparison of Queries interMedia Text XPath  • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Search Document IDs of all CLOBs of the XML Table Execute XPath Query on the DOM tree for each CLOB Objects of a Document ID DB XPath Query Doc IDs server-side DocIDs with Result XML Documents XPath Queries with PL/SQL • Prerequisite • XDK for PL/SQL installed on the server • Parse CLOB into DOM representation • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Advantages • Information Preservation • Handling of Large Documents • appropriate for document-centric documents with little structure and prose-rich elements • Different XML Document APIs • interMedia Text: restricted set of XPath functionality • generate a DOM of the document before using XPath queries • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Drawbacks • Restricted Expressiveness of Text Queries • Performance vs. Accuracy of Query Results • interMedia Text queries on CLOBs faster than the DOM-API • sample document: 12.5 MB, parse time 3 min, load time 5 min • Restrictions of Indexes • maximum tag names for indexing (incl. namespace) 64 bytes • Problems with Markup • character entities • Vendor Dependence • text engines are proprietary, e.g., Oracle interMedia • Stability • maximum document size 50 MB • memory errors may occur with smaller documents • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

User Interface • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Middleware / XML Interface Query Method StorageMethod ExportMethod DeleteMethod DocumentList Transf. to SQL Transf. to XML t n S s S t e e n Q Q m e m L L S m e a Q t S e N D a t L e Row Set a l t c e I e t S l o n S c e s t D t Row Set t e t r e c S c t / t e e a S l S l s t t e e t e a D S a t S m e t I e m c e L L n o e m Q t n Q D t e S S n t Database Server DB XML Database Interface Client XMLDocument DocID /Doc Name DocList XMLQuery • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Import loss of information time-consuming produces lots of tuples Queries XML-QL fast new document as result Import no loss of information faster than structure-oriented decomposition Queries interMedia Text fast only document IDs as result XPath high response time flexible granularity of results Comparison Structure-Oriented Approach Opaque Approach • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Implementation Experience • Import Problems • structure-oriented decomposition • SAX parser produces FillBuf Error for XML documents > 3 MB • limitations of VARCHAR columns (max. 4000 bytes) • opaque Approach • OutOfMemory error during import of XML documents > 4MB • reason: to little heap size in Java • use start option Xmx<HeapSize> • Queries • Opaque Approach • OutOfMemory error when parsing CLOB into DB • increase java_pool_size (100-150 MB) • increase shared_pool_size (150-200 MB) • Export and Delete without Problems • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Outlook • Experience • problems with larger documents in both approaches • no universal solution for all requirements • Co-Existence of Multiple Storage Approaches • integrate different storage engines • combine structure-oriented decomposition and opaque approach • need for a generic XML data type • New Data Model for Structure-Oriented Approach • reduce loss of information • XML Database Interface / Middleware • combines different approaches • parameterize the XML database interface • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

Management of XML Documents without Schema in Relational Database Systems

Management of XML Documents without Schema in Relational Database Systems

Presentation Transcript

Testing Isolation Levels of Relational Database Management Systems

XML to Relational Database Mapping

Comparison of Relational database management systems

Object / Object-Relational Database Management Systems

Relational Database Schema normalizer

XML Schema

Schema Advisor for Hybrid Relational-XML DBMS

XML-to-Relational Schema Mapping Algorithm ODTDMap

Understand Relational Database Management Systems

Relational Database Systems

MD807: Relational Database Management Systems

Management of XML Documents in Object-Relational Databases

Relational Database Systems

Basic Definitions Relational Database Concepts Relational Database Management Systems

Relational Database Systems

Object-relational database systems

ZEIT2301 Design of Information Systems Relational Database Schema

Store XML Data in a Relational Database

XML Schema

Processing XML data using a relational database: Schema-Based XML Storage