1 / 26

Management of XML Documents without Schema in Relational Database Systems

Management of XML Documents without Schema in Relational Database Systems. Workshop Objects, <XML> and Databases OOPSLA 2001, Tampa Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics. Overview. Introduction Motivation Main Issues

Download Presentation

Management of XML Documents without Schema in Relational Database Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Management of XML Documents without Schema in Relational Database Systems Workshop Objects, <XML> and Databases OOPSLA 2001, Tampa Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics

  2. Overview • Introduction • Motivation • Main Issues • Structure-Oriented Approach • Storage / Data Model • Queries • Evaluation • Opaque Approach • Storage • Queries (vs. XPath) • Evaluation • Prototype Implementation • Interface • Experience • Outlook

  3. Motivation • XML is used in: • data publishing (document-centric documents) • data exchange (data-centric documents) • Why XML Documents without Schema? • generated by programs • mostly data-centric documents, e.g., account statements • high update frequency of the document structure evolving schemas • Problems • How to deal with XML documents without DTD / XML Schema in databases? • Evaluate approaches • Use relational database systems • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  4. Main Issues • Evaluate Storage and Retrieval Methods • no defined document schema • platform: Oracle 8i • XML-to-Relational Mapping Approaches • structure-oriented decomposition • opaque approach • Identify Parameters of a Unified XML-DB Interface • Implement a Testbed • qualitiative assessment • performance of both approaches • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  5. Structure-Oriented Approach • Characteristics • decomposition of an XML document into smaller units (elements) • depends on the document structure only • target system: relational DBMs, generic schema • Variety of Mapping Methods • model XML document als directed ordered labeled graphs and map them to tables • proposed algorithms: • edge tables [Florescu, Kossmann] • universal table • inlining techniques [Shanmugasundaram et.al.] • model-based fragmentation • Monet XML-model • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  6. Storing XML Data in Relations • XML-QL Data Model <tree> <person age=’55‘> <name>Peter</name> </person> <person age=’38‘> <name>Mary</name> <address>Fruitdale Ave. </address> </person></tree> • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  7. Data Model 1 1 n n 1 1 1 1 • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook 0/1 0/1 0/1 0/1

  8. Import Algorithm <tree> <person age=“36“> <name>Peter</name> <address> <street>Main Road 4</street> <zip>04236</zip> <city>Leipzig</city> </address> </person> </tree> • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  9. Query Processing • Query Language • XML query language most appropriate • data model of our solution is based on XML-QL XML-QL preferred choice • XML-Relational Mismatch • relational DBMS “understands“ SQL only  requires translation from XML-QL to SQL • generate result document from the tuples retrieved • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  10. Query Processing XML-QLQuery Parser ObjectStructure GenerateSQL Statement SQLStatement DB ExecuteSQL Statement • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook ConstructResult Document Row Set XMLDocument

  11. XML-QL Query CONSTRUCT <result> { WHERE <person> <name>$n</name> <address>$a</address> </person> CONSTRUCT <person> <name>$n</name> <address>$a</address> </person> } </result> SQL Statement SELECT DISTINCT B.Type AS n_Type, B.TargetId AS n_TargetId, B.Depth AS n_Depth, C.Value AS n_Value, D.Type AS a_Type, D.TargetId AS a_TargetId, D.Depth AS a_Depth, E.Value AS a_Value FROM tblEdge A,tblEdge B,tblLeafs C, tblEdge D,tblLeafs E WHERE (A.EdgeName = ‘person’) AND (A.TargetId = B.SourceId) AND (B.EdgeName = ‘name’) AND (B.LeafId = C.LeafId(+)) AND (A.TargetId = D.SourceId) AND (D.EdgeName = ‘address’) AND (D.LeafId = E.LeafId(+)) Generate an SQL-Statement • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  12. Construct Result Document • Result Tuple • Subtree Reconstruction SELECT A.EdgeName, A.Type, Al.Value AS A_LeafVal, Aa.Value AS A_AttrVal FROM tblEdge A, tblLeafs Al, tblAttrs Aa WHERE A.SourceId=5 AND A.leafId=Al.leafId(+) AND A.attrId=Aa.attrId(+) • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  13. Query Result • XML Result Document <result> <person> <name>Peter</name> <address> <street>Main Road 4</street> <zip>04236</zip> <city>Leipzig</city> </address> </person> </result> • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  14. Advantages • Vendor Independence • no specific DBMS features needed • Stability • High Flexibility of Queries • retrieve and update single values • full SQL functionality can be used • Well-Suited for Structure-Oriented Queries • structures are represented in tables • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  15. Drawbacks • Information Loss • Comments • Processing Instructions • Prolog • CDATA Sections • Entities • Restrictions • only one text (content) per element <element> Text1 <subelement/> Text2  lost </element> • element text as VARCHAR(n); n <= 4000 • Increased Load Time • sample document: 3.3. MB, 130.000 tuples, 13 minutes • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  16. Opaque Approach • Characteristics • XML document stored as Large Object (LOB) • document completely preserved • Storage Insert into tblXMLClob values (1,‘person.xml‘,‘ <person> <name>Mary</name> </person>‘ ); • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  17. Oracle interMedia Text • Query Facilities of interMedia Text • full-text retrieval (word matching only) • path expression only together with content search • no range queries • Example in interMedia Text: SELECT DocId FROM tblXMLClob WHERE CONTAINS(content,‘(Mary WITHIN name) WITHIN person‘)>0 • XML Full-Text Index • Autosectioner Index • XML Sectioner Index • WITHIN operator • text_subquery WITHIN elementname • searches the entire text content of the named tag • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  18. return document IDs word matching (default) no existence test for elements or attributes restricted set of path expressions using WITHINe.g.: (xml WITHIN title) WITHIN book provides limited attribute value searches, no nesting of attribute searches numeric and data values not type-converted no range searches on attribute values return document fragments substring matching search for existing elements or attributes path expressions  structure-oriented queries//Book/Title/[contains(..‘xml‘)] searches for attribute values and element text can be combined considers also decimal values range searches possible using filters Comparison of Queries interMedia Text XPath  • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  19. Search Document IDs of all CLOBs of the XML Table Execute XPath Query on the DOM tree for each CLOB Objects of a Document ID DB XPath Query Doc IDs server-side DocIDs with Result XML Documents XPath Queries with PL/SQL • Prerequisite • XDK for PL/SQL installed on the server • Parse CLOB into DOM representation • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  20. Advantages • Information Preservation • Handling of Large Documents • appropriate for document-centric documents with little structure and prose-rich elements • Different XML Document APIs • interMedia Text: restricted set of XPath functionality • generate a DOM of the document before using XPath queries • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  21. Drawbacks • Restricted Expressiveness of Text Queries • Performance vs. Accuracy of Query Results • interMedia Text queries on CLOBs faster than the DOM-API • sample document: 12.5 MB, parse time 3 min, load time 5 min • Restrictions of Indexes • maximum tag names for indexing (incl. namespace) 64 bytes • Problems with Markup • character entities • Vendor Dependence • text engines are proprietary, e.g., Oracle interMedia • Stability • maximum document size 50 MB • memory errors may occur with smaller documents • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  22. User Interface • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  23. Middleware / XML Interface Query Method StorageMethod ExportMethod DeleteMethod DocumentList Transf. to SQL Transf. to XML t n S s S t e e n Q Q m e m L L S m e a Q t S e N D a t L e Row Set a l t c e I e t S l o n S c e s t D t Row Set t e t r e c S c t / t e e a S l S l s t t e e t e a D S a t S m e t I e m c e L L n o e m Q t n Q D t e S S n t Database Server DB XML Database Interface Client XMLDocument DocID /Doc Name DocList XMLQuery • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  24. Import loss of information time-consuming produces lots of tuples Queries XML-QL fast new document as result Import no loss of information faster than structure-oriented decomposition Queries interMedia Text fast only document IDs as result XPath high response time flexible granularity of results Comparison Structure-Oriented Approach Opaque Approach • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  25. Implementation Experience • Import Problems • structure-oriented decomposition • SAX parser produces FillBuf Error for XML documents > 3 MB • limitations of VARCHAR columns (max. 4000 bytes) • opaque Approach • OutOfMemory error during import of XML documents > 4MB • reason: to little heap size in Java • use start option Xmx<HeapSize> • Queries • Opaque Approach • OutOfMemory error when parsing CLOB into DB • increase java_pool_size (100-150 MB) • increase shared_pool_size (150-200 MB) • Export and Delete without Problems • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

  26. Outlook • Experience • problems with larger documents in both approaches • no universal solution for all requirements • Co-Existence of Multiple Storage Approaches • integrate different storage engines • combine structure-oriented decomposition and opaque approach • need for a generic XML data type • New Data Model for Structure-Oriented Approach • reduce loss of information • XML Database Interface / Middleware • combines different approaches • parameterize the XML database interface • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook

More Related