260 likes | 356 Views
Management of XML Documents without Schema in Relational Database Systems. Workshop Objects, <XML> and Databases OOPSLA 2001, Tampa Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics. Overview. Introduction Motivation Main Issues
E N D
Management of XML Documents without Schema in Relational Database Systems Workshop Objects, <XML> and Databases OOPSLA 2001, Tampa Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics
Overview • Introduction • Motivation • Main Issues • Structure-Oriented Approach • Storage / Data Model • Queries • Evaluation • Opaque Approach • Storage • Queries (vs. XPath) • Evaluation • Prototype Implementation • Interface • Experience • Outlook
Motivation • XML is used in: • data publishing (document-centric documents) • data exchange (data-centric documents) • Why XML Documents without Schema? • generated by programs • mostly data-centric documents, e.g., account statements • high update frequency of the document structure evolving schemas • Problems • How to deal with XML documents without DTD / XML Schema in databases? • Evaluate approaches • Use relational database systems • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Main Issues • Evaluate Storage and Retrieval Methods • no defined document schema • platform: Oracle 8i • XML-to-Relational Mapping Approaches • structure-oriented decomposition • opaque approach • Identify Parameters of a Unified XML-DB Interface • Implement a Testbed • qualitiative assessment • performance of both approaches • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Structure-Oriented Approach • Characteristics • decomposition of an XML document into smaller units (elements) • depends on the document structure only • target system: relational DBMs, generic schema • Variety of Mapping Methods • model XML document als directed ordered labeled graphs and map them to tables • proposed algorithms: • edge tables [Florescu, Kossmann] • universal table • inlining techniques [Shanmugasundaram et.al.] • model-based fragmentation • Monet XML-model • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Storing XML Data in Relations • XML-QL Data Model <tree> <person age=’55‘> <name>Peter</name> </person> <person age=’38‘> <name>Mary</name> <address>Fruitdale Ave. </address> </person></tree> • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Data Model 1 1 n n 1 1 1 1 • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook 0/1 0/1 0/1 0/1
Import Algorithm <tree> <person age=“36“> <name>Peter</name> <address> <street>Main Road 4</street> <zip>04236</zip> <city>Leipzig</city> </address> </person> </tree> • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Query Processing • Query Language • XML query language most appropriate • data model of our solution is based on XML-QL XML-QL preferred choice • XML-Relational Mismatch • relational DBMS “understands“ SQL only requires translation from XML-QL to SQL • generate result document from the tuples retrieved • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Query Processing XML-QLQuery Parser ObjectStructure GenerateSQL Statement SQLStatement DB ExecuteSQL Statement • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook ConstructResult Document Row Set XMLDocument
XML-QL Query CONSTRUCT <result> { WHERE <person> <name>$n</name> <address>$a</address> </person> CONSTRUCT <person> <name>$n</name> <address>$a</address> </person> } </result> SQL Statement SELECT DISTINCT B.Type AS n_Type, B.TargetId AS n_TargetId, B.Depth AS n_Depth, C.Value AS n_Value, D.Type AS a_Type, D.TargetId AS a_TargetId, D.Depth AS a_Depth, E.Value AS a_Value FROM tblEdge A,tblEdge B,tblLeafs C, tblEdge D,tblLeafs E WHERE (A.EdgeName = ‘person’) AND (A.TargetId = B.SourceId) AND (B.EdgeName = ‘name’) AND (B.LeafId = C.LeafId(+)) AND (A.TargetId = D.SourceId) AND (D.EdgeName = ‘address’) AND (D.LeafId = E.LeafId(+)) Generate an SQL-Statement • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Construct Result Document • Result Tuple • Subtree Reconstruction SELECT A.EdgeName, A.Type, Al.Value AS A_LeafVal, Aa.Value AS A_AttrVal FROM tblEdge A, tblLeafs Al, tblAttrs Aa WHERE A.SourceId=5 AND A.leafId=Al.leafId(+) AND A.attrId=Aa.attrId(+) • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Query Result • XML Result Document <result> <person> <name>Peter</name> <address> <street>Main Road 4</street> <zip>04236</zip> <city>Leipzig</city> </address> </person> </result> • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Advantages • Vendor Independence • no specific DBMS features needed • Stability • High Flexibility of Queries • retrieve and update single values • full SQL functionality can be used • Well-Suited for Structure-Oriented Queries • structures are represented in tables • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Drawbacks • Information Loss • Comments • Processing Instructions • Prolog • CDATA Sections • Entities • Restrictions • only one text (content) per element <element> Text1 <subelement/> Text2 lost </element> • element text as VARCHAR(n); n <= 4000 • Increased Load Time • sample document: 3.3. MB, 130.000 tuples, 13 minutes • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Opaque Approach • Characteristics • XML document stored as Large Object (LOB) • document completely preserved • Storage Insert into tblXMLClob values (1,‘person.xml‘,‘ <person> <name>Mary</name> </person>‘ ); • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Oracle interMedia Text • Query Facilities of interMedia Text • full-text retrieval (word matching only) • path expression only together with content search • no range queries • Example in interMedia Text: SELECT DocId FROM tblXMLClob WHERE CONTAINS(content,‘(Mary WITHIN name) WITHIN person‘)>0 • XML Full-Text Index • Autosectioner Index • XML Sectioner Index • WITHIN operator • text_subquery WITHIN elementname • searches the entire text content of the named tag • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
return document IDs word matching (default) no existence test for elements or attributes restricted set of path expressions using WITHINe.g.: (xml WITHIN title) WITHIN book provides limited attribute value searches, no nesting of attribute searches numeric and data values not type-converted no range searches on attribute values return document fragments substring matching search for existing elements or attributes path expressions structure-oriented queries//Book/Title/[contains(..‘xml‘)] searches for attribute values and element text can be combined considers also decimal values range searches possible using filters Comparison of Queries interMedia Text XPath • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Search Document IDs of all CLOBs of the XML Table Execute XPath Query on the DOM tree for each CLOB Objects of a Document ID DB XPath Query Doc IDs server-side DocIDs with Result XML Documents XPath Queries with PL/SQL • Prerequisite • XDK for PL/SQL installed on the server • Parse CLOB into DOM representation • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Advantages • Information Preservation • Handling of Large Documents • appropriate for document-centric documents with little structure and prose-rich elements • Different XML Document APIs • interMedia Text: restricted set of XPath functionality • generate a DOM of the document before using XPath queries • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Drawbacks • Restricted Expressiveness of Text Queries • Performance vs. Accuracy of Query Results • interMedia Text queries on CLOBs faster than the DOM-API • sample document: 12.5 MB, parse time 3 min, load time 5 min • Restrictions of Indexes • maximum tag names for indexing (incl. namespace) 64 bytes • Problems with Markup • character entities • Vendor Dependence • text engines are proprietary, e.g., Oracle interMedia • Stability • maximum document size 50 MB • memory errors may occur with smaller documents • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
User Interface • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Middleware / XML Interface Query Method StorageMethod ExportMethod DeleteMethod DocumentList Transf. to SQL Transf. to XML t n S s S t e e n Q Q m e m L L S m e a Q t S e N D a t L e Row Set a l t c e I e t S l o n S c e s t D t Row Set t e t r e c S c t / t e e a S l S l s t t e e t e a D S a t S m e t I e m c e L L n o e m Q t n Q D t e S S n t Database Server DB XML Database Interface Client XMLDocument DocID /Doc Name DocList XMLQuery • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Import loss of information time-consuming produces lots of tuples Queries XML-QL fast new document as result Import no loss of information faster than structure-oriented decomposition Queries interMedia Text fast only document IDs as result XPath high response time flexible granularity of results Comparison Structure-Oriented Approach Opaque Approach • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Implementation Experience • Import Problems • structure-oriented decomposition • SAX parser produces FillBuf Error for XML documents > 3 MB • limitations of VARCHAR columns (max. 4000 bytes) • opaque Approach • OutOfMemory error during import of XML documents > 4MB • reason: to little heap size in Java • use start option Xmx<HeapSize> • Queries • Opaque Approach • OutOfMemory error when parsing CLOB into DB • increase java_pool_size (100-150 MB) • increase shared_pool_size (150-200 MB) • Export and Delete without Problems • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook
Outlook • Experience • problems with larger documents in both approaches • no universal solution for all requirements • Co-Existence of Multiple Storage Approaches • integrate different storage engines • combine structure-oriented decomposition and opaque approach • need for a generic XML data type • New Data Model for Structure-Oriented Approach • reduce loss of information • XML Database Interface / Middleware • combines different approaches • parameterize the XML database interface • Introduction • Structure-Oriented Approach • Opaque Approach • Prototype Implementation • Outlook