450 likes | 619 Views
Efficient XML Storage, Query, and Update . Shi Xu Heng Yuan Spring 2004 CS240B Prof. Zaniolo. XML Storage Methods. Flat Streams Metamodeling Mixed Redundant Hybrid. Method Covered. “Efficient storage of XML data” covers hybrid method using a custom made storage system called Natix.
E N D
Efficient XML Storage, Query, and Update Shi Xu Heng Yuan Spring 2004 CS240B Prof. Zaniolo
XML Storage Methods • Flat Streams • Metamodeling • Mixed • Redundant • Hybrid
Method Covered • “Efficient storage of XML data” covers hybrid method using a custom made storage system called Natix. • “Efficient relational storage and retrieval of XML documents” covers Metamodeling using their Monet database.
Natix Overview • Natix is an efficient, native repository for storing, retrieving and managing XML documents. • It supports tree-structured objects like XML documents at low architecture level.
Logic Model • Tree is often used in logic model of semistructured data. • Each non-leaf node is labeled with a symbol taken from an alphabet DTD. • Leaf nodes can be labeled as the data itself.
A sample XML with its associated logical tree Example XML: <SPEECH> <SPEAKER>OTHELLO</SPEAKER> <LINE>Let me see your eyes;</LINE> <LINE>Look in my face.</LINE> </SPEECH>
Physical Model • Object Content: • Node and objects are used interchangeably. • A record contains a set of nodes/objects. • Aggregate nodes are inner nodes of the tree. They contain their respective child nodes. • Literal nodes are leaf nodes containing an uninterpreted stream of bytes, like text strings, graphics, etc. • Proxy nodes are nodes which point to different records.
Node Representation • Whole documents (or subtrees of documents) can be stored in one record. • Each record contains exactly one subtree. • The root nodes of each record’s subtree are called standalone objects, other nodes are called embedded objects. • The record size has an upper limit, the page size.
Large Trees • For a large tree, physical model must provide a mechanism for distributing data trees over several pages. • Method 1: “flat” representation. It wastes the available structural information about the data. • Method 2: split large objects based on the underlying tree structure. • Use proxy objects to connect subtrees of the large object residing in other records.
A Sample Distribution of logical nodes on records • Proxies (p1, p2) • Helper aggregate objects (h1, h2) • Scaffolding objects include proxies and helper aggregates. • Facade objects (f i)
Dynamic maintenance of an efficient storage • The principle problem is that a record containing a subtree can grow larger than a page if a node is added or grows. • Subtree contains in the record has to be partitioned into several subtrees. • Scaffolding nodes link the new records together in the physical tree.
Tree Growth Procedure • Step 1: Determine the record r into which the node has to be inserted. • Step 2: If there is not enough on the page, try to move r. If the record still does not fit, split the record: • (a) Determine the separator by recursively descending into the r’s subtree • (b) Distribute the resulting partitions onto records • (c) Insert the separator into the parent record, recursively calling this procedure • Step 3: Insert the new node
Determining the Insertion Location • There are several possibilities to insert a new node f n into the physical tree. • This choice can be determined by a configuration parameters.
Determining the separator • Separator – a tree structure with proxies pointing to the new records to indicate where which part of the old record was moved. • Consists of all the nodes on the path from d to the subtree’s root. • Partition the tree into left partition L, right partition R and Separator S.
Splitting a Record • Distributing the nodes on records • After determining the partitioning, the contents of the record has to be distributed onto new records. • Each resulting subtree is then stored in its own record, called partition records. • Inserting the separator • The separator is moved to the parent record.
Split Algorithm • Find a node d, such that the resulting L and R. • The ratio between the sizes of L and R is determined by a configuration parameter (split target). • Another configuration parameter Split tolerance specifies the minimum size for the subtree of d. It is used to prevent fragmentation.
Performance Test • XML markup version of Shakspeare’s play with 8MB with 320,000 nodes. • Pentium-II 333Mhz with 128MB under Windows NT4.0 with IBM DCAS 34330 disk. • The implementation of the record and tree storage managers was done in C++.
Test Conditions • Record:Node 1:1 indicating smart record splitting being inhibited. • Record:Node 1:n indicating that the algorithm has full control over distribution of nodes on records. • Incremental updates distributed over the whole document. • Updates in pre-order (append).
Queries • Retrieve all speakers in the third act and second scene of every play, which means it accesses all leaf nodes of a certain type in one selected subtree of the document. • Recreate the textual representation of the complete first speech in every scene, hence reading a lot of small contiguous fragments of each document. • A simple path query was evaluated by reading only the opening speech of each play.
Monet Model • XML document is decomposed into binary relations. • Efficient for storage and retrieval of XML documents in a relational database. • The database used is their Monet database server which supports the Monet model.
Some Definitions • An XML document is a rooted treed = (V, E, r, labelE, labelA, rank) with nodes V and edges EVV and a distinguished node rV. • The function labelE : Vstring assigns labels to nodes • labelA : Vstringstring assigns pairs of strings, attributes and their values, to nodes. • rank : Vint establishes a ranking to allow for an order among nodes with the same parent node.
A sample XML document <bibliography> <article key=“BB88”> <author>Ben Bit</author> <title>How to Hack</title> </article> <article key=“BK99”> <editor>Ed Itor</editor> <author>Bob Byte</author> <author>Ken Key</author> <title>Hacking & RSI</title> </article> </bibliography>
Monet Transform • Given an XML document d, the Monet transform is a quadruple Mt(d)=(r,R,A,T) where • R is the set of binary relations that contain all associations between nodes; • A is the set of binary relations that contain all associations between nodes and their attribute values, including character data; • T is set of binary relations that contain all pairs of nodes and their rank; • r is the root of the document;
Assessment • Implemented within the Monet database server • Tested on 550 MHz Silicon Graphics 1400 Server with 1 GB main memory. • Also used Sun UltraSparc-IIi with 360 MHz and 256 MB main memory to contrast with a related work.
Scaling of Document • Scaled the ACM Anthology from 30 to 3x106 which corresponds to XML source size between 10KB and 1GB. • Run 4 queries consisting of path expressions of length 1 through 4 for various sizes of the anthology.
Comparison of response time for query set of SYU, another method for storage/retrieval of XML document.
Compare/Contrast Natix and Monet • Natix uses custom database while Monet is built on top of relational database • Neither uses DTD. • Natix focuses on XML query as well as update. • Monet focuses on XML storage and query. • Though lacking equivalent test, Monet is faster than Natix on query. • Monet seems to be more space efficient than Natix as well.
References • “Efficient storage of XML data” By Carl-Christian Kanne, et al. ICDE 2000 http://citeseer.nj.nec.com/kanne99efficient.html • “Efficient Relational Storage and Retrieval of XML Documents” By Albrecht Schmidt, et al. WebDB 2000 http://www.research.att.com/conf/webdb2000/program.html