A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML • Presented by: Ai Mu • Based on the paper by Ning Zhang, Varun Kacholia, and M. Tamer Ozsu.
Outline • Introduction • Preliminaries • NoK pattern matching at the logical level • Physical storage • XML path queries at the physical level • Experimental evaluation • Conclusion
Introduction • The increasingly wide use of XML leads to • the need to store large volumes of data encoded in XML • the need to query XML data more efficiently • Path expressions are the most natural way to query tree-structured data such as an XML tree • Evaluating a path expression against an XML tree is a tree pattern matching (TPM) problem • A path expression is a pattern tree that specifies a set of constraints • TPM problem: find the nodes in the XML tree that satisfy all the constraints
Existing evaluation approaches • Navigational approach • traverse the tree structure • test whether each tree node satisfies the constraints specified by the path expression • Join-based approach • for each pattern tree node, select a list of XML tree nodes that satisfy its node-associated constraints • join the lists based on their structural relationships • However, neither approach adapts well to streaming XML data.
A Novel Approach • Define a special pattern tree and pattern matching • Next-of-Kin (NoK) pattern tree, in which nodes are connected by parent-child and following/preceding-sibling relationships only • Next-of-Kin pattern matching • speeds up the node selection step • reduces the join size in the second step • Design a novel, succinct physical storage scheme • supports efficient NoK query evaluation
Preliminaries • Consider the bibliography XML:
<bib>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="1992">
    <title>Advanced Programming in the Unix</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="2000">
    <title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author>
    <publisher>Morgan Kaufmann Publisher</publisher>
    <price>39.93</price>
  </book>
  <book year="1999">
    <title>The Economics of Technology</title>
    <editor><last>Gerbarg</last><first>Darcy</first><affiliation>CiTI</affiliation></editor>
    <publisher>Kluwer Academic Publisher</publisher>
    <price>129.95</price>
  </book>
</bib>
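Subject tree • The subject tree (XML tree) of the bibliography document, drawn with one-letter tag names [figure: root a with four b children; each b has children z, e, one or more c (or one d), i, j; each c or d has children f and g] • Note: bib -> a, book -> b, @year -> z, author -> c, title -> e, publisher -> i, price -> j, first -> f, last -> g, editor -> d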
Pattern tree • Query: "find all books written by Stevens whose price is less than 100" • Path expression: //book[author/last="Stevens"][price<100] • Pattern tree: a graphical representation of the constraints specified in a path expression [figure: root connected by // to book; book connected by / to author and by / to price<100; author connected by / to last="Stevens"]
NoK pattern matching at the logical level • Next-of-Kin pattern tree: • Consists of edges whose labels are in {parent-child relationship, following-sibling relationship}. • Two steps in the process of matching a NoK pattern tree to the subject tree: • Locate the nodes in the subject tree at which to start pattern matching; • Perform NoK pattern matching from each starting node.
Locate the starting node • There are several options for locating the starting point: • Naïve approach: traverse the whole subject tree in document order and try to match each node with the root of the NoK pattern tree; • Index on tag names: if we have a B+ tree on tag names, an index lookup for the root of the NoK pattern tree generates all possible starting points; • Index on data values: if there are value constraints in the NoK pattern tree (such as last="Stevens") and we have a B+ tree on all values in the XML document, we can use that value-based index to locate all nodes having the particular value and use them as the starting points.
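A minimal Python sketch of the first two options, assuming the subject tree is held in memory as nested (tag, children) tuples. A plain dict stands in for the B+ tree on tag names, and the function names are illustrative, not taken from the paper.

```python
from collections import defaultdict

# Subject tree as (tag, [children]); a tiny fragment with two books.
tree = ("a", [("b", [("z", []), ("e", [])]),
              ("b", [("z", []), ("e", [])])])

def preorder(node, path=(0,)):
    """Yield (path, tag) pairs in document order; path identifies the node."""
    tag, children = node
    yield path, tag
    for i, child in enumerate(children, start=1):
        yield from preorder(child, path + (i,))

def naive_starts(root, tag):
    # Naive approach: visit every node and compare it with the pattern root.
    return [p for p, t in preorder(root) if t == tag]

def build_tag_index(root):
    # Assumption: a dict plays the role of the B+ tree on tag names.
    index = defaultdict(list)
    for p, t in preorder(root):
        index[t].append(p)
    return index

tag_index = build_tag_index(tree)
starts_by_scan = naive_starts(tree, "b")
starts_by_index = tag_index["b"]
```

Both routes return the same starting points; the index version avoids touching nodes whose tag cannot match the pattern root.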
Example • Consider the subject tree and the NoK pattern tree written with one-letter tag names: b[c/g="Stevens"][j<100] • Suppose the starting point snode is the first node b of the subject tree; it matches the pattern root proot and is appended to the result set R • The algorithm iterates over b's children to check whether they match any node in the set {c, j} • The third child of snode matches c, so a recursive call is invoked to match the NoK pattern tree c/g="Stevens" against the subject tree rooted at snode/c • The recursive call returns True; the remaining children are checked and eventually j is matched, leaving the set empty • The result R contains the starting point b [figure: subtree b with children z, e, c, i, j, where c has children f and g]
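The walk-through above can be sketched in Python as follows. This is a simplified illustration, not the paper's algorithm: it handles parent-child edges only, ignores sibling-order constraints, and matches pattern children greedily. The Node and Pattern classes are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:                      # subject tree node (hypothetical layout)
    tag: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

@dataclass
class Pattern:                   # pattern tree node with an optional value test
    tag: str
    test: Optional[Callable[[str], bool]] = None
    children: List["Pattern"] = field(default_factory=list)

def nok_match(snode: Node, pnode: Pattern) -> bool:
    """True if the pattern rooted at pnode matches at snode."""
    if snode.tag != pnode.tag:
        return False
    if pnode.test is not None and (snode.value is None or not pnode.test(snode.value)):
        return False
    pending = list(pnode.children)        # pattern children still unmatched
    for child in snode.children:          # iterate over the children of snode
        if not pending:
            break
        for p in pending:
            if nok_match(child, p):       # recursive call on the subtree
                pending.remove(p)
                break
    return not pending                    # success when the set is empty

# First book of the bibliography, with one-letter tags.
book1 = Node("b", children=[
    Node("z", "1994"), Node("e", "TCP/IP Illustrated"),
    Node("c", children=[Node("g", "Stevens"), Node("f", "W.")]),
    Node("i", "Addison-Wesley"), Node("j", "65.95")])

# Pattern b[c/g="Stevens"][j<100]
pattern = Pattern("b", children=[
    Pattern("c", children=[Pattern("g", test=lambda v: v == "Stevens")]),
    Pattern("j", test=lambda v: float(v) < 100)])

matched = nok_match(book1, pattern)
```

Greedy matching is enough for this example; a pattern whose sibling nodes can match the same subject child would need backtracking.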
Physical storage • Desiderata for designing the physical storage scheme: • Structural information should be stored separately from the value information. • The subject tree should be "materialized" to fit into the paged I/O model. • The storage scheme should have enough auxiliary information (e.g. indexes on values and tag names) to speed up NoK pattern matching. • The storage scheme should be adaptable to support updates.
Value information storage • Based on two observations, value information and structural information should be stored separately: • An XML document is a mixture of structural information and value information; • Any path query can be divided into two subqueries: pattern matching on the tree structure and selection based on values. • Example: Path expression: //book[author/last="Stevens"][price<100] • structural constraints: //book[author/last][price] • value constraints: last="Stevens" and price<100 • Separating structural and value information separates the two concerns, so that each can be addressed appropriately: • B+ tree on the value information; • path index or tag name index on the structural information.
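A rough Python sketch of the split shown in the example, assuming predicates have the simple form [path op literal]. The regular expression and the split_query helper are illustrative only and cover just this restricted shape of path expression.

```python
import re

# Matches a predicate such as [author/last="Stevens"] or [price<100].
PRED = re.compile(r'\[([\w/@]+)\s*(<=|>=|!=|=|<|>)\s*("[^"]*"|[^\]]+)\]')

def split_query(path):
    """Return (structural path, list of (step, operator, literal))."""
    values = []

    def strip(match):
        step, op, literal = match.groups()
        values.append((step, op, literal.strip('"')))
        return f"[{step}]"            # keep only the structural part

    structural = PRED.sub(strip, path)
    return structural, values

structural, values = split_query('//book[author/last="Stevens"][price<100]')
```

The structural part can then be handed to pattern matching, while the value constraints become lookups against the value index.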
Value information storage (cont.) • Maintain the connection between structural and value information • Use the Dewey ID as the key of tree nodes to reconnect them, e.g. the Dewey ID of root a is 0 and the Dewey ID of its second child b is 0.2; • Given a Dewey ID, a B+ tree (Dewey ID -> pointer to value in data file) locates the value of the node in the data file; • Another B+ tree (hashed value -> Dewey ID) locates the nodes that have a given value. [figure: the two B+ trees and the data file]
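A small Python sketch of Dewey ID assignment and the two lookups, assuming children are numbered from 1 so that the second child of the root gets 0.2 as on the slide. Dicts stand in for the two B+ trees, and zlib.crc32 is an arbitrary choice of hash for this example.

```python
import zlib
from collections import defaultdict

def index_tree(node, dewey="0", by_dewey=None, by_value=None):
    """node is (tag, text_or_None, children). Returns the two index stand-ins."""
    if by_dewey is None:
        by_dewey, by_value = {}, defaultdict(list)
    tag, text, children = node
    by_dewey[dewey] = (tag, text)                 # Dewey ID -> node value
    if text is not None:
        key = zlib.crc32(text.encode("utf-8"))    # hashed value -> Dewey IDs
        by_value[key].append(dewey)
    for i, child in enumerate(children, start=1):
        index_tree(child, f"{dewey}.{i}", by_dewey, by_value)
    return by_dewey, by_value

tree = ("a", None, [
    ("b", None, [("c", None, [("g", "Stevens", [])])]),
    ("b", None, [("c", None, [("g", "Stevens", [])])]),
])
by_dewey, by_value = index_tree(tree)
stevens_nodes = by_value[zlib.crc32(b"Stevens")]
```

A real implementation would compare the stored value after the hash lookup, since different values can share a hash.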
Value information storage (cont.) • In the data file, each element content can be represented by a pair (len, value), e.g. (4, "1994"), (7, "Stevens"), (5, "65.95") • The Dewey ID B+ tree stores the position of these records in the data file. • If more than one node has the same value, only one copy is kept and all of these nodes point to the same position.
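A minimal Python sketch of such a data file, assuming a 2-byte big-endian length prefix and UTF-8 values (both are choices made for this example, not details from the paper). A dict stands in for the Dewey ID B+ tree.

```python
import struct

def build_data_file(values_by_dewey):
    """Pack values as (len, value) records; identical values are stored once."""
    data = bytearray()
    offset_of_value = {}      # value -> offset of its single stored copy
    dewey_to_offset = {}      # stand-in for the Dewey ID B+ tree
    for dewey, value in values_by_dewey.items():
        if value not in offset_of_value:
            raw = value.encode("utf-8")
            offset_of_value[value] = len(data)
            data += struct.pack(">H", len(raw)) + raw
        dewey_to_offset[dewey] = offset_of_value[value]
    return bytes(data), dewey_to_offset

def read_value(data, offset):
    """Read the (len, value) record starting at offset."""
    (length,) = struct.unpack_from(">H", data, offset)
    return data[offset + 2: offset + 2 + length].decode("utf-8")

data, dewey_to_offset = build_data_file({
    "0.1.1": "1994",
    "0.1.3.1": "Stevens",
    "0.2.3.1": "Stevens",     # same value: points at the existing record
})
```

Both Stevens nodes resolve to the same offset, so the value occupies space in the file only once.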
Structural information storage • Store the nodes in pre-order and keep the tree structure by inserting pairs of parentheses. • E.g. (a(b)(c)) represents the tree that has a root a and its two children b and c • "(" indicates the beginning of a subtree; ")" indicates the end of the subtree. • Each node implies an open parenthesis, so the open parentheses can be omitted: (a(b)(c)) becomes a b) c))
String representation • The string representation of an XML tree [figure: the string representation, annotated with the level of each node, i.e. its depth from the root]
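A short Python sketch of the string representation and of recovering node levels from it, assuming one-letter tag names as in the slides and counting the root as level 1 (the numbering base is an assumption). The function names are illustrative.

```python
def to_string(node):
    """Pre-order string with implicit open parentheses: tag, children, ')'."""
    tag, children = node
    return tag + "".join(to_string(c) for c in children) + ")"

def node_levels(s):
    """Return (tag, level) for every node, scanning the string left to right."""
    level = 0
    out = []
    for ch in s:
        if ch == ")":
            level -= 1            # a subtree ends
        else:
            level += 1            # a tag implies an open parenthesis
            out.append((ch, level))
    return out

tree = ("a", [("b", []), ("c", [])])
s = to_string(tree)
levels = node_levels(s)
```

The levels fall out of a single left-to-right scan, which is why the tree never has to be rebuilt in memory to answer level questions.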
Structural information storage (cont.) • For each page, an extra tuple (st, lo, hi) is stored, where st is the level of the last node in the previous page, and lo and hi are the minimum and maximum levels of all nodes in that page. • Page layout for structural information: a header holding (st, lo, hi) and a next-page pointer, followed by the string representation, followed by space reserved for updates. • Example page content: a b z) e) c f ) g ) ) i) j ) ) b z ) e ) c f
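A Python sketch of computing the per-page tuples. It rests on one interpretive assumption: every symbol, including ")", is given the running depth after it is read, so st is the running depth at the end of the previous page and lo and hi are taken over all symbols in the page. Fixed-size pages and one-letter tags are also assumptions of this example.

```python
def page_headers(s, page_size):
    """Return one (st, lo, hi) tuple per page of the string representation."""
    headers = []
    depth = 0
    for start in range(0, len(s), page_size):
        st = depth                       # depth carried over from the previous page
        seen = []
        for ch in s[start:start + page_size]:
            depth += -1 if ch == ")" else 1
            seen.append(depth)
        headers.append((st, min(seen), max(seen)))
    return headers

# Root a and its first book b, without the root's closing parenthesis.
s = "abz)e)cf)g))i)j))"
headers = page_headers(s, page_size=6)
```

With these tuples a reader can tell, from the header alone, whether a page can contain a node at a given level.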
Advantages of the page layout • Using the extra tuple (st, lo, hi), we can guess which page holds the following sibling or the parent of a node. • It is easy to insert nodes into the string representation of the tree. • E.g. to insert a b) c)) as a subtree of the first f node in page 1: • Allocate a new page with the content a b) c)); • Cut the content after f in page 1 and paste it at the end of the content of the new page; • Insert the new page between page 1 and page 2; • Update the tuple (st, lo, hi) for page 1. • Before: page 1 holds a b z) e) c f ) g)) and page 2 holds i) j)) b z) e) d f • After: page 1 holds a b z) e) c f, the new page holds a b) c)) ) g)), and page 2 is unchanged.
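The insertion steps can be sketched in Python with pages modelled as a list of strings. The insert_subtree helper is hypothetical, and recomputing the (st, lo, hi) tuples for the touched pages is left out of this sketch.

```python
def insert_subtree(pages, page_no, pos, subtree):
    """Insert subtree as the first child of the node at pages[page_no][pos]."""
    page = pages[page_no]
    head, tail = page[:pos + 1], page[pos + 1:]
    pages[page_no] = head                       # the page now ends at the parent node
    pages.insert(page_no + 1, subtree + tail)   # new page: subtree, then the cut tail
    return pages

pages = ["abz)e)cf)g))", "i)j))bz)e)df"]
before = "".join(pages)
pos = pages[0].index("f")                       # the first f node in page 1
insert_subtree(pages, 0, pos, "ab)c))")
after = "".join(pages)
```

Only one existing page is rewritten and one page is added; every later page keeps its content.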
XML path queries at the physical level • In NoK pattern matching, the only operation on the subject tree is the iteration over the children of a specific node. • Using the physical storage technique, this operation is divided into: • find the first child of a specific node • find the following sibling of a node • Because the string representation is in pre-order, these two operations can be performed by looking at the node level information of each page from left to right, without reconstructing the tree structure.
Example • Find the first child of character b in the first page. • The first child of b must be the next character, if that character is not ")". • If b is at level L, the first child of b is at level L+1. • Answer: the right neighbor z
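In Python, the first-child rule reduces to a one-character lookahead. This sketch assumes one-letter tags and works on the raw string; first_child is an illustrative name.

```python
def first_child(s, i):
    """Index of the first child of the node at s[i], or None if it is a leaf."""
    nxt = i + 1
    if nxt < len(s) and s[nxt] != ")":
        return nxt
    return None

s = "abz)e)cf)g))i)j))"
child_of_b = first_child(s, 1)     # b is at index 1
child_of_z = first_child(s, 2)     # z is a leaf
```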
Example • Find b's following sibling. • The following sibling must be located to the right of b in the string and its level must be the same as b's. • Answer: b in page 2.
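A Python sketch of the following-sibling search over pages, including a page-skipping test based on the header tuples. It assumes the same conventions as the rest of these sketches: one-letter tags, root at level 1, and (st, lo, hi) computed over the running depth after every symbol. The skip condition is derived from those assumptions and is not quoted from the paper.

```python
def page_headers(pages):
    """(st, lo, hi) per page, using the running depth after each symbol."""
    headers, depth = [], 0
    for page in pages:
        st, seen = depth, []
        for ch in page:
            depth += -1 if ch == ")" else 1
            seen.append(depth)
        headers.append((st, min(seen), max(seen)))
    return headers

def following_sibling(pages, headers, page_no, pos, level):
    """(page, offset) of the following sibling of the node at
    pages[page_no][pos], which is at the given level, or None."""
    depth = level                     # running depth just after reading the node
    start = pos + 1
    for p in range(page_no, len(pages)):
        if p > page_no:
            st, lo, _hi = headers[p]
            depth = st
            if min(st, lo) >= level:
                continue              # depth never reaches level-1 here: skip page
        for j in range(start, len(pages[p])):
            if pages[p][j] == ")":
                depth -= 1
                if depth < level - 1:
                    return None       # the parent was closed: no sibling
            else:
                if depth == level - 1:
                    return (p, j)     # first tag after the node's subtree closed
                depth += 1
        start = 0
    return None

pages = ["abz)e)cf)g))", "i)j))bz)e)df"]
hit = following_sibling(pages, page_headers(pages), 0, 1, 2)   # the first b

small = ["abz)", "e)cf", ")g))", "i)j)", ")bz)", "e)df"]       # same string, 4-symbol pages
hit_small = following_sibling(small, page_headers(small), 0, 1, 2)
```

With the smaller pages, the three pages between the two b nodes are skipped on their headers alone, which is the benefit the (st, lo, hi) tuple is meant to provide.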
Experimental Setting • The queries are selected based on the following three properties of path expressions: • Selectivity: a path expression returning a small number of results should be evaluated faster than one returning a large number; • Topology: the shape of the pattern tree can be a single path or bushy; • Value constraints: when value constraints exist, an index on values may be used to quickly locate the starting point for NoK pattern matching.
Conclusion • Defined a special type of pattern tree: the NoK pattern tree; • Proposed a novel approach for efficiently evaluating path expressions by NoK pattern matching; • NoK pattern matching can be evaluated efficiently using the physical storage scheme; • Performance evaluation has shown that this system performs better than, or comparably to, the existing systems.
Limitations • The locating step of the NoK pattern matching process needs further optimization. • A path index could be used instead of a tag-name index. • Concurrency control, and how it affects the update process, remains to be considered.
References • Ning Zhang, Varun Kacholia, M. Tamer Ozsu. A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML. • D. Chamberlin, P. Fankhauser, M. Marchiori, and J. Robie. XML Query Use Case. Available at http://www.w3.org/TR/xmlquery-use-case. • E. Cohen, H. Kaplan, S. Padmanabhan, and R. Bordawekar. Labeling Your XML. Preliminary version presented at CASCON'02, October 2002. • N. Zhang and M. T. Ozsu. Optimizing Correlated Path Expressions in XML Languages. Technical Report CS-2002-36, University of Waterloo, November 2002. Available at http://db.uwaterloo.ca/~ddbms/publications/xml/TR-CS-2002-36.pdf.
Thank you! Questions?