250 likes | 369 Views
Query Caching and View Selection for XML Databases. Bhushan Mandhani Dan Suciu University of Washington Seattle, USA. Caching in Query Processing. Buffer Cache: In-memory pool of disk pages. Semantic Cache: Stores query results instead. Can be in-memory or on disk.
E N D
Query Caching and View Selection for XML Databases Bhushan Mandhani Dan Suciu University of Washington Seattle, USA
Caching in Query Processing • Buffer Cache: In-memory pool of disk pages. • Semantic Cache: • Stores query results instead. • Can be in-memory or on disk. • Proposed in [Franklin et al., VLDB 96] for use in client-server systems.
Semantic XPath Cache • A cached query result is a materialized view. • View V answers query Q if there exists C s.t C º V = Q • C, V & Q are all XPath here. • Cache contains some views {V1,…,Vn}. • Query Q is a hit if some Vi answers it.
Motivation for the Semantic Cache • C º V = Q : a cache hit results in a simpler query, on a much smaller XML fragment. • Cache hits were processed two orders of magnitude faster than misses. • Can also be maintained outside the database • Ex: application tier. • Application tier caching for database-driven websites has become increasingly popular [Mohan et al., SIGMOD 02].
Example Illustrating C º V = Q • V = /a[u[@v]/w]/b[x//y] • Q = /a[u[w]/@v]/b[x/y//z]/c • C = /b[x/y//z]/c
The Query/View Answerability Problem Given view V and query Q, does V answer Q, and if yes, then what should C be?
XPath Tree Patterns • XPath queries have natural representation as tree patterns. • E.g. If Q = a[v]/b[@w=''val1''][x[.//y]]//c[z>val2] • The Query Axis is the path from the root node to the result node. • The Query Depth is the number of axis nodes. • Prefix(Q, k) is the query obtained by truncating Q at its k-th axis node. • E.g. Prefix(Q, 2) = a[v]/b • Preds(Q, k) is the set of predicates of the k-th axis node of Q. • E.g. Preds(Q, 2) ={[@w=''val1''], [x[.//y]]}
XPath Containment • A С B if the result of A will always be a subset of the result of B. • Checking XPath containment is coNP-complete [Miklau & Suciu, PODS 02]. • A containment mapping maps nodes in B’s tree pattern to those in A’s, and establishes A С B. • It is a sound but incomplete condition. • It can be determined in polynomial time. • Containment is different from answerability. • E.g. let A = /a[x]/b, B = /a/b
Criteria for Query/View Answerability “Rewriting XPath Queries using Materialized Views”, Xu and Ozsoyoglu, VLDB 2005
An Example of Answerability Checking • Let V = /a[u[@v]/w]/b[x//y] • Let Q = /a[u[w]/@v]/b[x/y//z]/c • Prefix(V, 2) = /a[u[@v]/w]/b and Prefix(Q, 2) = /a[u[w]/@v]/b. The first condition is satisfied. • Preds(V, 2) = {[x//y]} and Preds(Q, 2) = {[x/y//z]}. The second condition is also satisfied.
Two Theoretical Results Theorem:If two tree patterns are minimal, and containment mappings exist both ways, then they are isomorphic. Theorem:If some view V answers query Q, then C can be set to the subtree of Q rooted at its k-th axis node, where k is the query depth of V.
The Cost of Answerability Checking • Answerability checking involves tree operations: • Checking isomorphism between trees. • Looking for containment mappings between trees. • Cache lookup by checking each view could give a high lookup overhead. • Our Solution: Check answerability by string matching. This will be: • Cheaper • Amenable to indexing in a standard RDBMS.
Checking Tree Isomorphism • We represent each tree pattern with its “normalized” query string. • To obtain the normalized query: • At each axis node, do a DFS into each predicate subtree. • Before returning from a node, append its children node labels to its label, in lexicographic order. • Concatenate all axis node labels to get the normalized query.
Checking Predicate Containment • For the tree of each predicate in Preds(Q, k), generate all trees that “containment map” to it. • ConPreds(Q, k) is the set of normalized predicates obtained from these generated trees.
String-Based Criteria for Answerability The second condition is tweaked to correctly support comparison predicates.
Cache Organization • Let V = /a[u]/b[v][@x=50][y/z=“str”]. • Insert into XmlData (‘<result></result>’). Record viewId. • Insert into Prefix (‘/a[u]/b’). Record prefixId. • Insert into View (viewId, prefixId, ‘@x=50’, ’y/z=“str”|v’).
Cache Lookup • Prefix(V, k) = P.prefix = Prefix(Q, k). • V.predЄ Preds(V, k) and V.predЄ ConPreds(Q, k).
Cache Warm-up as View Selection • Given the warm-up workload W, we generate anotherworkload S. • S is likely to have overlap with the test workload. • Problem: Choose some m views from S which together answer a maximal subset of S. • This is a variant of the set cover problem, which is NP-complete. • We use a variant of the greedy approximation algorithm for set cover.
Experimental Setup • The data instance was a 300 MB XML document created using the XMark generator. • Expts run on a Pentium 4, 512 MB RAM machine. • SQL Server 2005 beta used for storing the XML data as well as the cache. • We compare our cache with a naïve cache which stores (query string, result XML) pairs.
Query Workloads Used • We implemented our own XPath query generator for creating workloads for our expts. • Query structure is generated randomly. • Value for each predicate is obtained by sampling from its set of values taken, using Zipf distribution. • Queries with depths 3, 4 and 5 were given 2, 2 and 3 predicates respectively.
Conclusions • We defined a notion of query/view answerability for XPath. • We showed how an efficient semantic cache based on these ideas can be employed. • We demonstrated the scalability of cache lookup, and performance gains in query processing.