Approximate XML Query Answers
Alkis Polyzotis (UC Santa Cruz)
Minos Garofalakis (Bell Labs)
Yannis Ioannidis (U. of Athens, Hellas)

Motivation
• XML: de facto standard for data exchange
• Development of the “XML Warehouse”
• Conflict between “on-line” and query execution cost
  • Increased query response times
  • Users might wait for uninteresting results
[Figure: query Q is evaluated against the warehouse of XML data and returns result R]

Approximate Query Answers
• Evaluate the query over a concise data synopsis and obtain an approximation R’ of the true result R
• Use the approximate result as timely feedback
  • User can assess the “value” of the query
• Goal: reduce the number of evaluated queries
[Figure: Q evaluated over the synopsis yields R’; Q evaluated over the warehouse of XML data yields R]

Contributions
• TreeSketch Synopses
  • Structural summaries for XML data
  • Approximate answers for complex twig queries
  • Summarization model: structural clustering of elements
  • Efficient processing and construction
• Element Simulation Distance
  • Novel distance metric for XML data
  • Captures “approximate” similarity between two XML trees
• Experimental Results
  • Accurate approximate answers for low space budgets
  • Low-error selectivity estimates
  • Efficient construction algorithm

Outline
• Preliminaries
• TreeSketches
  • Synopsis model
  • Computing approximate answers
  • Summary construction
• Element Simulation Distance
• Experimental Study
• Conclusions

Data and Query Model
[Figure panels: XML Document, Twig Query, Nesting Tree, Binding Tuples]
Twig query: q0 (root); q1: //section; q2: ./figure; q3: .//equation
Binding tuples:
q0  q1  q2  q3
r   s2  f4  e8
r   s2  f4  e10
r   s2  f5  e8
r   s2  f5  e10

Problem Definition
• Process a twig query over a synopsis
• Compute an approximation of the nesting tree
[Figure: the twig query (q0; q1: //section; q2: ./figure; q3: .//equation) evaluated over the XML data gives the true nesting tree; evaluated over the synopsis it gives the approximate nesting tree]

Graph Synopsis
• Synopsis node: set of elements of the same tag
• Synopsis edge: document edge(s)
[Figure: XML document (elements r, p1, s2, s3, f4 to f7, e8, e10, c9, c11, c12, c13) beside its graph synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4)]

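
As a rough illustration of the node and edge mapping on the Graph Synopsis slide, the sketch below builds the coarsest such synopsis, with exactly one node per tag. The toy document, its parent/child links, and the names `graph_synopsis`, `children`, and `tags` are assumptions made for this example; the synopsis drawn on the slide is finer, since it keeps two separate F nodes.

```python
from collections import defaultdict

# Hypothetical toy document: element -> list of child elements.
# The parent/child links are assumed for illustration.
children = {
    "r": ["p1"],
    "p1": ["s2", "s3"],
    "s2": ["f4", "f5"],
    "s3": ["f6", "f7"],
    "f4": ["e8", "c9"],
    "f5": ["e10", "c11"],
    "f6": ["c12"],
    "f7": ["c13"],
}
# Element -> tag (first letter of the element name in this toy example).
tags = {e: e[0] for e in ["r", "p1", "s2", "s3", "f4", "f5", "f6", "f7",
                          "e8", "e10", "c9", "c11", "c12", "c13"]}


def graph_synopsis(children, tags):
    """Group elements by tag and keep one synopsis edge per tag pair
    that is connected by at least one document edge."""
    extent = defaultdict(set)
    for element, tag in tags.items():
        extent[tag].add(element)
    edges = set()
    for parent, kids in children.items():
        for child in kids:
            edges.add((tags[parent], tags[child]))
    return dict(extent), edges


extent, edges = graph_synopsis(children, tags)
for tag in sorted(extent):
    print(f"{tag.upper()}({len(extent[tag])})")
print(sorted(edges))
```

Grouping by tag alone is only the simplest partition; any refinement that splits a tag into several nodes is also a graph synopsis under the definition on the slide.
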
TreeSketch Synopsis
• Augment the graph synopsis with edge counts
• count[u,v]: mean number of children in v per element in u
[Figure: XML document beside its TreeSketch; nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4), with each edge annotated by its count]

TreeSketch Synopsis (continued)
[Figure: the same XML document beside a coarser TreeSketch with a single figure node; nodes R(1), P(1), S(2), F(4), E(2), C(4); edge counts shown: 1, 2, 2, 1, 0.5]

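
The definition of count[u,v] can be made concrete with a short sketch. It assumes a hypothetical toy document (the parent/child links are invented for illustration) and a partition `part` that maps every element to a synopsis node; `edge_counts` is a name chosen here, not one taken from the slides.

```python
from collections import Counter

# Hypothetical toy document: element -> list of child elements.
children = {
    "r": ["p1"],
    "p1": ["s2", "s3"],
    "s2": ["f4", "f5"],
    "s3": ["f6", "f7"],
    "f4": ["e8", "c9"],
    "f5": ["e10", "c11"],
    "f6": ["c12"],
    "f7": ["c13"],
}
elements = ["r", "p1", "s2", "s3", "f4", "f5", "f6", "f7",
            "e8", "e10", "c9", "c11", "c12", "c13"]
# Assumed partition: one synopsis node per tag, named by the upper-case tag.
part = {e: e[0].upper() for e in elements}


def edge_counts(children, part):
    """count[u, v] = (document edges from u into v) / (elements in u)."""
    size = Counter(part.values())
    total = Counter()
    for parent, kids in children.items():
        for child in kids:
            total[(part[parent], part[child])] += 1
    return {(u, v): n / size[u] for (u, v), n in total.items()}


counts = edge_counts(children, part)
for (u, v), c in sorted(counts.items()):
    print(f"count[{u},{v}] = {c}")
```

With this partition the four figures share two equations, so count[F,E] is 0.5 even though no individual figure has half an equation; that averaging is where the approximation comes from.
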
TreeSketches and Clustering
• TreeSketch: a clustering of elements based on structure
  • All elements in a node are mapped to a “centroid”
  • Tight clusters give an accurate synopsis
  • The perfect synopsis corresponds to a perfect clustering
• Synopsis quality is quantified by clustering error
  • Options: Manhattan Distance, Squared Error, …
  • Quality can be measured independently of a workload
  • Key for effective construction

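
One way to read the clustering view is sketched below: each element is described by a vector of child counts, the centroid of a node is the mean vector, and the clustering error sums each element's distance to that centroid. The representation of vectors as dictionaries and the names `centroid` and `cluster_error` are assumptions for this example, and the exact error definition used by the authors may differ.

```python
def centroid(vectors):
    """Mean child-count vector of a cluster; missing entries count as 0."""
    keys = {k for v in vectors for k in v}
    return {k: sum(v.get(k, 0) for v in vectors) / len(vectors) for k in keys}


def cluster_error(vectors, metric="squared"):
    """Total distance of the cluster's elements from its centroid."""
    c = centroid(vectors)
    if metric == "squared":
        return sum((v.get(k, 0) - c[k]) ** 2 for v in vectors for k in c)
    if metric == "manhattan":
        return sum(abs(v.get(k, 0) - c[k]) for v in vectors for k in c)
    raise ValueError(f"unknown metric: {metric}")


# Hypothetical cluster of four figure elements: two have an equation
# and a caption, two have only a caption.
figures = [{"E": 1, "C": 1}, {"E": 1, "C": 1}, {"C": 1}, {"C": 1}]

print(centroid(figures))
print(cluster_error(figures, "squared"))
print(cluster_error(figures, "manhattan"))
```

A cluster whose elements all have the same child counts has zero error under either metric, which matches the idea that a perfect clustering gives the perfect synopsis.
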
Computing Approximate Answers
• Compute the TreeSketch of the approximate answer
• Accuracy depends on the quality of the clustering
[Figure: the query (q0; q1: //section; q2: .//caption; q3: .//equation) is evaluated over the TreeSketch to produce the approximate nesting tree; one result edge count is obtained as 1+1=2]

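
The "1+1=2" annotation in the figure suggests that counts along alternative synopsis paths are added. The sketch below is a simplified estimator in that spirit, not the authors' evaluation algorithm: it estimates the mean number of descendants with a given tag per element of a synopsis node by multiplying edge counts along each path and summing over paths. The synopsis, the node names, and `descendant_count` are hypothetical, and the synopsis is assumed to be acyclic.

```python
# Hypothetical TreeSketch: (u, v) -> count[u, v].
counts = {
    ("R", "P"): 1.0,
    ("P", "S"): 2.0,
    ("S", "Fa"): 1.0,
    ("S", "Fb"): 1.0,
    ("Fa", "E"): 1.0,
    ("Fa", "C"): 1.0,
    ("Fb", "C"): 1.0,
}
# Synopsis node -> tag of the elements it holds.
tag_of = {"R": "r", "P": "p", "S": "s", "Fa": "f", "Fb": "f",
          "E": "e", "C": "c"}


def descendant_count(counts, node, tag, tag_of):
    """Estimated number of descendants with `tag` per element of `node`.

    Multiplies edge counts along each downward path and sums over paths.
    Assumes the synopsis graph has no cycles.
    """
    total = 0.0
    for (u, v), c in counts.items():
        if u != node:
            continue
        hit = 1.0 if tag_of[v] == tag else 0.0
        total += c * (hit + descendant_count(counts, v, tag, tag_of))
    return total


print(descendant_count(counts, "S", "c", tag_of))  # captions per section
print(descendant_count(counts, "S", "e", tag_of))  # equations per section
print(descendant_count(counts, "R", "c", tag_of))  # captions under the root
```

In this toy synopsis a section reaches captions through both figure nodes, so the two path contributions of 1 add up to 2.
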
TreeSketch Construction
• Given an XML tree T, build a TreeSketch of size B
• Difficult clustering problem
  • Space dimensionality depends on the clustering itself
• Construction based on bottom-up clustering
  • Compress the perfect synopsis by merging clusters
  • Best merge determined by marginal gains
[Figure: sequence of synopses shrinking from the perfect synopsis down to the space budget]

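
The bottom-up idea can be sketched as a naive greedy loop. This is a simplification under several assumptions, not the construction algorithm from the talk: it starts from one cluster per element rather than from the perfect synopsis, measures size as the number of clusters, and picks the same-tag merge that leaves the smallest total squared error instead of using a marginal-gain ratio. The toy tree and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def squared_error(children, part):
    """Sum over elements of the squared distance between the element's
    per-cluster child counts and the mean counts of its own cluster."""
    vecs = defaultdict(list)  # cluster id -> child-count vectors
    for element, cluster in part.items():
        vec = defaultdict(int)
        for child in children.get(element, ()):
            vec[part[child]] += 1
        vecs[cluster].append(vec)
    err = 0.0
    for members in vecs.values():
        targets = {t for vec in members for t in vec}
        for t in targets:
            mean = sum(vec.get(t, 0) for vec in members) / len(members)
            err += sum((vec.get(t, 0) - mean) ** 2 for vec in members)
    return err


def greedy_merge(children, tags, budget):
    """Merge same-tag clusters until at most `budget` clusters remain."""
    part = {e: e for e in tags}  # cluster id = a representative element
    while len(set(part.values())) > budget:
        best = None
        for a, b in combinations(sorted(set(part.values())), 2):
            if tags[a] != tags[b]:
                continue
            trial = {e: (a if cl == b else cl) for e, cl in part.items()}
            cost = squared_error(children, trial)
            if best is None or cost < best[0]:
                best = (cost, trial)
        if best is None:  # nothing left to merge
            break
        part = best[1]
    return part


# Hypothetical toy tree.
children = {"r": ["s1", "s2"], "s1": ["f1", "f2"], "s2": ["f3"],
            "f1": ["e1"], "f3": ["e2"]}
tags = {"r": "r", "s1": "s", "s2": "s", "f1": "f", "f2": "f", "f3": "f",
        "e1": "e", "e2": "e"}

part5 = greedy_merge(children, tags, budget=5)
part4 = greedy_merge(children, tags, budget=4)
print(part5)
print(part4)
```

Recomputing the error after every trial merge reflects the remark that the dimensionality depends on the clustering itself: merging child clusters changes the count vectors of their parents.
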
Depth-Guided Merging
• Key observation: two elements have similar structure if their children have similar structure
  • Children clusters should be merged first
• Bottom-up merging, based on depth
  • Depth: distance from the leaves of the tree
  • Build a pool of candidate merges by increasing depth
  • Replenish the pool when it falls below a given threshold
• Improved construction time with good performance

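
The depth ordering can be sketched as follows, assuming that "distance from the leaves" means the length of the longest downward path to a leaf and that candidates are same-tag pairs of elements. The toy tree and the names `heights` and `candidate_pool` are hypothetical, and the replenishment threshold is left out.

```python
from itertools import combinations


def heights(children, root):
    """Height of every node: 0 for leaves, else 1 + max child height."""
    h = {}

    def visit(node):
        kids = children.get(node, ())
        h[node] = 1 + max((visit(k) for k in kids), default=-1)
        return h[node]

    visit(root)
    return h


def candidate_pool(children, tags, root, max_depth):
    """Same-tag candidate merges whose depth is at most `max_depth`,
    ordered by increasing depth so that deeper levels merge first."""
    h = heights(children, root)
    pool = [(max(h[a], h[b]), a, b)
            for a, b in combinations(sorted(tags), 2)
            if tags[a] == tags[b] and max(h[a], h[b]) <= max_depth]
    return sorted(pool)


# Hypothetical toy tree.
children = {"r": ["s1", "s2"], "s1": ["f1", "f2"], "s2": ["f3"],
            "f1": ["e1"], "f3": ["e2"]}
tags = {"r": "r", "s1": "s", "s2": "s", "f1": "f", "f2": "f", "f3": "f",
        "e1": "e", "e2": "e"}

print(heights(children, "r"))
print(candidate_pool(children, tags, "r", max_depth=0))
print(candidate_pool(children, tags, "r", max_depth=1))
```

Raising `max_depth` step by step plays the role of replenishing the pool: leaf-level merges are offered first, and merges of their parents only appear once the depth limit is increased.
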
Error of Approximation
• Error: distance between R’ and R
• Popular metric: tree-edit distance
  • Min-cost sequence of operations that transform R’ to R
  • Measures syntactic differences between R and R’
  • Not intuitive for approximate answers!
[Figure: three trees T1, T, and T2, each with root r, s children, and f and e leaves annotated with counts; annotations: “Same counts, opposite trait” and “Different counts, similar trait”]

Element Simulation Distance
• Capture approximate similarity between R and R’
• u simulates v: u and v have identical structure
• ESD(u,v): “degree” of simulation between u and v
  • How well the structure of u matches the structure of v
  • Modeled as the distance between multi-sets
• Efficient computation using perfect summaries
[Figure: the three trees T1, T, and T2, each with root r, s children, and f and e leaves annotated with counts]

Experimental Methodology
• Data sets: XMark, DBLP, IMDB, SwissProt
• Workload: 1000 random twig queries
• Evaluation metrics:
  • Average ESD for approximate answers
  • Mean absolute relative error for selectivity estimation

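
For reference, the second metric can be sketched in its plain form, the mean of |estimate − true| / true over the workload. Whether the study applies any additional adjustment (for example, a lower bound on the denominator for very small results) is not stated on the slide; the function name and the sample numbers below are made up for illustration.

```python
def mean_abs_relative_error(pairs):
    """Mean of |estimate - true| / true over (true, estimate) pairs.

    Assumes every true result size is positive.
    """
    return sum(abs(est - true) / true for true, est in pairs) / len(pairs)


# Hypothetical (true selectivity, estimated selectivity) pairs.
workload = [(100, 90), (50, 60)]
print(mean_abs_relative_error(workload))
```
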
Approximate Answers: IMDB
[Figure: results for IMDB (~102K elements); average result size: 3,477 tuples]

Selectivity Estimation: SwissProt
[Figure: results for SwissProt (~182K elements); average result size: 104,592 tuples]

Conclusions
• Approximate query answering for XML databases
• TreeSketch Synopses
  • Structural summaries for tree-structured XML
  • Approximate answers for twig queries
  • Model: graph synopsis + edge counts
  • Efficient processing and construction
• Element Simulation Distance
  • Captures approximate similarity between XML trees
• Experimental Results
  • High accuracy for low space budgets
  • Efficient construction

TreeSketch Model (2/2)
• Average number of children corresponds to the edge count
[Figure: XML document beside its TreeSketch with nodes R, P(1), S(2), F(2), F(2), E(2), C(4); edges annotated with counts, with #C and #E marked]

XML Document
Tag legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation
[Figure: document tree with root r, paper p1, sections s2 and s3, and figure, equation, and caption elements below them]