Approximate XML Query Answers

Alkis Polyzotis (UC Santa Cruz), Minos Garofalakis (Bell Labs), Yannis Ioannidis (U. of Athens, Hellas)

Motivation
• XML: de facto standard for data exchange
• Development of the “XML Warehouse”
• Conflict between “on-line” and query execution cost
  • Increased query response times
  • Users might wait for uninteresting results
[Figure: query Q is sent to the XML data warehouse, which returns result R]

Approximate Query Answers
• Evaluate query over a concise data synopsis and obtain an approximation R’ of the true result
• Use approximate result as timely feedback
  • User can assess the “value” of the query
• Goal: reduce number of evaluated queries
[Figure: query Q is evaluated over a synopsis of the XML data to obtain R’, instead of over the warehouse to obtain R]

Contributions
• TreeSketch Synopses
  • Structural summaries for XML data
  • Approximate answers for complex twig queries
  • Summarization model <--> Structural clustering of elements
  • Efficient processing and construction
• Element Simulation Distance
  • Novel distance metric for XML data
  • Captures “approximate” similarity between two XML trees
• Experimental Results
  • Accurate approximate answers for low space budgets
  • Low-error selectivity estimates
  • Efficient construction algorithm

Outline
• Preliminaries
• TreeSketches
  • Synopsis model
  • Computing approximate answers
  • Summary construction
• Element Simulation Distance
• Experimental Study
• Conclusions

Data and Query Model
[Figure: an XML document; a twig query with root q0, q1 = //section, q2 = ./figure, q3 = .//equation; the resulting nesting tree; and the binding tuples over (q0, q1, q2, q3): (r, s2, f4, e8), (r, s2, f4, e10), (r, s2, f5, e8), (r, s2, f5, e10)]

Problem Definition
• Process twig query over a synopsis
• Compute approximation of nesting tree
[Figure: the twig query (q0, q1 = //section, q2 = ./figure, q3 = .//equation) evaluated over the synopsis yields an approximate nesting tree; evaluated over the XML data it yields the true nesting tree]

TreeSketch Model

Graph Synopsis
• Synopsis node <--> Set of elements of the same tag
• Synopsis edge <--> Document edge(s)
[Figure: XML document and its graph synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4)]

TreeSketch Synopsis
• Augment graph-synopsis with edge counts
• count[u,v]: mean #children in v per element in u
[Figure: XML document and its TreeSketch: the graph synopsis R(1), P(1), S(2), F(2), F(2), E(2), C(4) with a count on each edge]
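
The edge-count definition can be made concrete with a minimal Python sketch. The dictionary-based tree encoding, the function name `build_treesketch`, and the small document below are assumptions made for illustration; they are not the document drawn on the slide.

```python
from collections import defaultdict

def build_treesketch(children, cluster_of):
    """Summarize a tree under a given partition of its elements.

    children:   element id -> list of child element ids
    cluster_of: element id -> synopsis node label
    Returns (size, count) where size[u] is the number of elements in u and
    count[u, v] is the mean number of children in v per element of u.
    """
    size = defaultdict(int)
    total = defaultdict(int)  # total document edges from u into v
    for elem, node in cluster_of.items():
        size[node] += 1
        for child in children.get(elem, []):
            total[node, cluster_of[child]] += 1
    count = {(u, v): n / size[u] for (u, v), n in total.items()}
    return dict(size), count

# Hypothetical document: one paper, two sections, four figures.
children = {"r": ["p1"], "p1": ["s2", "s3"], "s2": ["f4"], "s3": ["f5", "f6", "f7"]}
elements = ["r", "p1", "s2", "s3", "f4", "f5", "f6", "f7"]
cluster_of = {e: e[0].upper() for e in elements}  # one synopsis node per tag

size, count = build_treesketch(children, cluster_of)
print(size)
print(count[("S", "F")])  # 2.0: s2 has 1 figure and s3 has 3, so the mean is 2
```

Grouping by tag is only one possible partition; any mapping from elements to synopsis nodes can be passed as `cluster_of`.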

TreeSketch Synopsis
• Is there a lossless synopsis?
• What is the quality of a lossy synopsis?
[Figure: XML document and its TreeSketch]

Count Stability
• (u,v) count-stable: all elements in u have the same child-count in v
[Figure: XML document and its TreeSketch with edge counts]
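
A count-stability check follows directly from the definition. The sketch below is illustrative only: the function name `is_count_stable`, the tree encoding, and the example document are assumptions, not taken from the slides.

```python
def is_count_stable(children, cluster_of, u, v):
    """True if every element of synopsis node u has the same number of children in v."""
    child_counts = {
        sum(1 for c in children.get(e, []) if cluster_of[c] == v)
        for e, node in cluster_of.items()
        if node == u
    }
    return len(child_counts) <= 1

# Hypothetical document: s2 has one figure, s3 has three.
children = {"r": ["p1"], "p1": ["s2", "s3"], "s2": ["f4"], "s3": ["f5", "f6", "f7"]}
elements = ["r", "p1", "s2", "s3", "f4", "f5", "f6", "f7"]
cluster_of = {e: e[0].upper() for e in elements}

print(is_count_stable(children, cluster_of, "P", "S"))  # True: the only paper has 2 sections
print(is_count_stable(children, cluster_of, "S", "F"))  # False: 1 figure vs. 3 figures
```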

Count-Stable TreeSketch
• A count-stable synopsis can recover the input tree
• Efficient one-pass construction
• Stable summary can be too large for practical use!
[Figure: XML document and its count-stable TreeSketch with nodes R(1), P(1), S(1), S(1), F(2), F(2), E(2), C(4)]

Lossy TreeSketch
[Figure: XML document and a lossy TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4)]

TreeSketches and Clustering
• TreeSketch <--> Element clustering
  • All elements in a node are mapped to a “centroid”
  • Tight clusters <--> Accurate synopsis
• Synopsis quality <--> Clustering error
  • Options: Manhattan Distance, Squared Error, …
• Quality can be measured independent of a workload
  • Key for effective construction
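
One way to read the clustering view is sketched below for the squared-error option. It assumes that each element is represented by its vector of child counts per synopsis node and that the centroid of a node is the mean of those vectors (the edge counts); the slides do not spell out this formulation, so treat it as an assumption. The name `squared_error` and the example document are hypothetical.

```python
from collections import Counter, defaultdict

def squared_error(children, cluster_of):
    """Sum of squared distances between each element's child-count vector
    and the mean vector (centroid) of its synopsis node."""
    vectors = defaultdict(list)  # synopsis node -> child-count vectors of its elements
    for e, u in cluster_of.items():
        vectors[u].append(Counter(cluster_of[c] for c in children.get(e, [])))
    err = 0.0
    for vecs in vectors.values():
        targets = set().union(*vecs)  # every child node seen from this cluster
        for v in targets:
            mean = sum(vec[v] for vec in vecs) / len(vecs)
            err += sum((vec[v] - mean) ** 2 for vec in vecs)
    return err

# Hypothetical document: s2 has one figure, s3 has three.
children = {"r": ["p1"], "p1": ["s2", "s3"], "s2": ["f4"], "s3": ["f5", "f6", "f7"]}
elements = ["r", "p1", "s2", "s3", "f4", "f5", "f6", "f7"]

by_tag = {e: e[0].upper() for e in elements}  # lossy: both sections share one node
stable = dict(by_tag, s2="S1", s3="S2")       # each section gets its own node

print(squared_error(children, by_tag))  # 2.0: (1 - 2)^2 + (3 - 2)^2
print(squared_error(children, stable))  # 0.0: every cluster is count-stable
```

The error is computed from the document and the partition alone, which matches the point that quality can be measured without a query workload.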

Computing Approximate Answers
• Compute TreeSketch of approximate answer
• Accuracy depends on quality of clustering
[Figure: a twig query (q0, q1 = //section, q2 = .//caption, q3 = .//equation) is processed over the TreeSketch to produce an approximate nesting tree, itself a TreeSketch]
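
As a much-simplified illustration of how edge counts turn into estimates, the sketch below handles only a single child-axis path of synopsis nodes and multiplies the counts along it. The evaluation described on the slide builds a full approximate nesting tree and also covers descendant axes and branching, none of which is attempted here. The function name `estimate_path_count` and the synopsis values are assumptions.

```python
def estimate_path_count(size, count, path):
    """Estimate how many elements are reached by following a child-axis path
    of synopsis nodes, starting from all elements of path[0]."""
    estimate = float(size[path[0]])
    for u, v in zip(path, path[1:]):
        estimate *= count.get((u, v), 0.0)  # a missing edge means no children in v
    return estimate

# Hypothetical TreeSketch: node sizes and mean child counts per edge.
size = {"R": 1, "P": 1, "S": 2, "F": 4}
count = {("R", "P"): 1.0, ("P", "S"): 2.0, ("S", "F"): 2.0}

print(estimate_path_count(size, count, ["R", "P", "S", "F"]))  # 4.0 = 1 * 1 * 2 * 2
print(estimate_path_count(size, count, ["R", "S"]))            # 0.0: no R -> S edge
```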

TreeSketch Construction
• Given an XML tree T, build a TreeSketch of size B
• Difficult clustering problem
  • Space dimensionality depends on the clustering itself
• Construction based on bottom-up clustering
  • Compress perfect synopsis by merging clusters
  • Best merge determined by marginal gains
  • Heuristic to reduce number of candidate merges
[Figure: successive merges shrink the perfect synopsis until it fits the space budget]
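
The bottom-up idea can be sketched as a greedy loop. This is a toy version under several assumptions: it starts from one cluster per element (lossless, but not the perfect synopsis named on the slide), it measures the budget as a number of synopsis nodes, it picks the merge with the lowest resulting squared error in place of the marginal-gain criterion, and it tries every same-tag pair with no candidate-reduction heuristic. All names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def squared_error(children, cluster_of):
    """Squared distance of each element's child-count vector to its cluster mean."""
    vectors = defaultdict(list)
    for e, u in cluster_of.items():
        vectors[u].append(Counter(cluster_of[c] for c in children.get(e, [])))
    err = 0.0
    for vecs in vectors.values():
        for v in set().union(*vecs):
            mean = sum(vec[v] for vec in vecs) / len(vecs)
            err += sum((vec[v] - mean) ** 2 for vec in vecs)
    return err

def greedy_merge(children, tag, budget):
    """Merge same-tag clusters until at most `budget` synopsis nodes remain."""
    cluster_of = {e: e for e in tag}  # each cluster is labeled by one of its elements
    while len(set(cluster_of.values())) > budget:
        best = None
        for a, b in combinations(sorted(set(cluster_of.values())), 2):
            if tag[a] != tag[b]:
                continue  # a synopsis node only holds elements of one tag
            trial = {e: (a if c == b else c) for e, c in cluster_of.items()}
            err = squared_error(children, trial)
            if best is None or err < best[0]:
                best = (err, trial)
        if best is None:
            break  # no same-tag pair is left to merge
        cluster_of = best[1]
    return cluster_of

# Hypothetical document: s2 has one figure, s3 has three.
children = {"r": ["p1"], "p1": ["s2", "s3"], "s2": ["f4"], "s3": ["f5", "f6", "f7"]}
tag = {e: e[0] for e in ["r", "p1", "s2", "s3", "f4", "f5", "f6", "f7"]}

sketch5 = greedy_merge(children, tag, budget=5)
sketch4 = greedy_merge(children, tag, budget=4)
print(sorted(set(sketch5.values())), squared_error(children, sketch5))
print(sorted(set(sketch4.values())), squared_error(children, sketch4))
```

With a budget of five nodes the figures are merged first because doing so costs no error; only the tighter budget of four forces the two sections together.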

Element Simulation Distance

Error of Approximation
• Error <--> Distance between R’ and R
• Popular metric: Tree-edit distance
  • Min-cost sequence of operations that transform R’ to R
  • Measures syntactic differences between R and R’
  • Not intuitive for approximate answers!
[Figure: tree T compared with trees T1 and T2. Labels: “Same counts, Opposite Trait” and “Different counts, Similar Trait”]

Element Simulation Distance
• Capture approximate similarity between R and R’
• u simulates v: u and v have identical structure
• ESD(u,v): “degree” of simulation between u,v
  • How well the structure of u matches the structure of v
  • Modeled as the distance between multi-sets
• Efficient computation using perfect summaries
[Figure: recursive application of ESD to trees T and T2]

Experimental Results

Methodology
• Data Sets: XMark, DBLP, IMDB, SwissProt
• Workload: 1000 random twig queries
• Evaluation metrics:
  • Average ESD for approximate answers
  • Mean absolute relative error for selectivity estimation
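
For the selectivity metric, a minimal sketch of mean absolute relative error is given below. Skipping queries whose true count is zero is an assumption made here to avoid division by zero; the slides do not say how such queries were treated. The function name and the sample numbers are made up for illustration.

```python
def mean_absolute_relative_error(true_counts, estimates):
    """Average of |true - estimate| / true over queries with a non-zero true count."""
    pairs = [(t, e) for t, e in zip(true_counts, estimates) if t > 0]
    if not pairs:
        raise ValueError("no query with a non-zero true count")
    return sum(abs(t - e) / t for t, e in pairs) / len(pairs)

# Made-up true and estimated selectivities for two queries.
error = mean_absolute_relative_error([100, 50], [75, 75])
print(error)  # 0.375 = (25/100 + 25/50) / 2
```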

Approximate Answers - IMDB
[Plot: IMDB (~102K Elements); Avg. Result Size: 3,477 tuples]

Selectivity Estimation - SwissProt
[Plot: SwissProt (~182K Elements); Avg. Result Size: 104,592 tuples]

Selectivity Estimation - ALL

Conclusions
• Approximate query answering for XML databases
• TreeSketch Synopses
  • Structural summaries for tree-structured XML
  • Approximate answers for twig-queries
  • Model: Graph Synopsis + Edge-counts
  • Efficient processing and construction
• Element Simulation Distance
  • Capture approximate similarity between XML trees
• Experimental Results
  • High accuracy for low space budgets
  • Efficient construction

Questions?

TreeSketch Model (2/2)
• Average number of children <--> Edge count
[Figure: XML document and its TreeSketch with edge counts]

XML Document
[Figure: XML document tree. Tag legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation]

TreeSketch Synopsis
• Augment graph-synopsis with edge counts
• count[u,v]: mean #children in v per element in u
[Figure: XML document and its TreeSketch with nodes R(1), P(1), S(2), F(4), E(2), C(4) and their edge counts]

Depth-Guided Merging
• Key observation: Two elements have similar structure if their children have similar structure
• Bottom-up merging, based on depth
  • Depth: distance from the leaves of the tree
  • Build a pool of candidate merges by increasing depth
  • Replenish the pool when it falls below a given threshold
• Reduced construction time - Accurate synopses
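
The depth used to order candidate merges can be computed in one traversal. The sketch below assumes that "distance from the leaves" means the length of the longest downward path to a leaf; the slide does not say how elements with children at different depths are handled. The function name and the example document are hypothetical.

```python
from collections import defaultdict

def depth_from_leaves(children, root):
    """Map each element to the length of its longest downward path to a leaf."""
    depth = {}

    def visit(elem):
        kids = children.get(elem, [])
        depth[elem] = 1 + max(visit(k) for k in kids) if kids else 0
        return depth[elem]

    visit(root)
    return depth

# Hypothetical document: one paper, two sections, four figures.
children = {"r": ["p1"], "p1": ["s2", "s3"], "s2": ["f4"], "s3": ["f5", "f6", "f7"]}
depth = depth_from_leaves(children, "r")

# Group elements by depth so that merges can be considered leaves-first.
levels = defaultdict(list)
for elem, d in depth.items():
    levels[d].append(elem)
for d in sorted(levels):
    print(d, sorted(levels[d]))
```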

Depth-Guided Merging
• Observation: Two elements have similar structure if their children have similar structure
• Heuristic: If a merge of two clusters is good, then merges of the child clusters are likely to have been good as well
• Bottom-up merging strategy
• Savings in construction time - Accurate synopses
