320 likes | 590 Views
STHoles: A Multidimensional Workload-Aware Histogram. Nicolas Bruno* Columbia University. Surajit Chaudhuri Microsoft Research. Luis Gravano* Columbia University. * Work done in part while the authors were visiting Microsoft Research. SIGMOD 2001. Histograms as Succinct Data Set Summaries.
E N D
STHoles: A Multidimensional Workload-Aware Histogram Nicolas Bruno*Columbia University Surajit ChaudhuriMicrosoft Research Luis Gravano*Columbia University * Work done in part while the authors were visiting Microsoft Research. SIGMOD 2001
Histograms as Succinct Data Set Summaries • Used for selectivity estimation and approximate query processing. • Data set partitioned into buckets, each approximated by aggregate statistics.
Histograms • Each bucket consists of a bounding box and a tuple frequency value. • Uniformity is assumed inside buckets. • Histograms should partition data set in buckets with uniform tuple density. • Multi-dimensional data makes partitioning even more challenging.
Outline • Overview of existing multidimensional histogram techniques. • Introduction to STHoles histograms. • System architecture and STHoles construction algorithm. • Experimental evaluation.
Histograms Techniques: EquiDepth • Correctly identifies core of densest clusters. • Partitioning uses “equi-count” instead of “equi-density” Gaussian Data Set EquiDepth Histogram [Muralikrishna and DeWitt 1988]
Histogram Techniques: MHist • Works well for highly skewed data distributions. • Devotes too many buckets to the densest clusters. • Bad initial “choices” are amplified in later steps. Gaussian Data Set MHist Histogram [Poosala and Ioannidis 1997]
Histogram Techniques: GenHist • More robust than previous techniques (based on multidimensional information). • Difficult to choose right values of various parameters. • Requires at least 5-10 passes over the data. GenHist Histogram [Gunopulos et al. 2000] Gaussian Data Set
Histogram Techniques: STGrid • Incorporates feedback from query execution. • Grid partitioning strategy is sometimes too rigid. • Focuses on efficiency rather than accuracy. Gaussian Data Set STGrid Histogram [Aboulnaga and Chaudhuri 1999]
Our New Histogram Technique: STHoles • Flexible bucket partitioning. • Exploits workload information to allocate buckets. • Query feedback captures uniformly dense regions. • Does not examine actual data set.
Non rectangular region STHoles Histograms • Tree structure among buckets. • Buckets with holes: relaxes rectangular regions while using rectangular bucket structures.
System Architecture for STHoles Range Query
STHoles Construction Algorithm • Initialize histogram H as an empty histogram. • For each query q in workload: 1- Gather simple statistics from query results. 2- Identify candidate holes and drill (add)them as new buckets in H. 3- Merge superfluous buckets in H.
? Drilling New Candidate Buckets For each query q in workload and bucket b in histogram: • Count how many tuples in result stream lie inside qb. • Drill qb as a new bucket (child of b). q
Shrinking Candidate Buckets • Partition constraint: Bounding boxes must be rectangular. • Apply greedy technique to shrink a candidate hole to a rectangle.
Merging Buckets • To avoid exceeding available space. • Merge most “similar” buckets in terms of tuple density.
Parent-Child Merges Eliminate buckets too similar to their parents. Example: The interesting region in bc is covered by its child b1.
Sibling-Sibling Merges • Consolidate buckets with similar densities that cover close regions. • Extrapolate frequency distributions to yet unseen regions.
An Example STHoles Histogram Gaussian Data Set STHoles Histogram
Experimental Setting • Data Sets: • Real: (UCI Repository) • Sample of Census data set (200K tuples) • Cover data set (500K tuples) • Synthetic: Variations of Gaussian and Zipfian(Array) distributions. 200K to 500K tuples, 2 to 4 dimensions. • Histograms: • 1024 available bytes per histogram. • EquiDept, MHist, GenHist, STGrid, STHoles.
Accuracy Metric: Absolute Error. Census data set Biased (tuples) workload Gaussian (area) workload (with some normalization; details in paper) Experimental Setting (cont.) • Workloads [Pagel et al. 1993]: • 1,000 queries. • Query centers follow different distributions: Uniform, Biased, Gaussian. • Query boundaries follow different constraints: area covered, tuples covered.
Biased workload, query boundaries cover around 1% of the data domain Comparison with Other Approaches: Biased Workload
Uniform workload, query boundaries cover around 1% of the data set tuples. Comparison with Other Approaches: Uniform Workload
Biased workload Convergence with Workload
Handling Data Set Updates From Gaussian to Zipfian data distributions.
Other Experiments • Varying: • data skew. • data dimensionality. • histogram size. • workload generation parameters. • number of attributes in queries. • Overhead for intercepting query results in Microsoft SQL Server 2000 is less than 8%. • STHoles lead to robust selectivity estimates across data distributions and workloads. • See full paper for details!
Summary: STHoles, a Multidimensional Workload-Aware Histogram • Exploits query feedback. • Built without examining data set. • Allows bucket nesting to capture complex shapes using only rectangular bucket structures. • Results in robust and accurate selectivity estimations. • In many cases, outperforms the best techniques that access full data sets.
Related Work (Histograms) • Unidimensional: • EquiDepth [Piatetsky-Shapiro and Connell 1984] • MaxDiff [Poosala et al. 1996] • V-Optimal [Jagadish et al. 1998] • Many more! • Multidimensional: • EquiDepth [Muralikrishna and DeWitt 1988] • MHist [Poosala and Ioannidis 1997] • GenHist [Gunopulos et al. 2000] • STGrid [Aboulnaga and Chaudhuri 1999]
Related Work (Other Techniques) • Sampling [Olken and Rotem 1990] • Wavelets [Matias et al. 1997] • Discrete transformations [Lee et al. 1999] • Parametric Curve Fitting [Chen and Roussopoulos 1994]
Evaluation Metric • Absolute Error: • Normalized Absolute Error:
Zipfian Data Set Census Data Set Gaussian Data Set Varying Histogram Size
Gaussian Data Set Zipfian Data Set Census Data Set Varying Spatial Selectivity