Self-tuning Histograms: Building Histograms Without Looking at Data By Ashraf Aboulnaga & Surajit Chaudhuri Presented by Miller Ofer
Traditional histograms • Histograms impose little cost at query time, especially in a large database. • The cost of building histograms by scanning the database is significant, and it prevents us from building all the histograms that would be useful. • Data modifications cause the histogram to lose accuracy.
Self-tuning histogram: • Similar in structure to a traditional histogram. • Built without scanning or sampling the data. • Uses "free" information from query results. • Can be refined on-line (the overall cost of building a self-tuning histogram is very low). • Its accuracy depends on how often it is used: the more it is used, the more it is refined, and the more accurate it becomes.
The main steps in building an ST-histogram: • Initialize the histogram. • Refine the bucket values (frequencies). • Restructure – move the bucket boundaries.
Initializing the histogram: • Required data: • B – number of histogram buckets. • T – number of tuples. • min/max – minimum and maximum values of the attribute. • Assuming a uniform data distribution, initialize each bucket to T/B tuples (see the sketch below).
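A minimal sketch of this initialization step in Python, assuming an equi-width layout over [min, max]; the Bucket and STHistogram names are illustrative and are not taken from the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class Bucket:
    low: float    # lower bound of the bucket's value range
    high: float   # upper bound of the bucket's value range
    freq: float   # estimated number of tuples falling in [low, high]

@dataclass
class STHistogram:
    buckets: list = field(default_factory=list)

def init_histogram(b: int, t: int, vmin: float, vmax: float) -> STHistogram:
    """Create b equi-width buckets over [vmin, vmax]; assuming uniformity,
    each bucket starts with T/B tuples."""
    width = (vmax - vmin) / b
    h = STHistogram()
    for i in range(b):
        h.buckets.append(Bucket(low=vmin + i * width,
                                high=vmin + (i + 1) * width,
                                freq=t / b))
    return h
```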
[Figure: ST-histogram after initialization – all buckets hold the same frequency (tuples per bucket).]
The algorithm for refining the bucket frequencies (second step), UpdateFreq:
begin
1. Find the set of k buckets b1, …, bk that overlap the selection range (b1 and bk are the first and last overlapping buckets).
2. Let est be the estimated result size of the selection range using histogram h.
3. Let act be the actual result size.
4. Compute the estimation error, denoted esterr = act − est.
5. for i = 1 to k do
6.    Compute the fraction of bucket bi that overlaps the selection range.
7.    Give bi a share of the error in proportion to its contribution to the estimate (using the same averaging assumption within the bucket).
8. endfor
end UpdateFreq
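A sketch of this refinement step in Python, reusing the Bucket/STHistogram types from the initialization sketch above. The damping factor alpha and the exact proportional-update formula are assumptions consistent with the slides, not code from the paper.

```python
def update_freq(h: STHistogram, range_low: float, range_high: float,
                act: float, alpha: float = 0.5) -> None:
    """Refine bucket frequencies from one query's feedback (sketch)."""
    # 1. Buckets overlapping the selection range.
    overlap = [b for b in h.buckets if b.high >= range_low and b.low <= range_high]
    # 2. Estimated result size under the current histogram.
    est = 0.0
    fracs = []
    for b in overlap:
        frac = (min(b.high, range_high) - max(b.low, range_low)) / (b.high - b.low)
        fracs.append(frac)
        est += frac * b.freq
    if est == 0:
        return
    # 3. Estimation error, distributed over the overlapping buckets in
    #    proportion to the frequency each contributes to the estimate.
    esterr = act - est
    for b, frac in zip(overlap, fracs):
        b.freq = max(b.freq + alpha * esterr * (frac * b.freq) / est, 0.0)
```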
[Refinement example (figure): act = 60, rangelow = 12.5, rangehigh = 25, low(b1) = 5, high(b1) = 15; the overlapping fraction of b1 and its share of the error are computed as in the algorithm above.]
Restructuring algorithm – motivation: • Frequencies within a bucket are approximated by their average. Thus, high-frequency values will end up in high-frequency buckets, but they may be grouped together with low-frequency values. • When the range of a query aligns with the boundaries of a histogram bucket, no averaging assumption is needed.
Restructuring algorithm – merging step (a simplified sketch follows): • For every pair of consecutive runs of buckets, find the maximum difference in frequency between a bucket in the first run and a bucket in the second run. • Find the minimum of all these maximum differences, denoted mindiff. • if mindiff < m·T then merge the two runs of buckets corresponding to mindiff into one run; look for other runs to merge (go back to step 1). endif
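A simplified Python sketch of the merging step. It merges single adjacent buckets rather than runs of buckets, which is an assumption made here to keep the example short.

```python
def merge_step(h: STHistogram, m: float, t: int) -> None:
    """Merge adjacent buckets whose frequencies differ by less than m * T."""
    threshold = m * t
    merged = True
    while merged and len(h.buckets) > 1:
        merged = False
        # Frequency difference of every adjacent pair.
        diffs = [abs(h.buckets[i].freq - h.buckets[i + 1].freq)
                 for i in range(len(h.buckets) - 1)]
        i = min(range(len(diffs)), key=diffs.__getitem__)  # pair with smallest diff
        if diffs[i] < threshold:
            a, b = h.buckets[i], h.buckets[i + 1]
            h.buckets[i] = Bucket(low=a.low, high=b.high, freq=a.freq + b.freq)
            del h.buckets[i + 1]
            merged = True
```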
Restructuring algorithm – splitting step: • k = s·B, where s is the fraction of buckets chosen for splitting. • Find the set of k buckets with the highest frequencies. • Distribute the extra buckets (those freed by merging) among them: split(bi) = (freq(bi) / totalfreq) · B, where totalfreq is the sum of the frequencies of all buckets to be split and B here is the number of extra buckets. • Each split bucket's old frequency is divided evenly among the resulting split(bi) + 1 buckets.
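A Python sketch of the splitting step under the same assumptions as the previous sketches; the rounding of split(bi) to an integer is an illustrative choice, not prescribed by the slides.

```python
def split_step(h: STHistogram, s: float, extra_buckets: int) -> None:
    """Split the s*B highest-frequency buckets, distributing extra_buckets
    (freed by merging) among them in proportion to their frequencies."""
    k = max(1, int(s * len(h.buckets)))
    victims = sorted(h.buckets, key=lambda bk: bk.freq, reverse=True)[:k]
    victim_ids = {id(bk) for bk in victims}
    totalfreq = sum(bk.freq for bk in victims) or 1.0
    new_buckets = []
    for bk in h.buckets:
        if id(bk) not in victim_ids:
            new_buckets.append(bk)
            continue
        # split(bi) = (freq(bi) / totalfreq) * extra_buckets, then divide the
        # bucket's range and frequency evenly among split(bi) + 1 new buckets.
        n_split = int(round(extra_buckets * bk.freq / totalfreq))
        pieces = n_split + 1
        width = (bk.high - bk.low) / pieces
        for j in range(pieces):
            new_buckets.append(Bucket(low=bk.low + j * width,
                                      high=bk.low + (j + 1) * width,
                                      freq=bk.freq / pieces))
    h.buckets = new_buckets
```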
[Figure: restructuring example with m·T = 3 and s·B = 2 – two runs of buckets with similar frequencies are merged (merge1, merge2) and the two highest-frequency buckets are split; bucket frequencies are shown before and after restructuring.]
Multi-dimensional ST-histogram • Initialization: • Assume complete uniformity and independence, or • Use existing one-dimensional ST-histograms, assuming independence of the attributes.
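A small sketch of the second initialization option for the two-dimensional case, combining two existing one-dimensional ST-histograms under the independence assumption; the function name and the plain grid representation are illustrative.

```python
def init_2d_from_1d(hx: STHistogram, hy: STHistogram, t: int) -> list:
    """Build a 2-D grid of joint frequencies from two 1-D ST-histograms,
    assuming independence: freq(i, j) = T * (freq_x(i)/T) * (freq_y(j)/T)."""
    return [[t * (bx.freq / t) * (by.freq / t) for by in hy.buckets]
            for bx in hx.buckets]
```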
Multi-dimensional refinement of the bucket frequencies: • The refinement algorithm for multi-dimensional histograms is identical to the one-dimensional algorithm except for two changes: • Finding the buckets that overlap the selection range now requires examining a multi-dimensional structure. • The fraction of a bucket overlapping the selection range is now the volume of the overlapping region divided by the volume of the region represented by the whole bucket.
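A sketch of the volume-based overlap fraction; representing each bucket and query range as a list of (low, high) pairs, one per dimension, is an assumed format for illustration.

```python
def overlap_fraction(bucket_box: list, query_box: list) -> float:
    """Fraction of a multi-dimensional bucket covered by the selection range:
    volume of the overlapping region / volume of the whole bucket."""
    frac = 1.0
    for (b_lo, b_hi), (q_lo, q_hi) in zip(bucket_box, query_box):
        if b_hi <= b_lo:
            return 0.0
        overlap = max(0.0, min(b_hi, q_hi) - max(b_lo, q_lo))
        frac *= overlap / (b_hi - b_lo)
    return frac
```

For example, overlap_fraction([(0, 10), (0, 10)], [(5, 20), (0, 5)]) returns 0.25: the query covers half of the bucket in each of the two dimensions.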
Multi-dimensional restructuring • Merge – find the maximum difference in frequency between any two corresponding buckets of adjacent partitions along a dimension; merge the partitions if the difference is within m·T (m < 1%). • Split – the frequency of partition ji of dimension i is computed from the frequencies of the buckets it contains, and the highest-frequency partitions are split. [Figure: example with Max diff = 4, and partition frequencies Max S = 50 and Max S = 60.]
Accuracy of ST-histograms: Given the same memory limit, and because MHIST-2 stores more information per bucket, ST-histograms can afford more buckets.
Adapting to database updates • R1 – the relation before the update, with skew z = 1. • R2 – the relation after deleting 25% of its tuples and inserting an equal number of new tuples.
Conclusions: • Better than assuming uniformity and independence for all values of data skew (z). • For low data skew, ST-histograms were found to be sufficiently accurate compared with traditional histograms. • Attractive for multi-dimensional histograms, given the high cost of building them. • For high data skew, ST-histograms are much less accurate than MHIST-2. • Combining traditional histograms and ST-histograms across the range of data skew can yield the best of both approaches.