1 / 20

Self-tuning Histograms Building Histograms Without Looking at Data

Self-tuning Histograms Building Histograms Without Looking at Data . By Ashraf Aboulnage & Surajit Chaudhuri Represents By Miller Ofer. Traditional histograms. Histograms impose little cost at queries , especially in a large data base.

tannar
Download Presentation

Self-tuning Histograms Building Histograms Without Looking at Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Self-tuning HistogramsBuilding Histograms Without Looking at Data By Ashraf Aboulnage & Surajit Chaudhuri Represents By Miller Ofer

  2. Traditional histograms • Histograms impose little cost at queries , especially in a large data base. • The cost of building histograms from database is significant high and prevent us from building useful histograms . • Data modification causes to lost of histogram accuracy .

  3. Self-Tuning histogram : • Similar in structure to traditional histogram. • Builds without looking or sampling the data. • Uses “free” information from the queries results. • Can be refined in an on-line mode ( The overall cost of building self tuning histogram is very low). • Their accuracy depends on how often they are used, the more it is use , the more it is refined , the more accurate it becomes

  4. The main steps in building the ST-histogram : • Initial the histogram. • Refining the bucket values (frequencies). • Restructuring – moving the buckets boundaries.

  5. Initial The Histogram : • Data requirement : • B Number of histogram buckets • T Number of tuples • Min/max min and max values of attribute . • Assuming uniformity of the data distribution and initial each of the buckets as T/b tuples.

  6. ST-histogram after initialization tuples frequency buckets

  7. The Algorithm for refining the buckets frequencies. (second step ) begin • Finds set of k buckets that overlapping the selection range . • Let estbe the estimated results size of the selection range using histogram h . • Let act be the actual result size. • Compute the estimation error by act-est , denote by esterr . • 

  8. last overlapbucket 5. for i = 1 to k do 8. endfor end UpdateFreq proportion average assumption

  9. Refinement example : act = 60 rangehigh = 25 Rangelow = 12.5 high(b1) = 15 low(b1) = 5 35 0.5 5/25

  10. Restructuring Algorithm : • Motivation : • Frequencies are approximated by their average. Thus , high frequent value will be contain in high frequent buckets , but they may be grouped with low frequency values . • When the range of query adapts to the range of histogram bucket , no average assumption is needed.

  11. Restructuring Algorithm : Merging step • For every consecutive runs of buckets , find the maximum differences in frequency between a bucket in the first run and the buckets in the second run. • Find the minimum of all these maximum difference , denote by mindiff. • if mindiff < m*T then Merge the two runs of buckets corresponding to mindiff into one run. Look for other runs to merge .goto line 1. endif

  12. Restructuring Algorithm : Splitting step • k=s*b ; b – the rest of the buckets that haven’t been chosen . • Find the set with k highest frequency . • Compute the splitting extra bucket of each one by : • split(bi) = where totalfreq is the sum of all the bucket to be split and B is the number of extra bucks 8. Each buckets freq gets the old freq divided to split(bi)+1.

  13. Restructuring example : m*T  3 S*B  2 merge1 merge2 split split 10 13 17 14 13 11 25 70 10 30 10 1 2 3 4 5 6 7 8 9 23 17 38 25 23 23 24 10 15 15 10 1 2 3 4 5 7 8 9 6

  14. Multi-dimensional ST-histogram • Initialization : • Assuming a complete uniformity and independence. • Using existing one-dimensional ST-histograms assuming independence of the attribute .

  15. Multi-dimensional Refining the buckets frequencies : • The refining algorithm for multi-dimensional is identical to the algorithm for one-dimensional except the two following changes : • Finding the overlap selected range, now require examining a multi-dimensional structure. • The fraction of a bucket overlapping the selection range is now equal to the volume of the region divided by the volume of the region represented by the whole bucket.

  16. Multi-dimensional Restructuring • Merge – find the max difference in freq between any two corresponding buckets of the same line, merge if the difference within m*t . ( m<1% ). • Split – the frequency of partition ji in dimension i is compute by : Max S = 50 Max diff = 4 Max S = 60

  17. Accuracy of ST-histogram Due to the same memory limit and the complex of MHIST-2, st-histogram have more buckets.

  18. Adapting to Database Updates R1 – relation before update with skew =1 . R2 - Update the relation by deleting 25% of it’s tuple and inserting an equal number of tuples.

  19. Accuracy dependence of the frequent queries.

  20. Conclusions : • Better than assuming uniformity and independence for all values of data skew (z). • Forlowdata skew the st-histograms found to be sufficient accurate comparing to the traditional histograms. • Attractive for multi-dimensional histogram since the high cost of building them . • For high data skew, st-hist much less accurate than “Mhist-2” . • Combination between traditional hist and st-hist for all the range of data skew can yield the best of each concept .

More Related