Fast Incremental Maintenance of Approximate histograms :

Fast Incremental Maintenance of Approximate histograms : Phillip B. Gibbons (Intel Research Pittsburgh) Yossi Matias (Tel Aviv University) Viswanath Poosala (Bell Laboratories) Presented by: Amrita Tamrakar CSE 6392 09-feb-2006

Introduction • What is a histogram? • Issues in Histogram maintenance • Novel concept of “Backing sample” • Types of approximate histograms • Incremental maintenance of approx histograms • Challenges and solutions • Conclusion

What is a histogram? Frequency • maintained to approximate the distribution of data in the attributes • constructed by partitioning the data into mutually disjoint subsets • Frequency as y axis and the data intervals as x axis • Oracle, DB2, SQLserver, Sybase, Informix… Data value interval http://www.shodor.org/interactivate/activities/histogram/

History of Histogram Equi-width histogram Compressed histogram Learn more on Histogram

Issues on Histogram Maintenance • precomputed on underlying data • Stored in main memory , less overhead • What about the maintenance ?? • Database is modified • Query is changed(?) • Outdated histogram • Does periodic updates solve the problem? • Recomputing from the scratch • Poor estimation during the in-between period • What’s the solution ?

The solution to outdated histograms • Maintain Approximate histogram in presence of database updates • Split and merge technique for quick adjustment • “Backing sample” stored in secondary mm

Backing Sample • Only row id and the necessary attributes • At any time, backing sample = random sample • No entire table scan • Records in Consecutive disk blocks Histogram Backing sample (100KB) 2KB Main memory Relation (20GB)

How to maintain a backing sample? • During insertions • Reservoir sampling technique • Obtain sample of data from a single scan without a priori knowledge of no of tuples. • Length of random skip chosen such that each tuple is likely to be in the reservoir. 1 First n 2 MaintainBackingSample Skip random no of record and replace n n+1

How to maintain a backing sample? • During modification • Modify if tuple present in sample • During Deletion • Remove from the sample • If sample size decrease below lower bound L, then recompute from disk.

Maintain approximate Histograms : Different Classes of Histograms • Equidepth histograms • No. of tuples in each bucket is same • Contiguous ranges of attribute values Frequency of occurrence Data value

Different Classes of Histograms • Compressed (V,F) histogram • N highest frequencies stored in singleton buckets • For other values, use equi-depth histogram • Both histograms needs to store for each bucket • The largest value in the bucket B.maxval • The Count B.count • Approximate histograms are calculated from the random sample of the Relation • How to maintain these histograms?

Fast Incremental maintenance of approximate equi-depth histograms • During Insertion • Maintain a threshold (T) upper bound • If no of tuples < T, insertion will increment the bucket count. • Else recompute the histogram • Split and merge algorithm • Reduce the no. of recomputations from the sample • When bucket count reaches T, instead of recomputing split the bucket in half. • But maintain the number of bucket as fixed by merging two buckets whose total count<T

Split n merge algorithm Insert threshold

To handle modify and delete • Deletion can lower the bucket count • Maintain a Tl as lower threshold • Merge if below threshold • Split bucket with largest count Delete threshold

Fast Incremental maintenance of approximate compressed histograms • Values with high frequencies can span more than one bucket – replace by single bucket with single count –singleton buckets • Construct compressed histogram on the sample and scale it by N/k factor. During insertions • If the count doesn’t exceed threshold, add to the bucket, else update bucket boundaries

Challenges to maintain compressed histograms • New values may lead to data skew, which may lead to new singleton buckets • Values may not belong to singleton buckets if tuples increase in equi-depth buckets • Number of equi-depth buckets needs adjustment • No. of tuples in equi-depth buckets needs adjustment

Solutions to the challenges • Large number of same value will cause an equi-depth bucket to split but the adjacent boundaries will have same value, hint create singleton bucket for that value • allow singleton buckets with small counts to be merged back into equi-depth buckets. • Split and merge technique to control imbalance between equi-depth buckets and their tuples without recomputation

To handle deletion and modification • Deletion can decrease number of tuples in a bucket relative to another bucket , making a singleton bucket • can drop a bucket count to the lower threshold TL. • What to do? • Merge the pair with smallest combined count and split the bucket with largest count • Else recompute from backing sample

Conclusion • Backing sample • Incremental maintenance of equi-depth and compressed histograms • Split and merge technique to reduce access to backing sample

Use of histograms in Commercial database

Fast Incremental Maintenance of Approximate histograms :