
Adaptive Sampling Strategies for Evolving Datasets

Explore techniques for maintaining sample synopses of evolving datasets through reservoir and random pairing sampling methods. Dive into challenges, algorithms, and performance considerations.


Presentation Transcript


  1. Faculty of Computer Science, Institute System Architecture, Database Technology Group • A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets • Rainer Gemulla (University of Technology Dresden) • Wolfgang Lehner (University of Technology Dresden) • Peter J. Haas (IBM Almaden Research Center)

  2. Outline • Introduction • Deletions • Resizing • Experiments • Summary

  3. Random Sampling • Database applications • huge data sets • complex algorithms (space & time) • Requirements • performance, performance, performance • Random sampling • approximate query answering • data mining • data stream processing • query optimization • data integration

  4. The Problem Space • Setting • arbitrary data sets • samples of the data • evolving data • Scope of this talk • maintenance of random samples • Can we minimize or even avoid access to base data?

  5. Types of Data Sets • Data sets • variation of data set size • influence on sampling • Stable: goal is a stable sample • Growing: goal is a controlled growing sample • Shrinking: uninteresting

  6. Uniform Sampling • Uniform sampling • all samples of the same size are equally likely • many statistical procedures assume uniformity • flexibility • Example • a data set (also called population) • possible samples of size 2
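
The example figure for this slide did not survive in the transcript. As a minimal sketch of what uniformity means, the following enumerates every size-2 sample of a small population and assigns each the same probability. The population `{1, 2, 3}` and the helper name `uniform_samples` are assumptions chosen for illustration, not taken from the slides.

```python
from fractions import Fraction
from itertools import combinations


def uniform_samples(population, size):
    """Map every possible sample of the given size to its probability
    under uniform sampling (all same-size samples equally likely)."""
    samples = [frozenset(c) for c in combinations(sorted(population), size)]
    prob = Fraction(1, len(samples))
    return {s: prob for s in samples}


# Hypothetical population; the slide's actual data set is not recoverable.
dist = uniform_samples({1, 2, 3}, 2)
for sample, prob in sorted(dist.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(sample), prob)
```

Any maintenance scheme discussed later has to preserve this property: after each insertion or deletion, all samples of the current size must remain equally likely.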

  7. Reservoir Sampling • Reservoir sampling • computes a uniform sample of M elements • building block for many sophisticated sampling schemes • single-scan algorithm • add the first M elements • afterwards, flip a coin • ignore the element (reject) • replace a random element in the sample (accept) • accept probability of the ith element: M/i
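
The single-scan procedure on this slide can be sketched as follows. This is a minimal illustration of classic reservoir sampling, assuming the standard accept probability M/i for the i-th element; the function name `reservoir_sample` and the `rng` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, m, rng=random):
    """Single-scan uniform sample of up to m elements from an iterable."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            # Add the first M elements unconditionally.
            sample.append(item)
        elif rng.random() < m / i:
            # Accept: replace a randomly chosen element of the sample.
            sample[rng.randrange(m)] = item
        # Otherwise reject: the element is ignored.
    return sample


print(reservoir_sample(range(1, 101), 5, random.Random(42)))
```

Note that the sketch only handles insertions, which is exactly the limitation the following slides address.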

  8. Reservoir Sampling (Example) • Example • sample size M = 2

  9. Problems with Reservoir Sampling • Problems with reservoir sampling • lacks support for deletions (stable data sets) • cannot efficiently enlarge sample (growing data sets)

  10. Outline • Introduction • Deletions • Resizing • Experiments • Summary

  11. Naïve/Prior Approaches (algorithm: technique; comments)
      • RS with deletions: conduct deletions, continue with smaller sample; unstable
      • Naïve: use insertions to immediately refill the sample; not uniform
      • Backing sample: let sample size decrease, but occasionally recompute; expensive, unstable
      • CAR(WOR): immediately sample from base data to refill the sample; stable but expensive
      • Bernoulli sampling with purging: "coin flip" sampling with deletions, purge if too large; inexpensive but unstable
      • Passive sampling: developed for data streams (sliding windows only); special case of our RP algorithm
      • Distinct-value sampling: tailored for multiset populations; expensive, low space efficiency in our setting

  12. Random Pairing • Random pairing • compensates for deletions with arriving insertions • corrects inclusion probabilities • General idea (insertion) • no uncompensated deletions → reservoir sampling • otherwise, • randomly select an uncompensated deletion (partner) • compensate it: Was it in the sample? • yes → add arriving element to sample • no → ignore arriving element

  13. Random Pairing • Example

  14. Random Pairing • Details of the algorithm • keeping history of deleted items is expensive, but: • maintenance of two counters suffices • correctness proof is in the paper
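
The two-counter variant mentioned above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the class name `RandomPairingSample` and the counter names `c1` (uncompensated deletions that were in the sample) and `c2` (uncompensated deletions that were not) are hypothetical, items are assumed distinct and hashable, and only items currently in the data set are deleted. Picking a random partner among the uncompensated deletions is modeled by accepting the arriving element with probability c1/(c1+c2).

```python
import random


class RandomPairingSample:
    """Bounded-size sample maintained under insertions and deletions (sketch)."""

    def __init__(self, m, rng=None):
        self.m = m              # upper bound M on the sample size
        self.sample = set()
        self.size = 0           # current size of the data set
        self.c1 = 0             # uncompensated deletions that were in the sample
        self.c2 = 0             # uncompensated deletions that were not
        self.rng = rng or random.Random()

    def insert(self, item):
        self.size += 1
        d = self.c1 + self.c2
        if d == 0:
            # No uncompensated deletions: ordinary reservoir step.
            if len(self.sample) < self.m:
                self.sample.add(item)
            elif self.rng.random() < self.m / self.size:
                victim = self.rng.choice(tuple(self.sample))
                self.sample.remove(victim)
                self.sample.add(item)
        elif self.rng.random() < self.c1 / d:
            # Partner was in the sample: the arriving element takes its place.
            self.c1 -= 1
            self.sample.add(item)
        else:
            # Partner was not in the sample: ignore the arriving element.
            self.c2 -= 1

    def delete(self, item):
        self.size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1


rp = RandomPairingSample(3, random.Random(1))
for x in range(10):
    rp.insert(x)
rp.delete(next(iter(rp.sample)))   # sample temporarily shrinks to 2
rp.insert(10)                      # compensates the deletion
print(sorted(rp.sample), rp.c1, rp.c2)
```

No base data is touched at any point: the sample shrinks temporarily after a deletion and is refilled only by later insertions.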

  15. Outline • Introduction • Deletions • Resizing • Experiments • Summary

  16. Growing Data Sets • The problem • growing data set • Figure panels: data set (growing data set), random pairing (stable sample) • sampling fraction decreases

  17. A Negative Result • Negative result • There is no resizing algorithm which can enlarge a bounded-size sample without ever accessing base data. • Example • data set • samples of size 2 • new data set • samples of size 3 • Not uniform!
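
The slide's example figure is missing from the transcript. One way to read it, sketched here with hypothetical data: start from a size-2 sample of `{1, 2, 3}`, insert element `4`, and ask which size-3 samples can be produced when the only element available to add is the newly arrived one.

```python
from itertools import combinations

# Hypothetical data; the slide's actual example did not survive extraction.
old_data = {1, 2, 3}
new_item = 4

old_samples = [frozenset(c) for c in combinations(sorted(old_data), 2)]

# Without base data access, a sample can only grow by the arriving element.
reachable = {s | {new_item} for s in old_samples}

# A uniform scheme must be able to produce every size-3 subset of the new data set.
required = {frozenset(c) for c in combinations(sorted(old_data | {new_item}), 3)}

unreachable = required - reachable
print(sorted(sorted(s) for s in unreachable))  # [[1, 2, 3]]
```

The sample `{1, 2, 3}` would need probability 1/4 under uniformity but can never be produced, because the third old element is no longer available without going back to the base data.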

  18. Resizing • Goal • efficiently increase sample size • stay within an upper bound at all times • General idea • convert sample to Bernoulli sample • continue Bernoulli sampling until new sample size is reached • convert back to reservoir sample • Optimally balance cost • cost of base data accesses (in step 1) • time to reach new sample size (in step 2)

  19. Resizing • Bernoulli sampling • uniform sampling scheme • each tuple is added to the sample with probability q • sample size follows binomial distribution → no effective upper bound • Phase 1: Conversion to a Bernoulli sample • given q, randomly determine sample size • reuse reservoir sample to create Bernoulli sample • subsample • sample additional tuples (base data access) • choice of q • small → fewer base data accesses • large → more base data accesses

  20. Resizing • Phase 2: Run Bernoulli sampling • accept new tuples with probability q • conduct deletions • stop as soon as new sample size is reached • Phase 3: Revert to Reservoir sampling • switchover is trivial • Choosing q • determines cost of Phase 1 and Phase 2 • goal: minimize total cost • base data access expensive → small q • base data access cheap → large q • details in paper
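
Phases 1 and 2 can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the paper's algorithm: the function names `to_bernoulli` and `run_bernoulli` are hypothetical, the base data is assumed to fit in a list of distinct items, the binomial draw is done by naive coin flips, and the case where Phase 1 already exceeds the target size is not handled. The choice of q is left to the caller.

```python
import random


def to_bernoulli(sample, base_data, q, rng):
    """Phase 1: convert a uniform fixed-size sample into a Bernoulli(q) sample.

    Returns the new sample and the number of base data accesses."""
    sample = list(sample)
    # The size of a Bernoulli(q) sample over N tuples is Binomial(N, q).
    u = sum(rng.random() < q for _ in range(len(base_data)))
    if u <= len(sample):
        # Subsample the existing sample; no base data access needed.
        return set(rng.sample(sample, u)), 0
    in_sample = set(sample)
    rest = [t for t in base_data if t not in in_sample]
    extra = rng.sample(rest, u - len(sample))   # base data access
    return in_sample | set(extra), len(extra)


def run_bernoulli(sample, ops, q, target, rng):
    """Phase 2: process ('+', t) / ('-', t) operations until the target size is reached."""
    sample = set(sample)
    for op, t in ops:
        if len(sample) >= target:
            break                    # Phase 3: continue with reservoir sampling
        if op == "+":
            if rng.random() < q:     # accept new tuple with probability q
                sample.add(t)
        else:
            sample.discard(t)        # conduct deletions
    return sample


rng = random.Random(7)
base = list(range(1000))
reservoir = set(rng.sample(base, 100))
bern, accesses = to_bernoulli(reservoir, base, 0.12, rng)
grown = run_bernoulli(bern, [("+", x) for x in range(1000, 5000)], 0.12, 200, rng)
print(len(bern), accesses, len(grown))
```

The trade-off from the slide is visible in the two return values: a larger q makes Phase 1 fetch more tuples from the base data, while a smaller q makes Phase 2 consume more arriving operations before the target size is reached.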

  21. Resizing • Example • resize by 30% if sampling fraction drops below 9% • dependent on costs of accessing base data • Low costs: immediate resizing • Moderate costs: combined solution • High costs: degenerates to Bernoulli sampling

  22. Outline • Introduction • Deletions • Resizing • Experiments • Summary

  23. Total Cost • Total cost • stable dataset, 10M operations • sample size 100k, data access 10 times more expensive than sample access • Figure groups: base data access, no base data access

  24. Sample Size • Sample size • stable dataset, size 1M • sample size 100k • Figure groups: base data access, no base data access

  25. Outline • Introduction • Deletions • Resizing • Experiments • Summary

  26. Summary • Reservoir Sampling • lacks support for deletions • complete recomputation to enlarge the sample • Random Pairing • uses arriving insertions to compensate for deletions • Resizing • base data access cannot be avoided • minimizes total cost • Future work • better q for resizing • combine with existing techniques [4,8,17] to enhance flexibility, scalability

  27. Thank you! Questions?

  28. Backup: Bounded-Size Sampling • Why sampling? • performance, performance, performance • How much to sample? • influencing factors • storage consumption • response time • accuracy • choosing the sample size / sampling fraction • largest sample that meets storage requirements • largest sample that meets response time requirements • smallest sample that meets accuracy requirements

  29. Backup: Bounded-Size Sampling • Example • random pairing vs. Bernoulli sampling (BS) • average estimation • Figure panels: data set, sample size, standard error • BS violates 1, 2 • BS violates 3

  30. Backup: Distinct-Value Sampling • Distinct-value sampling (optimistic setting for DV) • DV-scheme knows avg. dataset size in advance • assume no storage for counters & hash functions • Figure panels: sample size, execution time • RP has better memory utilization • RP is significantly faster

  31. Backup: RS With Deletions • Reservoir sampling with deletions • conduct deletions, continue with smaller sample size

  32. Backup: Backing Sample • Evaluation • data set consists of 1 million elements (on average) • 100k sample, clustered insertions/deletions • Data set: stable • Reservoir sampling: sample is empty eventually • Backing sample: expensive, unstable

  33. Backup: An Incorrect Approach • Idea • use arriving insertions to refill the sample • Not uniform!
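
Why immediate refilling breaks uniformity can be checked by exact enumeration on a tiny case. The data and the precise refill rule below are assumptions for illustration (the slide's figure is missing): start with a uniform size-2 sample of `{1, 2, 3}`, delete `3`, then insert `4`, refilling the sample directly when it is below M and otherwise applying a normal reservoir step.

```python
from collections import defaultdict
from fractions import Fraction

M = 2
third = Fraction(1, 3)

# Uniform size-2 sample of the hypothetical data set {1, 2, 3}.
dist = {frozenset({1, 2}): third, frozenset({1, 3}): third, frozenset({2, 3}): third}

# Delete element 3 from the data set and from any sample that holds it.
after_delete = defaultdict(Fraction)
for s, p in dist.items():
    after_delete[s - {3}] += p

# Insert element 4 (data set becomes {1, 2, 4}) using the naive rule.
after_insert = defaultdict(Fraction)
for s, p in after_delete.items():
    if len(s) < M:
        after_insert[s | {4}] += p              # refill immediately
    else:
        accept = Fraction(M, 3)                 # reservoir step with probability M/|R|
        after_insert[s] += p * (1 - accept)
        for victim in s:
            after_insert[(s - {victim}) | {4}] += p * accept / len(s)

for s, p in sorted(after_insert.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(s), p)
# [1, 2] 1/9
# [1, 4] 4/9
# [2, 4] 4/9
```

A uniform scheme would give each of the three samples probability 1/3. The naive rule over-represents the newly inserted element, which is the bias that random pairing corrects.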

  34. Backup: Random Pairing • Evaluation • data set consists of 1 million elements (on average) • 100k sample, clustered insertions/deletions • Data set: stable • Reservoir sampling: sample becomes empty eventually • Random pairing: no base data access!

  35. Backup: Average Sample Size • Average sample size • stable dataset, 10M operations • sample size 100k

  36. Backup: Average Sample Size With Clustered Insertions/Deletions • Average sample size with clustered insertions/deletions • stable dataset, size 10M, ~8M operations • sample size 100k

  37. Backup: Cost • Cost • stable dataset, 10M operations • sample size 100k

  38. Backup: Cost With Clustered Insertions/Deletions • Cost with clustered insertions/deletions • stable dataset, size 10M, ~8M operations • sample size 100k

  39. Backup: Resizing (Value of q) • Resizing • enlarge sample from 100k to 200k • base data access 10ms, arrival rate 1ms
