  1. Faculty of Computer Science, Institute System Architecture, Database Technology Group • Deferred Maintenance of Disk-Based Random Samples • Rainer Gemulla (University of Technology Dresden) • Wolfgang Lehner (University of Technology Dresden)

  2. Outline • Introduction • Logging Schemes • Refresh Algorithms • Performance • Summary & Outlook

  3. Random Sampling • Analytical databases • huge data sets • complex algorithms • Requirements • Performance, performance, performance! • Random sampling • approximate query answering • data mining • data stream processing • query optimization • data integration

  4. Offline Sampling • Precomputed samples • pros • avoid access to base data • used multiple times • arbitrary base data • versatile • cons • maintenance!!! • Disk-based samples • many, large samples → stored on disk • crash safe • typically space-restricted • challenges • sequential access is faster than random access • blocking of data

  5. Basics: Reservoir Sampling • Sampling with space constraints • maintain a sample (reservoir) of M tuples • add the first M tuples • afterwards, roll a die for each new tuple • ignore the tuple (reject) • replace a random tuple in the sample (accept) • acceptance probability controls the sampling scheme • building block for many sophisticated sampling schemes • Example • dataset with 50 tuples (M=5)
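
The reservoir scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the classic acceptance probability M/n for the n-th tuple (which yields a uniform sample), and the function name `reservoir_sample` and the in-memory list standing in for the sample are hypothetical.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Maintain a uniform random sample of at most M tuples from `stream`."""
    if rng is None:
        rng = random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= M:
            sample.append(item)               # add the first M tuples
        elif rng.random() < M / n:            # accept with probability M/n (assumed)
            sample[rng.randrange(M)] = item   # replace a random tuple
        # otherwise the tuple is rejected
    return sample

# The slide's example: a dataset with 50 tuples and M = 5.
print(reservoir_sample(range(50), 5, random.Random(42)))
```

Every accepted tuple overwrites a random slot, which is harmless in memory but costly once the sample lives on disk.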

  6. Evolution of the Sample • replacing random tuples in a disk-based sample causes random I/O!!!

  7. Outline • Introduction • Logging Schemes • Refresh Algorithms • Performance • Summary & Outlook

  8. Full Logging • Full log • track all changes • log is written sequentially • log contains more information than needed

  9. Candidate Logging • Candidate log • track only changes which affect the sample • log is written sequentially • smaller logs • How to implement Candidate Refresh?
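
The difference between the two logging schemes can be sketched as follows. This is an illustrative sketch under assumptions: insertions only, a dataset that already holds at least M tuples, and the reservoir acceptance probability M/n. The name `log_inserts` and its parameters are hypothetical.

```python
import random

def log_inserts(inserts, M, n_start, rng, candidate=True):
    """Append inserts to a log instead of touching the on-disk sample.

    n_start is the dataset size before these inserts (assumed >= M).
    """
    log = []
    n = n_start
    for item in inserts:
        n += 1
        if not candidate:
            log.append(item)        # full log: track all changes
        elif rng.random() < M / n:
            log.append(item)        # candidate log: accepted tuples only
    return log

inserts = list(range(1000))
full = log_inserts(inserts, 5, 50, random.Random(0), candidate=False)
cand = log_inserts(inserts, 5, 50, random.Random(0), candidate=True)
print(len(full), len(cand))
```

In both cases the log is only appended to, so writes stay sequential; the candidate log simply makes the accept/reject decision at logging time.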

  10. Outline • Introduction • Logging Schemes • Refresh Algorithms • Performance • Summary & Outlook

  11. Naive Refresh • Naive refresh • scan log file sequentially • write each element of the log to a random position in the sample • No improvement at all! • random access to sample • some elements are written more than once
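
A minimal sketch of the naive refresh, with Python lists standing in for the log file and the on-disk sample (the name `naive_refresh` is hypothetical):

```python
import random

def naive_refresh(sample, log, rng):
    """Apply a candidate log in order; return the number of sample writes."""
    writes = 0
    for item in log:                                  # sequential scan of the log
        sample[rng.randrange(len(sample))] = item     # random access to the sample
        writes += 1
    return writes

sample = list(range(5))
log = list(range(100, 111))
print(naive_refresh(sample, log, random.Random(3)), sample)
```

The write count always equals the log length, even though at most M of those writes can survive in the final sample.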

  12. Avoiding Multiple Writes • Observation • each candidate can be overwritten by subsequent candidates only • last candidate is never overwritten • Approach • scan log in reverse order • write only tuples which have not been written before
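
The reverse scan can be sketched as below. This is an illustration only: the `written` set is bookkeeping introduced here to remember which sample positions are already final, and the name `reverse_refresh` is hypothetical.

```python
import random

def reverse_refresh(sample, log, rng):
    """Scan the log backwards; write each sample position at most once."""
    written = set()                      # positions that already hold their final tuple
    for item in reversed(log):
        if len(written) == len(sample):
            break                        # the whole sample has been replaced
        pos = rng.randrange(len(sample))
        if pos not in written:           # otherwise a later candidate overwrote this one
            sample[pos] = item
            written.add(pos)
    return len(written)

sample = list(range(5))
log = list(range(100, 111))
print(reverse_refresh(sample, log, random.Random(3)), sample)
```

This removes redundant writes, but it still accesses the sample randomly and needs memory for the set of written positions.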

  13. Avoiding Multiple Writes • Probability of overwrites • In general • $k$ tuples written to sample ($k = 0, \dots, 5$) • probability that the next candidate (in reverse order) is written rather than skipped: $p_k = (M-k)/M$ • number of skipped tuples: $P(X_k = x) = (1-p_k)^x \, p_k$ ($k > 0$) • $X_5 = \infty$ (since $p_5 = 0$, all remaining candidates are skipped) • here: $X_1 = 0$, $X_2 = 1$, $X_3 = 1$, $X_4 = 6$
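
The skip counts follow a geometric distribution, so they can be drawn directly without looking at the sample. The sketch below reads $p_k = (M-k)/M$ as the probability that a candidate lands on a position not yet written; under that reading the expected skip is $E[X_k] = (1-p_k)/p_k = k/(M-k)$. The function name `skip_count` is hypothetical.

```python
import random

def skip_count(k, M, rng):
    """Draw X_k: candidates skipped after k sample writes (requires k < M)."""
    if not 0 <= k < M:
        raise ValueError("k must satisfy 0 <= k < M")
    p = (M - k) / M              # probability that the next candidate is written
    x = 0
    while rng.random() >= p:     # candidate hits an already written position
        x += 1
    return x

rng = random.Random(0)
draws = [skip_count(4, 5, rng) for _ in range(20000)]
print(sum(draws) / len(draws))   # should be close to k/(M-k) = 4
```

The loop is the simplest correct way to draw the value; an inverse-transform formula would avoid the loop for large M.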

  14. Nomem Refresh • Nomem Refresh (Phase 1) • dry run: generate $X_4, \dots, X_1$ in advance • reset pseudo-random number generator and generate same sequence again • start at: $|C| - X$ → indexes of log file are generated

  15. Nomem Refresh • Naive update of sample • read the tuples at the generated indexes of the log • write each tuple to a random (free) position in the sample • drawbacks • free positions have to be maintained • random access to the sample

  16. Nomem Refresh • Nomem Refresh (Phase 2) • general idea: order of the tuples in sample is unimportant • algorithm • (re-)generate next position in the log (6, 8, 10, 11) • generate next position in the sample (1, 2, 3, 5) • read from log, write to sample

  17. Nomem Refresh • Properties • log file is read sequentially • sample is written sequentially • no overwrites • no memory consumption • works on full logs as well (DBMS!)
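
Both phases can be put together in a short sketch. This is a reconstruction under stated assumptions, not the authors' implementation: the log is a candidate log, indexes are 0-based, `seed` is an integer, and the ascending sample positions are produced with standard selection sampling because the slides do not say how they are generated. All function names are hypothetical.

```python
import random

def geometric(p, rng):
    """Failures before the first success, success probability 0 < p <= 1."""
    x = 0
    while rng.random() >= p:
        x += 1
    return x

def log_positions(num_candidates, M, seed):
    """Yield, in ascending order, the log indexes of the surviving candidates."""
    # Phase 1 (dry run): total distance covered by the skips X_{M-1}, ..., X_1.
    rng = random.Random(seed)
    total = sum(geometric((M - k) / M, rng) + 1 for k in range(M - 1, 0, -1))
    # Reset the generator and replay the same skips, now walking forwards.
    rng = random.Random(seed)
    last = num_candidates - 1
    for k in range(M - 1, 0, -1):
        if last - total >= 0:            # index falls inside the log
            yield last - total
        total -= geometric((M - k) / M, rng) + 1
    if last >= 0:
        yield last                       # the last candidate is never overwritten

def sample_positions(w, M, rng):
    """Yield w distinct sample positions in ascending order (selection sampling)."""
    chosen = 0
    for pos in range(M):
        if rng.random() * (M - pos) < w - chosen:
            chosen += 1
            yield pos

def nomem_refresh(sample, log, seed):
    """Phase 2: read the log forwards and write the sample forwards."""
    M = len(sample)
    w = sum(1 for _ in log_positions(len(log), M, seed))    # count only, O(1) memory
    slots = sample_positions(w, M, random.Random(seed + 1))
    for src, dst in zip(log_positions(len(log), M, seed), slots):
        sample[dst] = log[src]
    return w

sample = list(range(5))
log = list(range(100, 111))              # 11 candidates, as in the slide's example
print(nomem_refresh(sample, log, 7), sample)
```

Only a handful of counters are kept in memory; replaying the pseudo-random sequence from the same seed takes the place of storing the generated skips.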

  18. Outline • Introduction • Logging Schemes • Refresh Algorithms • Performance • Summary & Outlook

  19. Experiments • Number of operations & execution time • sample size: 1 million tuples • refresh period: 1 million operations

  20. Experiments • Refresh period & execution time • sample size: 1 million tuples • number of operations: 100 million

  21. Outline • Introduction • Logging Schemes • Refresh Algorithms • Performance • Summary & Outlook

  22. Summary & Outlook • Logging schemes • full logs: often found in database systems • candidate logs: reduce log file size • Nomem Refresh • fast incremental refresh • sequential disk access only • no memory consumption • works with full and candidate logs • Future work • more detailed discussion of updates & deletions

  23. Thank you! Questions?

  24. Extensions • Extensions • Nomem Refresh for full logs (DBMS!) • dry run: compute candidates, count their number • reset random number generator • add skips of Nomem Refresh and Reservoir Sampling • deletions and updates • store deletions and updates separately • process delete and update log first • run Nomem Refresh on the insert log • requires disjoint logs

  25. Experiments • Comparison with the Geometric File • sample size: 1 million tuples • number of operations: 100 million

  26. Experiments • Computational overhead • sample size: 1 million tuples
