Faculty of Computer Science, Institute System Architecture, Database Technology Group
Deferred Maintenance of Disk-Based Random Samples
Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)
Outline
• Introduction
• Logging Schemes
• Refresh Algorithms
• Performance
• Summary & Outlook
Random Sampling
• Analytical databases
  • huge data sets
  • complex algorithms
• Requirements
  • Performance, performance, performance!
• Random sampling
  • approximate query answering
  • data mining
  • data stream processing
  • query optimization
  • data integration
Offline Sampling
• Precomputed samples
  • pros
    • avoid access to base data
    • used multiple times
    • arbitrary base data
    • versatile
  • cons
    • maintenance!!!
• Disk-based samples
  • many, large samples stored on disk
  • crash safe
  • typically space-restricted
  • challenges
    • sequential access is faster
    • blocking of data
Basics: Reservoir Sampling
• Sampling with space constraints
  • maintain a sample (reservoir) of M tuples
  • add the first M tuples
  • afterwards, roll a die for each new tuple
    • ignore the tuple (reject)
    • replace a random tuple in the sample (accept)
  • accept probability controls sampling scheme
  • building block for many sophisticated sampling schemes
• Example
  • dataset with 50 tuples (M=5)
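The reservoir scheme above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `reservoir_sample` is hypothetical, and the accept probability M/t (for the t-th tuple) is an assumption that yields a uniform sample; other accept probabilities give other sampling schemes.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Keep a sample (reservoir) of at most m tuples from an iterable."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= m:
            # The first m tuples are always added.
            reservoir.append(item)
        elif rng.random() < m / t:
            # Accept (assumed probability m/t): replace a random tuple.
            reservoir[rng.randrange(m)] = item
        # Otherwise the tuple is rejected and the sample is unchanged.
    return reservoir


# The slide's example: a dataset with 50 tuples and M = 5.
print(reservoir_sample(range(1, 51), 5, random.Random(42)))
```

Each accept replaces a tuple at a random position, which is what turns into random I/O once the sample lives on disk.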
Evolution of the Sample
• Random I/O!!!
Full Logging
• Full Log
  • track all changes
  • log is written sequentially
  • log contains more information than needed
Candidate Logging
• Candidate log
  • track only changes which affect the sample
  • log is written sequentially
  • smaller logs
• How to implement Candidate Refresh?
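A candidate log can be pictured as running the reservoir accept test at insertion time, but appending accepted tuples to a log instead of touching the sample. The sketch below is only an illustration under assumptions: the function name `log_candidates`, its parameters, and the accept probability M/t are hypothetical choices, and the log is modeled as an in-memory list rather than a file.

```python
import random


def log_candidates(new_tuples, m, seen_so_far, rng):
    """Return (candidate_log, tuples_seen) for a batch of insertions.

    seen_so_far is the number of tuples processed before this batch;
    it is assumed to be at least m, so the sample is already full.
    """
    candidate_log = []
    t = seen_so_far
    for item in new_tuples:
        t += 1
        if rng.random() < m / t:
            # Only accepted tuples (candidates) are appended, sequentially.
            candidate_log.append(item)
        # A full log would append every tuple here instead.
    return candidate_log, t


batch = list(range(1000, 2000))
log, seen = log_candidates(batch, m=5, seen_so_far=50, rng=random.Random(3))
print(len(batch), len(log))
```

The position in the sample is deliberately not decided here; choosing it is left to the refresh step.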
Naive Refresh
• Naive refresh
  • scan log file sequentially
  • write each element of the log to a random position in the sample
• No improvement at all!
  • random access to sample
  • some elements are written more than once
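A minimal sketch of the naive refresh, assuming the sample and the candidate log are plain Python lists (the function name `naive_refresh` is hypothetical). It returns the number of writes so the redundancy is visible.

```python
import random


def naive_refresh(sample, candidate_log, rng):
    """Apply the log front to back; every candidate causes one random write."""
    writes = 0
    for candidate in candidate_log:
        pos = rng.randrange(len(sample))  # random access to the sample
        sample[pos] = candidate           # may overwrite an earlier candidate
        writes += 1
    return writes


sample = ["old"] * 5
log = list(range(11))
writes = naive_refresh(sample, log, random.Random(0))
survivors = sum(x != "old" for x in sample)
print(writes, survivors)
```

With 11 candidates and M = 5, at most 5 of the 11 writes can survive, so the remaining writes were wasted work.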
Avoiding Multiple Writes
• Observation
  • each candidate can be overwritten by subsequent candidates only
  • last candidate is never overwritten
• Approach
  • scan log in reverse order
  • write only tuples which have not been written before
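The reverse-scan approach can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the name `reverse_refresh` is hypothetical, and the set of already written positions is held in memory for simplicity.

```python
import random


def reverse_refresh(sample, candidate_log, rng):
    """Scan the log back to front; write each sample position at most once."""
    written = set()  # positions already taken by a later candidate
    for candidate in reversed(candidate_log):
        pos = rng.randrange(len(sample))
        if pos in written:
            # A later candidate would have overwritten this one: skip it.
            continue
        sample[pos] = candidate
        written.add(pos)
        if len(written) == len(sample):
            break  # every position is final; older candidates cannot survive
    return len(written)


sample = ["old"] * 5
log = list(range(11))
print(reverse_refresh(sample, log, random.Random(0)), sample)
```

Every write is now final, but this version still accesses the sample randomly and needs memory for the `written` set.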
Avoiding Multiple Writes
• Probability of overwrites
• In general
  • $k$ tuples written to sample ($k = 0, \dots, 5$)
  • probability that the next tuple (in reverse order) overwrites an old sample tuple, i.e. is written: $p_k = (M-k)/M$
  • number of skipped tuples: $P(X_k = x) = (1 - p_k)^x \, p_k$ ($k > 0$)
  • $X_5 = \infty$ (no further writes, since $p_5 = 0$)
  • here: $X_1 = 0$, $X_2 = 1$, $X_3 = 1$, $X_4 = 6$
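Because the skip counts follow a geometric distribution, they can be drawn directly instead of testing each tuple. The sketch below uses inverse-transform sampling, which is one standard way to do this and not necessarily the authors' method; the function name `skip_count` is hypothetical.

```python
import math
import random


def skip_count(k, m, rng):
    """Draw X_k with P(X_k = x) = (1 - p_k)**x * p_k, p_k = (m - k) / m.

    Valid for 0 <= k < m; for k = m no further tuple is ever written.
    """
    p = (m - k) / m
    if p >= 1.0:
        return 0  # k = 0: the last candidate is always written
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p))


rng = random.Random(1)
draws = [skip_count(4, 5, rng) for _ in range(20000)]
print(sum(draws) / len(draws))
```

The mean of this distribution is $(1 - p_k)/p_k = k/(M-k)$, so for M = 5 and k = 4 the printed average should be close to 4.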
Nomem Refresh
• Nomem Refresh (Phase 1)
  • dry run: generate $X_4, \dots, X_1$ in advance
  • reset pseudo-random number generator and generate same sequence again
  • start at: |C|-X indexes of log file are generated
Nomem Refresh
• Naive update of sample
  • read the log tuples at the generated indexes
  • write each tuple to a random (free) position in the sample
  • drawbacks
    • free positions have to be maintained
    • random access to the sample
Nomem Refresh
• Nomem Refresh (Phase 2)
  • general idea: order of the tuples in sample is unimportant
  • algorithm
    • (re-)generate next position in the log (6, 8, 10, 11)
    • generate next position in the sample (1, 2, 3, 5)
    • read from log, write to sample
Nomem Refresh
• Properties
  • log file is read sequentially
  • sample is written sequentially
  • no overwrites
  • no memory consumption
  • works on full logs as well (DBMS!)
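One possible reading of the two phases is sketched below; it is an interpretation of the slides, not the authors' implementation. Assumptions: the log and the sample are in-memory lists standing in for files, the names `nomem_refresh` and `skip_count` are hypothetical, a second generator seeded with `seed + 1` picks the sample positions, and those positions are generated in ascending order by selection sampling. Resetting the generator means re-creating it from the same seed, so the skip sequence can be replayed instead of stored.

```python
import math
import random


def skip_count(k, m, rng):
    """Geometric skip count X_k with success probability (m - k) / m."""
    p = (m - k) / m
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return int(math.log(u) / math.log(1.0 - p))


def nomem_refresh(sample, log, seed):
    """Refresh `sample` from candidate `log`; returns the number of writes."""
    m, c = len(sample), len(log)
    if c == 0:
        return 0
    # Phase 1, dry run: draw X_{m-1}, ..., X_1 and keep only their sum.
    rng = random.Random(seed)
    total = sum(skip_count(k, m, rng) for k in range(m - 1, 0, -1))
    # Reset the generator and replay the same sequence, dropping the
    # largest-index skips until the w writes plus their skips fit the log.
    rng = random.Random(seed)
    w = m
    while w + total > c:
        total -= skip_count(w - 1, m, rng)
        w -= 1
    log_pos = c - w - total  # 0-based index of the first candidate to read
    # Phase 2: walk log and sample forward; rng now yields X_{w-1}, ..., X_1.
    pos_rng = random.Random(seed + 1)  # assumed separate position generator
    needed, k = w, w - 1
    for s in range(m):
        if needed == 0:
            break
        # Selection sampling: pick slot s with probability needed / (m - s).
        if pos_rng.random() * (m - s) < needed:
            sample[s] = log[log_pos]
            needed -= 1
            if needed:
                log_pos += 1 + skip_count(k, m, rng)
                k -= 1
    return w


sample = [f"old{i}" for i in range(5)]
log = [f"c{i}" for i in range(11)]
print(nomem_refresh(sample, log, seed=7), sample)
```

Both the log indexes and the sample positions only ever increase, and nothing but a few counters is kept in memory; the price is that the skip sequence is generated more than once.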
Experiments
• Number of operations & execution time
  • sample size: 1 million tuples
  • refresh period: 1 million operations
Experiments
• Refresh period & execution time
  • sample size: 1 million tuples
  • number of operations: 100 million
Summary & Outlook
• Logging schemes
  • full logs: often found in database systems
  • candidate logs: reduce log file size
• Nomem Refresh
  • fast incremental refresh
  • sequential disk access only
  • no memory consumption
  • works with full and candidate logs
• Future work
  • more detailed discussion of updates & deletions
Thank you! Questions?
Extensions
• Extensions
  • Nomem Refresh for full logs (DBMS!)
    • dry run: compute candidates, count their number
    • reset random number generator
    • add skips of Nomem Refresh and Reservoir Sampling
  • deletions and updates
    • store deletions and updates separately
    • process delete and update log first
    • run Nomem Refresh on the insert log
    • requires disjoint logs
Experiments
• Comparison with the Geometric File
  • sample size: 1 million tuples
  • number of operations: 100 million
Experiments
• Computational overhead
  • sample size: 1 million tuples