1 / 14

On Synopses for Distinct-Value Estimation Under Multiset Operations

This research paper presents innovative techniques for estimating distinct values in data processing scenarios, focusing on scalability and accuracy. Key topics include KMV synopses, compound partitions, and unbiased estimators for better accuracy in DV estimation. The study explores practical applications like network monitoring and query optimization. Evidence-backed empirical evaluations showcase improved performance and effectiveness in handling compound partitions and deletions, contributing significant theoretical insights to the domain.

hollande
Download Presentation

On Synopses for Distinct-Value Estimation Under Multiset Operations

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. On Synopses for Distinct-Value Estimation Under Multiset Operations Kevin Beyer Peter J. Haas Berthold Reinwald Yannis Sismanis IBM Almaden Research Center Rainer Gemulla Technische Universität Dresden

  2. Introduction • Estimating # Distinct Values (DV) crucial for: • Data integration & cleaning • E.g. schema discovery, duplicate detection • Query optimization • Network monitoring • Materialized view selection for datacubes • Exact DV computation is impractical • Sort/scan/count or hash table • Problem: bad scalability • Approximate DV “synopses” • 25 year old literature • Hashing-based techniques SIGMOD 07

  3. Motivation: A Synopsis Warehouse Full-Scale Warehouse Of Data Partitions Synopsis • Goal: discover partition characteristics & relationships to other partitions • Keys, functional dependencies, similarity metrics (Jaccard) • Similar to Bellman [DJMS02] • Accuracy challenge: small synopses sizes, many distinct values Synopsis Synopsis Warehouse of Synopses S1,1 S1,2 Sn,m combine etc S*,* S1-2,3-7 SIGMOD 07

  4. Outline • Background on KMV synopsis • An unbiased low-variance DV estimator • Optimality • Asymptotic error analysis for synopsis sizing • Compound Partitions • Union, intersection, set difference • Multiset Difference: AKMV synopses • Deletions • Empirical Evaluation SIGMOD 07

  5. k-min 0 1 U U ... U (1) (2) (k) 1/D hash K-Min Value (KMV) Synopsis Partition a • Hashing = dropping DVs uniformly on [0,1] • KMV synopsis: • Leads naturally to basic estimator [BJK+02] • Basic estimator: • All classic estimators approximate the basic estimator • Expected construction cost: • Space: b a X a X X X X X X X X X … e D distinct values SIGMOD 07

  6. Contributions: New Synopses & Estimators • Better estimators for classic KMV synopses • Better accuracy: unbiased, low mean-square error • Exact error bounds (in paper) • Asymptotic error bounds for sizing the synopses • Augmented KMV synopsis (AKMV) • Permits DV estimates for compound partitions • Can handle deletions and incremental updates Synopsis A SA Combine SAop B Aop B Synopsis SB B SIGMOD 07

  7. Unbiased DV Estimator from KMV Synopsis • Unbiased Estimator [Cohen97]: • Exact error analysis based on theory of order statistics • Asymptotically optimal as k becomes large (MLE theory) • Analysis with many DVs • Theorem: • Proof: • Show that U(i)-U(i-1) approx exponential for large D • Then use [Cohen97] • Use above formula to size synopses a priori SIGMOD 07

  8. Outline • Background on KMV synopsis • An unbiased low-variance DV estimator • Optimality • Asymptotic error analysis for synopsis sizing • Compound Partitions • Union, intersection, set difference • Multiset Difference: AKMV synopses • Deletions • Empirical Evaluation SIGMOD 07

  9. U (k) (Multiset) Union of Partitions k-min k-min LA LB X X X X  X X X X 0 … 1 0 … 1 k-min L X X X X X X X X X 0 … 1 • Combine KMV synopses: L=LALB • Theorem: L is a KMV synopsis of AB • Can use previous unbiased estimator: SIGMOD 07

  10. (Multiset) Intersection of Partitions • L=LALB as with union (contains k elements) • Note: L corresponds to a uniform random sample of DVs in AB • K = # values in L that are also in D(AB) • Theorem: Can compute from LA and LB alone • K/k estimates Jaccard distance: • estimates • Unbiased estimator of #DVs in the intersection: • See paper for variance of estimator • Can extend to general compound partitions from ordinary set operations SIGMOD 07

  11. L+E=(LE,cE) AKMV Synopsis E L+G=(LE LF,hop(cE,cF)) G=Eop F Combine AKMV Synopsis F L+F=(LF,cF) Multiset Differences: AKMV Synopsis • Augment KMV synopsis with multiplicity counters L+=(L,c) • Space: M=max multiplicity • Proceed almost exactly as before i.e. L+(E/F)=(LELF,(cE-cF)+) • Unbiased DV estimator: Kg is the #positive counters • Closure property: • Can also handle deletions SIGMOD 07

  12. 0.1 0.08 0.06 Average ARE 0.04 0.02 0 SDLogLog Unbiased-KMV Sample-Counting Baseline Accuracy Comparison SIGMOD 07

  13. 500 400 300 200 100 0 0 0.10 0.05 0.15 ARE Compound Partitions Unbiased-KMV/Intersections Unbiased-KMV/Unions Unbiased-KMV/Jaccard SDLogLog/Intersections SDLogLog/Unions SDLogLog/Jaccard Frequency SIGMOD 07

  14. Conclusions • DV estimation for scalable, flexible synopsis warehouse • Better estimators for classic KMV synopses • DV estimation for compound partitions via AKMV synopses • Closure property • Theoretical contributions • Order statistics for exact/asymptotic error analysis • Asymptotic efficiency via MLE theory A new spin on an old problem SIGMOD 07

More Related