On Synopses for Distinct-Value Estimation Under Multiset Operations

On Synopses for Distinct-Value Estimation Under Multiset Operations Kevin Beyer Peter J. Haas Berthold Reinwald Yannis Sismanis IBM Almaden Research Center Rainer Gemulla Technische Universität Dresden

Introduction • Estimating # Distinct Values (DV) crucial for: • Data integration & cleaning • E.g. schema discovery, duplicate detection • Query optimization • Network monitoring • Materialized view selection for datacubes • Exact DV computation is impractical • Sort/scan/count or hash table • Problem: bad scalability • Approximate DV “synopses” • 25 year old literature • Hashing-based techniques SIGMOD 07

Motivation: A Synopsis Warehouse Full-Scale Warehouse Of Data Partitions Synopsis • Goal: discover partition characteristics & relationships to other partitions • Keys, functional dependencies, similarity metrics (Jaccard) • Similar to Bellman [DJMS02] • Accuracy challenge: small synopses sizes, many distinct values Synopsis Synopsis Warehouse of Synopses S1,1 S1,2 Sn,m combine etc S*,* S1-2,3-7 SIGMOD 07

Outline • Background on KMV synopsis • An unbiased low-variance DV estimator • Optimality • Asymptotic error analysis for synopsis sizing • Compound Partitions • Union, intersection, set difference • Multiset Difference: AKMV synopses • Deletions • Empirical Evaluation SIGMOD 07

k-min 0 1 U U ... U (1) (2) (k) 1/D hash K-Min Value (KMV) Synopsis Partition a • Hashing = dropping DVs uniformly on [0,1] • KMV synopsis: • Leads naturally to basic estimator [BJK+02] • Basic estimator: • All classic estimators approximate the basic estimator • Expected construction cost: • Space: b a X a X X X X X X X X X … e D distinct values SIGMOD 07

Contributions: New Synopses & Estimators • Better estimators for classic KMV synopses • Better accuracy: unbiased, low mean-square error • Exact error bounds (in paper) • Asymptotic error bounds for sizing the synopses • Augmented KMV synopsis (AKMV) • Permits DV estimates for compound partitions • Can handle deletions and incremental updates Synopsis A SA Combine SAop B Aop B Synopsis SB B SIGMOD 07

Unbiased DV Estimator from KMV Synopsis • Unbiased Estimator [Cohen97]: • Exact error analysis based on theory of order statistics • Asymptotically optimal as k becomes large (MLE theory) • Analysis with many DVs • Theorem: • Proof: • Show that U(i)-U(i-1) approx exponential for large D • Then use [Cohen97] • Use above formula to size synopses a priori SIGMOD 07

Outline • Background on KMV synopsis • An unbiased low-variance DV estimator • Optimality • Asymptotic error analysis for synopsis sizing • Compound Partitions • Union, intersection, set difference • Multiset Difference: AKMV synopses • Deletions • Empirical Evaluation SIGMOD 07

U (k) (Multiset) Union of Partitions k-min k-min LA LB X X X X  X X X X 0 … 1 0 … 1 k-min L X X X X X X X X X 0 … 1 • Combine KMV synopses: L=LALB • Theorem: L is a KMV synopsis of AB • Can use previous unbiased estimator: SIGMOD 07

(Multiset) Intersection of Partitions • L=LALB as with union (contains k elements) • Note: L corresponds to a uniform random sample of DVs in AB • K = # values in L that are also in D(AB) • Theorem: Can compute from LA and LB alone • K/k estimates Jaccard distance: • estimates • Unbiased estimator of #DVs in the intersection: • See paper for variance of estimator • Can extend to general compound partitions from ordinary set operations SIGMOD 07

L+E=(LE,cE) AKMV Synopsis E L+G=(LE LF,hop(cE,cF)) G=Eop F Combine AKMV Synopsis F L+F=(LF,cF) Multiset Differences: AKMV Synopsis • Augment KMV synopsis with multiplicity counters L+=(L,c) • Space: M=max multiplicity • Proceed almost exactly as before i.e. L+(E/F)=(LELF,(cE-cF)+) • Unbiased DV estimator: Kg is the #positive counters • Closure property: • Can also handle deletions SIGMOD 07

0.1 0.08 0.06 Average ARE 0.04 0.02 0 SDLogLog Unbiased-KMV Sample-Counting Baseline Accuracy Comparison SIGMOD 07

500 400 300 200 100 0 0 0.10 0.05 0.15 ARE Compound Partitions Unbiased-KMV/Intersections Unbiased-KMV/Unions Unbiased-KMV/Jaccard SDLogLog/Intersections SDLogLog/Unions SDLogLog/Jaccard Frequency SIGMOD 07

Conclusions • DV estimation for scalable, flexible synopsis warehouse • Better estimators for classic KMV synopses • DV estimation for compound partitions via AKMV synopses • Closure property • Theoretical contributions • Order statistics for exact/asymptotic error analysis • Asymptotic efficiency via MLE theory A new spin on an old problem SIGMOD 07

On Synopses for Distinct-Value Estimation Under Multiset Operations

On Synopses for Distinct-Value Estimation Under Multiset Operations

Presentation Transcript

Multiset Class and its Iterator

Join Synopses for Approximate Query Answering

Constructing Optimal Wavelet Synopses

Join Synopses for Approximate Query Answering

Distinct

Deposit Value Under Pressure

Distinct items:

On Synopses for Distinct-Value Estimation Under Multiset Operations

Join Synopses for Approximate Query Answering

Volatility Estimation for Stochastic Project Value Models

More on Estimation

A paper on Join Synopses for Approximate Query Answering

Ratio estimation under SRS

Operations – II Value chain basics

Decimal Place Value and Estimation

Join Synopses for Approximate Query Answering

Perceptron-based Global Confidence Estimation for Value Prediction

Join Synopses for Approximate Query Answering

Volatility Estimation for Stochastic Project Value Models

Breeding Value Estimation

Constructing Optimal Wavelet Synopses