Using Sketches to Estimate Associations
Ping Li (Cornell) and Kenneth Church (Microsoft)
DIMACS

Original contingency table:          Sample contingency table:
        W2    ~W2                            W2    ~W2
W1      a     b                      W1      as    bs
~W1     c     d                      ~W1     cs    ds
Ken Church, Albert Greenberg, James Hamilton
{church, albert, jamesrh}@microsoft.com
On Delivering Embarrassingly Distributed Cloud Services (HotNets-2008)
Affordable? $1B vs. $2M
Containers: Disruptive Technology
• Implications for shipping: new ships, ports, unions
• Implications for Hotnets: new data center designs; power/networking trade-offs
• Cost models: expense vs. capital
• Apps: embarrassingly distributed (a restriction on embarrassingly parallel)
• Machine models: distributed parallel cluster vs. parallel cluster
Mega vs. Micro Data Centers
Related Work
• http://en.wikipedia.org/wiki/Data_center
  • "A data center can occupy one room of a building…"
  • "Servers differ greatly in size from 1U servers to large … silos"
  • "Very large data centers may use shipping containers…" [2]
• 220 containers in one PoP vs. 220 containers in 220 PoPs
Embarrassingly Distributed Probes
• W1 & W2 are shipping containers
• Lots of bandwidth within a container, but less across containers
• Limited bandwidth → sampling

Original contingency table:          Sample contingency table:
        W2    ~W2                            W2    ~W2
W1      a     b                      W1      as    bs
~W1     c     d                      ~W1     cs    ds
≈ 1990: "strong": 427M page hits (Google); "powerful": 353M (Google). Page hits ≈ 1000× BNC freqs.
PMI + The Web: better together — Turney (and KnowItAll)
"It never pays to think until you've run out of data" – Eric Brill
Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
• Moore's Law constant: data collection rates vs. improvement rates
• No consistently best learner
• More data is better data!
• Quoted out of context: "Fire everybody and spend the money on data"
# of (English) documents D ≈ 10^10. Lots of hits even for very rare words. Larger corpora → larger counts → more signal. Page-hit estimates by MSN and Google (August 2005).
Joint frequencies ought to decrease monotonically as we add more terms to the query. Caution: estimates ≠ actuals — these are just quick-and-dirty estimates, not actual counts.
Query Planning (Governator): Rule of Thumb. The rule of thumb breaks down when there are strong interactions — common for the cases of most interest.
Associations (PMI, MI, cos, R, cor, …) are summaries of the contingency table.

        W2    ~W2
W1      a     b        a: # of documents that contain both word W1 and word W2
~W1     c     d        b: # of documents that contain word W1 but not word W2

Margins (aka doc freq): f1 = a + b, f2 = a + c; total D = a + b + c + d
• 4 parameters: a, b, c, d
• 3 constraints: f1, f2, D
• Need just one more constraint to compute the table (and its summaries)
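The summaries above follow mechanically from the four cell counts. A minimal sketch (the specific formulas for PMI, cosine, and resemblance are standard definitions, not taken from the slides):

```python
import math

def association_scores(a, b, c, d):
    """Summaries of a 2x2 word contingency table.
    a: docs with both W1 and W2; b: W1 only; c: W2 only; d: neither."""
    D = a + b + c + d              # total number of documents
    f1, f2 = a + b, a + c          # margins (document frequencies)
    pmi = math.log(a * D / (f1 * f2))   # pointwise mutual information
    cosine = a / math.sqrt(f1 * f2)     # cosine similarity of the postings
    resemblance = a / (a + b + c)       # Broder's resemblance R = |P1∩P2|/|P1∪P2|
    return pmi, cosine, resemblance

pmi, cos_, R = association_scores(a=2, b=5, c=3, d=26)
```

With the margins f1, f2 and D known, any one of these summaries pins down the whole table.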
Postings give us margins (and more). (Postings aka inverted file.)
Postings(w): a sorted list of doc IDs for w.
Doc #13: "… This pig is so cute …"  Doc #25: "… saw a flying pig …"  Doc #33: "… was raining pigs and eggs …"
Assume doc IDs are random.
Conventional Random Sampling (over documents): the margin-free baseline, with sample size Ds. (Original and sample contingency tables as before.)
Random Sampling
• Over documents: simple and well understood, but problematic (undesirable) for rare events
• Over postings: where f = |P| (P = postings, aka inverted file; f aka doc freq or margin)
Sketches ≫ Random Samples: sketches (best) vs. random sampling over postings (better) vs. random sampling over documents (undesirable).
Outline
• Review random sampling (and introduce a running example)
• Sample: sketches — a generalization of Broder's original method
  • Advantage: larger as than random sampling
  • Disadvantage: estimation more challenging
• Estimation: maximum likelihood (MLE)
• Evaluation
Random Sampling over Documents
• Doc IDs are random integers between 1 and D = 36
• Small circles: word W1; small squares: word W2
• Choose a sample size Ds = 18; sampling rate = Ds/D = 50%
• Construct the sample contingency table:
  as = #{4, 15} = 2; bs = #{3, 7, 9, 10, 18} = 5;
  cs = #{2, 5, 8} = 3; ds = #{1, 6, 11, 12, 13, 14, 16, 17} = 8
• Estimation: a ≈ (D/Ds) · as — but that doesn't take advantage of the margins
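The running example can be replayed in a few lines. The postings below are the ones used on the next slide; the sample is the first Ds = 18 doc IDs:

```python
def sample_contingency(p1_sample, p2_sample, Ds):
    """Build the sample contingency table from sampled doc IDs.
    p1_sample / p2_sample: doc IDs of W1 / W2 that landed in the sample."""
    a_s = len(p1_sample & p2_sample)
    b_s = len(p1_sample - p2_sample)
    c_s = len(p2_sample - p1_sample)
    d_s = Ds - a_s - b_s - c_s
    return a_s, b_s, c_s, d_s

# The slides' running example: D = 36 documents, sample the first Ds = 18.
D, Ds = 36, 18
P1 = {3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28, 33}   # docs containing W1
P2 = {2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31, 35}   # docs containing W2
sample = set(range(1, Ds + 1))
a_s, b_s, c_s, d_s = sample_contingency(P1 & sample, P2 & sample, Ds)
a_hat = D / Ds * a_s   # margin-free scale-up: a ≈ (D/Ds) * a_s
```

This reproduces (as, bs, cs, ds) = (2, 5, 3, 8) and the margin-free estimate â = 4.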
Proposed Sketches: sketch = front of postings
Postings P1: 3 4 7 9 10 15 18 19 24 25 28 33
Postings P2: 2 4 5 8 15 19 21 24 27 28 31 35
Choose sample size Ds = min(18, 21) = 18; throw out doc IDs beyond Ds (red).
as = #{4, 15} = 2; bs = 7 − as = 5; cs = 5 − as = 3; ds = Ds − as − bs − cs = 8
(Sample contingency table based on blue − red.)
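A minimal sketch of the sketch construction itself, on the running example (K1 and K2 are the fronts of the two postings lists, here with k = 7):

```python
def sketch_counts(K1, K2):
    """Sample contingency counts from two sketches (fronts of postings).
    K1, K2: the first k doc IDs of each (sorted) postings list."""
    Ds = min(max(K1), max(K2))        # effective document-sample size
    s1 = {i for i in K1 if i <= Ds}   # throw out IDs past Ds ("red")
    s2 = {i for i in K2 if i <= Ds}
    a_s = len(s1 & s2)
    b_s = len(s1) - a_s
    c_s = len(s2) - a_s
    d_s = Ds - a_s - b_s - c_s
    return Ds, a_s, b_s, c_s, d_s

# Fronts (k = 7) of the running example's postings:
K1 = [3, 4, 7, 9, 10, 15, 18]
K2 = [2, 4, 5, 8, 15, 19, 21]
Ds, a_s, b_s, c_s, d_s = sketch_counts(K1, K2)
```

Because doc IDs are random, keeping everything below Ds = min(max K1, max K2) is equivalent to sampling the same Ds documents for both words, so the four counts form a valid sample table.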
Estimation: Maximum Likelihood (MLE). When we know the margins, we ought to use them.
• Consider all possible contingency tables (a, b, c, d)
• Select the table that maximizes the probability of the observations (as, bs, cs, ds)
Exact MLE: setting the first derivative of the log likelihood to zero gives the MLE solution. Problem: too complicated, with numerical problems.
Exact MLE: the log likelihood function is concave (second derivative), so there is a unique maximum; using a PMF updating formula, it suffices to solve g(a) = 1.
Exact MLE: the resulting expression is a cubic function of a; solving it yields the MLE solution.
An Approximate MLE: suppose we were sampling from the two inverted files directly and independently. Then maximize P(as, bs, cs; a) = P(as, bs; a) × P(as, cs; a).
An Approximate MLE: take the log of both sides and set the derivative to zero. This gives a convenient closed-form solution that is surprisingly accurate — the recommended estimator.
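The slides' closed-form solution is not reproduced here, but the factored likelihood itself is easy to maximize directly. A sketch, under the assumption that each sketch behaves like an independent hypergeometric sample from its own postings list (one natural reading of "sampling from the two inverted files directly and independently"); the exhaustive search over candidate values of a stands in for the closed form:

```python
from math import comb, log

def approx_mle(a_s, b_s, c_s, f1, f2):
    """Approximate MLE for a: maximize P(a_s,b_s; a) * P(a_s,c_s; a),
    treating each sketch as a hypergeometric sample from its postings."""
    n1, n2 = a_s + b_s, a_s + c_s   # sketch sizes inside the sample

    def loglik(a):
        # log P(a_s intersection hits) under Hypergeom(pop=f, succ=a, draws=n)
        def lh(f, n):
            return log(comb(a, a_s)) + log(comb(f - a, n - a_s)) - log(comb(f, n))
        return lh(f1, n1) + lh(f2, n2)

    candidates = range(a_s, min(f1 - b_s, f2 - c_s) + 1)
    return max(candidates, key=loglik)

# Running example: margins f1 = f2 = 12, sample counts (2, 5, 3).
a_hat = approx_mle(a_s=2, b_s=5, c_s=3, f1=12, f2=12)
```

On the running example this yields â = 4, close to the true intersection size a = 5 even at a 50% sampling rate.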
Evaluation: independence baseline vs. margin-free baseline vs. proposed method. When we know the margins, we ought to use them — the proposed method is best.
Theoretical Evaluation
• Not surprisingly, there is a trade-off between
  • computational work: space, time
  • statistical accuracy: variance, error
• Formulas state the trade-off precisely in terms of the sampling rate Ds/D
• The proposed MLE is better than the margin-free baseline, confirming the empirical evaluation
How many samples are enough? Sampling rate needed to achieve cv = SE/a < 0.5: larger D → smaller sampling rate. At web scale (D ≈ 10^10), a sampling rate of 10^−4 may suffice for "ordinary" words — a cluster of 10k machines → a single machine.
Broder's Sketch: Original & Minwise Estimates of Resemblance (R)
• Notation: words w1, w2; postings P1, P2 (sets of doc IDs); resemblance R; random permutation π
• Minwise sketch: permute doc IDs k times (π1 … πk); for each πi, let mini(P) be the smallest doc ID in πi(P)
• Original sketch: permute doc IDs once (π); let K = firstk(P) be the first k doc IDs in π(P); the sketches K1, K2 are sets of doc IDs (fronts of postings). Throws out half.
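The minwise estimator can be sketched directly: for a random permutation π, the probability that min(π(P1)) = min(π(P2)) equals R, so averaging over k permutations estimates R. A minimal simulation on the running example's postings (k and the seed are illustrative choices, not from the slides):

```python
import random

def minwise_resemblance(P1, P2, k, D, seed=0):
    """Estimate resemblance R = |P1 ∩ P2| / |P1 ∪ P2| with k
    independent random permutations of the doc-ID space (Broder)."""
    rng = random.Random(seed)
    ids = list(range(1, D + 1))
    matches = 0
    for _ in range(k):
        rng.shuffle(ids)                          # one random permutation pi
        rank = {doc: pos for pos, doc in enumerate(ids)}
        if min(P1, key=rank.get) == min(P2, key=rank.get):
            matches += 1                          # Pr[match] = R
    return matches / k

P1 = {3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28, 33}
P2 = {2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31, 35}
R_hat = minwise_resemblance(P1, P2, k=500, D=36)
# True R = |{4, 15, 19, 24, 28}| / 19 ≈ 0.263
```

Each permutation yields one bit of information (match / no match), which is what the slides mean by the minwise sketch "throwing out half" relative to keeping whole fronts of postings.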
Multi-way Associations: Evaluation. MSE relative improvement over the margin-free baseline. When we know the margins, we ought to use them, but gains are larger for 2-way than for multi-way: the degrees of freedom, 2^m − (m + 1), increase exponentially with the number of words m, suggesting margin constraints become less important as m increases.
Conclusions (1 of 2): when we know the margins, we ought to use them.
• Estimating contingency tables: a fundamental problem
• Practical apps: estimating page hits for two or more words (Governator); KnowItAll: estimating mutual information from page hits
• Baselines:
  • Independence: ignores interactions (awful)
  • Margin-free: ignores postings (wasteful)
  • (≈2×) Broder's sketch (WWW 97, STOC 98, STOC 2002): throws out half the sample
  • (≈10×) Random projections (ACL 2005, STOC 2002)
• Proposed method:
  • Sampling: like Broder's sketch, but throws out less; larger as than random sampling
  • Estimation: MLE — find the most likely contingency table given the observations as, bs, cs, ds (margin-free estimation is easy only because it ignores the margin constraints)
Conclusions (2 of 2): a rising tide of data lifts all boats — if you have a lot of data, then you don't need a lot of methodology.
• Recommended approximation trades off computational work (space and time) against statistical accuracy (variance and errors)
• Derived formulas for variance, showing how the trade-off depends on the sampling rate
• At Web scales, a sampling rate (Ds/D) of 10^−4 suffices: a cluster of 10k machines → a single machine
Comparison with Broder's Algorithm: Broder's method has larger variance (≈2×) because it uses only half the sketch: Var(R_MLE) ≪ Var(R_B), both for equal samples and for proportional samples.
Comparison with Broder's Algorithm: ratio of variances (equal samples), Var(R_MLE) ≪ Var(R_B).
Comparison with Broder's Algorithm: ratio of variances (proportional samples), Var(R_MLE) ≪ Var(R_B).
Comparison with Broder's Algorithm, estimation of resemblance: Broder's method throws out half the samples → 50% improvement.
Comparison with Random Projections, estimation of angle: huge improvement.