The Bag of Little Bootstraps

This presentation introduces the Bag of Little Bootstraps (BLB), a procedure that combines subsampling and the bootstrap to assess estimator quality efficiently on distributed computing platforms. Like subsampling, it works on small subsets of the data; like the bootstrap, it requires no analytical rescaling. Empirical results show that it is effective for estimating confidence intervals.

Presentation Transcript


  1. The Bag of Little Bootstraps. Michael I. Jordan, with Ariel Kleiner, Purna Sarkar and Ameet Talwalkar. April 26, 2017

  2. Setup • Observe data $X_1, \ldots, X_n$ • Form a “parameter” estimate $\hat{q}_n = q(X_1, \ldots, X_n)$ • Want to compute an assessment x of the quality of our estimate $\hat{q}_n$ (e.g., a confidence region)

  3. The Unachievable Frequentist Ideal Ideally, we would • Observe many independent datasets of size n. • Compute $\hat{q}_n$ on each. • Compute x based on these multiple realizations of $\hat{q}_n$.

  4. The Unachievable Frequentist Ideal Ideally, we would • Observe many independent datasets of size n. • Compute $\hat{q}_n$ on each. • Compute x based on these multiple realizations of $\hat{q}_n$. But we only observe one dataset of size n.

  5. Sampling

  6. Approximation

  7. Pretend The Sample Is The Population

  8. The Bootstrap (Efron, 1979) • Plug in the empirical distribution in place of the population distribution when computing the risk
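
The plug-in idea on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the speaker's code: the function name `bootstrap_ci_width`, the choice of the sample mean as the estimator, the percentile interval, and the resample count are all assumptions made for the example.

```python
import numpy as np

def bootstrap_ci_width(data, estimator, n_resamples=200, alpha=0.05, seed=0):
    """Width of a percentile bootstrap confidence interval for estimator(data)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        # Treat the sample as the population: draw n points with replacement.
        resample = data[rng.integers(0, n, size=n)]
        estimates[i] = estimator(resample)
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return hi - lo

# Hypothetical data: 2,000 draws from a standard normal distribution.
sample = np.random.default_rng(42).normal(loc=0.0, scale=1.0, size=2000)
width = bootstrap_ci_width(sample, np.mean)
print(f"bootstrap 95% interval width for the mean: {width:.4f}")
```

For the mean of unit-variance data the width should land near 2 × 1.96 / sqrt(n), which gives a quick sanity check on the sketch.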

  9. The Bootstrap: Computational Issues • Seemingly a wonderful match to modern parallel and distributed computing platforms • But the expected number of distinct points in a bootstrap resample is ~ 0.632n • e.g., if the original dataset has size 1 TB, then expect a resample to have size ~ 632 GB • Can’t feasibly send resampled datasets of this size to distributed servers • Even if one could, can’t compute the estimate locally on datasets this large
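
The 0.632 figure follows from the probability that a given point appears at least once in n draws with replacement, 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632. A short simulation can confirm this; the value of n below is an arbitrary choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # illustrative dataset size, not taken from the slides

# One bootstrap resample, represented by the indices of the drawn points.
indices = rng.integers(0, n, size=n)
distinct_fraction = np.unique(indices).size / n

# Limiting value of 1 - (1 - 1/n)^n as n grows.
limit = 1 - np.exp(-1)
print(f"simulated: {distinct_fraction:.4f}, limit: {limit:.4f}")
```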

  10. Subsampling (Politis, Romano & Wolf, 1999) [Figure; label: n]

  11. Subsampling [Figure; labels: n, b]

  12. Subsampling • There are many subsets of size b < n • Choose some sample of them and apply the estimator to each • This yields fluctuations of the estimate, and thus error bars • But a key issue arises: the fact that b < n means that the error bars will be on the wrong scale (they’ll be too large) • Need to analytically correct the error bars

  13. Subsampling Summary of algorithm: • Repeatedly subsample b < n points without replacement from the original dataset of size n • Compute $q^*_b$ on each subsample • Compute x based on these multiple realizations of $q^*_b$ • Analytically correct to produce the final estimate of x for $\hat{q}_n$. The need for analytical correction makes subsampling less automatic than the bootstrap. Still, it has a much more favorable computational profile than the bootstrap. Let’s try it out in practice…
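
The four steps summarized above can be sketched as follows. The slides do not state the form of the analytical correction; the sketch assumes a root-n consistent estimator (the sample mean), for which interval widths shrink like 1/sqrt(sample size), so the raw width is multiplied by sqrt(b/n). The function name, data, and parameter values are hypothetical.

```python
import numpy as np

def subsampling_ci_width(data, estimator, b, n_subsamples=200, alpha=0.05, seed=0):
    """Return (raw, corrected) interval widths from size-b subsamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_subsamples)
    for i in range(n_subsamples):
        # Subsample b < n points WITHOUT replacement.
        subsample = data[rng.choice(n, size=b, replace=False)]
        estimates[i] = estimator(subsample)
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    raw_width = hi - lo
    # Assumed correction for a root-n consistent estimator: rescale from b to n.
    corrected_width = raw_width * np.sqrt(b / n)
    return raw_width, corrected_width

data = np.random.default_rng(1).normal(size=20_000)
b = int(len(data) ** 0.6)  # b(n) = n^0.6, one of the scalings used in the talk
raw, corrected = subsampling_ci_width(data, np.mean, b)
print(f"raw width: {raw:.4f}, corrected width: {corrected:.4f}")
```

The raw width is several times too large, which is the wrong-scale problem noted on the previous slide. The correction factor depends on knowing the estimator's convergence rate, which is why the procedure is less automatic than the bootstrap.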

  14. Empirical Results: Bootstrap and Subsampling • Multivariate linear regression with d = 100 and n = 50,000 on synthetic data. • x coordinates sampled independently from StudentT(3). • $y = w^T x + e$, where w in $R^d$ is a fixed weight vector and e is Gaussian noise. • Estimate $\hat{q}_n = \hat{w}_n$ in $R^d$ via least squares. • Compute a marginal confidence interval for each component of $\hat{w}_n$ and assess accuracy via the relative absolute deviation from the true confidence interval size, averaged across components. • For subsampling, use $b(n) = n^\gamma$ for various values of $\gamma$. • Similar results are obtained with Normal and Gamma data-generating distributions, as well as when estimating a misspecified model.
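
A scaled-down sketch of this experimental setup is shown below. Several details are assumptions for illustration only: the sizes are reduced from the slide's d = 100 and n = 50,000 so the example runs quickly, the weight vector is all ones, the noise has unit variance, and `relative_deviation` is one reading of the accuracy metric described on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5_000, 10           # reduced from n = 50,000, d = 100 for speed
w_true = np.ones(d)        # hypothetical fixed weight vector

# Covariates: coordinates drawn independently from Student's t with 3 d.o.f.
X = rng.standard_t(df=3, size=(n, d))
# Responses: linear model plus Gaussian noise (unit variance assumed).
y = X @ w_true + rng.normal(size=n)

# Least-squares estimate of the weight vector.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def relative_deviation(widths, true_widths):
    """Mean over components of |width - true width| / true width."""
    widths = np.asarray(widths, dtype=float)
    true_widths = np.asarray(true_widths, dtype=float)
    return np.mean(np.abs(widths - true_widths) / true_widths)

print("max coefficient error:", np.max(np.abs(w_hat - w_true)))
```

Here `relative_deviation` would be applied to the per-component interval widths produced by a given procedure and to reference widths obtained from many independent datasets.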

  15. Empirical Results: Bootstrap and Subsampling

  16. Bag of Little Bootstraps • I’ll now present a new procedure that combines the bootstrap and subsampling, and gets the best of both worlds

  17. Bag of Little Bootstraps • I’ll now discuss a new procedure that combines the bootstrap and subsampling, and gets the best of both worlds • It works with small subsets of the data, like subsampling, and thus is appropriate for distributed computing platforms

  18. Bag of Little Bootstraps • I’ll now present a new procedure that combines the bootstrap and subsampling, and gets the best of both worlds • It works with small subsets of the data, like subsampling, and thus is appropriate for distributed computing platforms • But, like the bootstrap, it doesn’t require analytical rescaling

  19. Towards the Bag of Little Bootstraps [Figure; labels: n, b]

  20. Towards the Bag of Little Bootstraps [Figure; label: b]

  21. Approximation

  22. Pretend the Subsample is the Population

  23. Pretend the Subsample is the Population • And bootstrap the subsample! • This means resampling n times with replacement, not b times as in subsampling

  24. The Bag of Little Bootstraps (BLB) • The subsample contains only b points, and so the resulting empirical distribution has its support on b points • But we can (and should!) resample it with replacement n times, not b times • Doing this repeatedly for a given subsample gives bootstrap confidence intervals on the right scale: no analytical rescaling is necessary! • Now do this (in parallel) for multiple subsamples and combine the results (e.g., by averaging)
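
The procedure described on this slide can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names, the weighted-mean estimator, the percentile interval, and the numbers of subsamples and resamples are all choices made for the example, and the outer loop runs serially where a real deployment would run it in parallel.

```python
import numpy as np

def blb_ci_width(data, weighted_estimator, b, n_subsamples=10,
                 n_resamples=100, alpha=0.05, seed=0):
    """Average interval width over bootstrapped size-b subsamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    widths = np.empty(n_subsamples)
    for j in range(n_subsamples):          # each iteration could run on its own worker
        subsample = data[rng.choice(n, size=b, replace=False)]
        estimates = np.empty(n_resamples)
        for k in range(n_resamples):
            # Resample n points (not b) from the b-point subsample,
            # stored as a count per subsample point.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates[k] = weighted_estimator(subsample, counts)
        lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
        widths[j] = hi - lo
    return widths.mean()                   # combine subsample results by averaging

def weighted_mean(points, counts):
    """Sample mean of the resample implied by the per-point counts."""
    return np.average(points, weights=counts)

data = np.random.default_rng(1).normal(size=20_000)
b = int(len(data) ** 0.6)
blb_width = blb_ci_width(data, weighted_mean, b)
print(f"BLB 95% interval width for the mean: {blb_width:.4f}")
```

Because each resample has nominal size n, the resulting width is already on the scale of the full dataset and needs no sqrt(b/n) style correction.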

  25. The Bag of Little Bootstraps (BLB)

  26. Bag of Little Bootstraps (BLB): Computational Considerations A key point: • Resources required to compute q generally scale in the number of distinct data points • This is true of many commonly used estimation algorithms (e.g., SVM, logistic regression, linear regression, kernel methods, general M-estimators, etc.) • Use a weighted representation of resampled datasets to avoid physical data replication. Example: if the original dataset has size 1 TB with each data point 1 MB, and we take $b(n) = n^{0.6}$, then expect • subsampled datasets to have size ~ 4 GB • resampled datasets to have size ~ 4 GB (in contrast, bootstrap resamples have size ~ 632 GB)
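
The arithmetic behind this example can be checked directly. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the slide does not specify; the variable names are illustrative.

```python
import numpy as np

dataset_bytes = 10**12           # 1 TB (decimal units assumed)
point_bytes = 10**6              # 1 MB per data point
n = dataset_bytes // point_bytes # 1,000,000 data points

b = int(round(n ** 0.6))         # subsample size b(n) = n^0.6
subsample_gb = b * point_bytes / 10**9
bootstrap_gb = (1 - np.exp(-1)) * dataset_bytes / 10**9

# A BLB resample of nominal size n touches at most the b subsample points,
# so it can be stored as b integer counts next to the subsample itself.
rng = np.random.default_rng(0)
counts = rng.multinomial(n, np.full(b, 1.0 / b))

print(f"b = {b} points, subsample ~ {subsample_gb:.2f} GB")
print(f"bootstrap resample ~ {bootstrap_gb:.0f} GB of distinct points")
print(f"counts vector: {counts.size} entries summing to {counts.sum()}")
```

The counts vector adds only b integers of overhead, so a resample occupies essentially the same ~4 GB as the subsample.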

  27. Empirical Results: Bag of Little Bootstraps (BLB)

  28. Empirical Results: Bag of Little Bootstraps (BLB)

  29. BLB: Theoretical Results BLB is asymptotically consistent and higher-order correct (like the bootstrap), under essentially the same conditions that have been used in prior analysis of the bootstrap. Theorem (asymptotic consistency): Under standard assumptions (particularly that q is Hadamard differentiable and x is continuous), the output of BLB converges to the population value of x as n and b approach ∞.

  30. BLB: Theoretical Results: Higher-Order Correctness Assume: • q is a studentized statistic. • $x(Q_n(P))$, the population value of x for $\hat{q}_n$, can be written as [equation not captured in the transcript], where the $p_k$ are polynomials in population moments. • The empirical version of x based on resamples of size n from a single subsample of size b can also be written as [equation not captured in the transcript], where the corresponding coefficients are polynomials in the empirical moments of subsample j. • b ≤ n and [condition not captured in the transcript]

  31. BLB: Theoretical Results: Higher-Order Correctness Then: [result not captured in the transcript]

  32. Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems. Sameer Agarwal, Michael Jordan, Ariel Kleiner, Samuel Madden, Henry Milner, Barzan Mozafari, Ion Stoica, & Ameet Talwalkar. UC Berkeley, MIT

  33. Approximate Query Processing [Diagram labels: A ± ε; SELECT foo(*) FROM TABLE WITHIN 2; Query Interface; Error Estimation; Query Execution; Data Subsets/Synopsis]

  34. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Bootstrap; 69,428 Hive Aggregate Queries (Feb ’13)]

  35. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Inapplicable, Bootstrap; 69,428 Hive Aggregate Queries (Feb ’13)]

  36. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Correct Estimation, Bootstrap; 69,428 Hive Aggregate Queries (Feb ’13)]

  37. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Under Estimation, Bootstrap; 69,428 Hive Aggregate Queries (Feb ’13)]

  38. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Bootstrap, Over Estimation; 69,428 Hive Aggregate Queries (Feb ’13)]

  39. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Bootstrap; 18,321 Hive Aggregate Queries (Feb ’13)]

  40. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Inapplicable, Bootstrap; 18,321 Hive Aggregate Queries (Feb ’13)]

  41. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Correct Estimation, Bootstrap; 18,321 Hive Aggregate Queries (Feb ’13)]

  42. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Under Estimation, Bootstrap; 18,321 Hive Aggregate Queries (Feb ’13)]

  43. Problem: Error Estimation is Unreliable [Chart labels: Closed Forms, Bootstrap, Over Estimation; 18,321 Hive Aggregate Queries (Feb ’13)]

  44. Problem: Error Estimation is Unreliable [Chart labels: 96%, Closed Forms, Over Estimates, 96.8%, Bootstrap; 69,428 Hive Aggregate Queries (Feb ’13)]

  45. Problem: Error Estimation is Unreliable [Chart labels: 97%, Closed Forms, Over Estimates, 97.2%, Bootstrap; 18,321 Hive Aggregate Queries (Feb ’13)]

  46. Overall Query Execution [Chart labels: Query Execution; axis: Response Time (s)]

  47. Overall Query Execution [Chart labels: Error Estimation Overhead; axis: Response Time (s)]

  48. Overall Query Execution [Chart labels: Diagnostics Overhead; axis: Response Time (s)]

  49. Summary: Query Execution [Diagram labels: A ± ε; SELECT foo(*) FROM TABLE WITHIN 2; Y/N; Query Interface; Error Estimation; Query Execution; Data Storage]
