
End of Chapter 8



Presentation Transcript


  1. End of Chapter 8 Neil Weisenfeld March 28, 2005

  2. Outline • 8.6 MCMC for Sampling from the Posterior • 8.7 Bagging • 8.7.1 Examples: Trees with Simulated Data • 8.8 Model Averaging and Stacking • 8.9 Stochastic Search: Bumping

  3. MCMC for Sampling from the Posterior • Markov chain Monte Carlo (MCMC) methods • Estimate the parameters of a Bayesian model by sampling from the posterior distribution • Gibbs sampling, a form of MCMC, is analogous to EM, except that it samples from the conditional distributions rather than maximizing over them

  4. Gibbs Sampling • We wish to draw a sample from the joint distribution of random variables U_1, U_2, …, U_K • This may be difficult, but it is easy to simulate from the conditional distributions Pr(U_k | U_l, l ≠ k) • The Gibbs sampler simulates from each of these conditionals in turn • The process produces a Markov chain whose stationary distribution equals the desired joint distribution

  5. Algorithm 8.3: Gibbs Sampler • 1. Take some initial values U_k^(0), k = 1, 2, …, K • 2. For t = 1, 2, …: for k = 1, 2, …, K generate U_k^(t) from Pr(U_k^(t) | U_1^(t), …, U_(k−1)^(t), U_(k+1)^(t−1), …, U_K^(t−1)) • 3. Continue step 2 until the joint distribution of (U_1^(t), U_2^(t), …, U_K^(t)) does not change
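The conditional-resampling loop of Algorithm 8.3 can be sketched in a few lines of Python. This is a hedged illustration, not code from the source: the target is a bivariate normal with correlation rho, chosen because both full conditionals are known univariate normals, X | Y = y ~ N(rho·y, 1 − rho²) and symmetrically for Y | X.

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0          # step 1: initial values U^(0)
    samples = []
    for _ in range(n_iter):  # step 2: cycle through the full conditionals
        x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
        y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
        samples.append((x, y))
    return np.array(samples)

draws = gibbs_bivariate_normal()
burned = draws[1000:]        # step 3: discard burn-in before summarizing
print(np.corrcoef(burned.T)[0, 1])  # sample correlation close to rho = 0.8
```

Successive draws are autocorrelated, which is why a burn-in portion of the chain is discarded before computing summaries.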

  6. Gibbs Sampling • We only need to be able to sample from the conditional distributions; but if the explicit form of the conditional density Pr(U_k | U_l, l ≠ k) is known, then the average of these densities over the post-burn-in samples, (1/(M − m)) Σ_(t=m+1)^M Pr(U_k | U_l^(t), l ≠ k), is a better estimate of the marginal density of U_k

  7. Gibbs sampling for mixtures • Consider the latent data Δ_i from the EM procedure to be additional parameters • See the algorithm (next slide): the same as EM, except that we sample instead of maximize • Additional steps can be added to include other informative priors

  8. Algorithm 8.4: Gibbs sampling for mixtures • 1. Take some initial values θ^(0) = (μ_1^(0), μ_2^(0)) • 2. Repeat for t = 1, 2, …: (a) for i = 1, 2, …, N generate Δ_i^(t) ∈ {0, 1} with Pr(Δ_i^(t) = 1) = γ̂_i(θ^(t)), the responsibility of component 2 for observation i; (b) set μ̂_1 = Σ_i (1 − Δ_i^(t)) y_i / Σ_i (1 − Δ_i^(t)) and μ̂_2 = Σ_i Δ_i^(t) y_i / Σ_i Δ_i^(t), and generate μ_1^(t) ~ N(μ̂_1, σ̂_1^2) and μ_2^(t) ~ N(μ̂_2, σ̂_2^2) • 3. Continue step 2 until the joint distribution of (Δ^(t), μ_1^(t), μ_2^(t)) doesn't change.
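A minimal Python sketch of Algorithm 8.4, under the simplification used in Figure 8.8: the variances and mixing proportion are held fixed (set by assumption to sigma = 1 and pi = 0.5), and only the latent indicators and the two means are sampled. The function names and synthetic data are illustrative.

```python
import numpy as np

def gibbs_mixture(y, pi=0.5, sigma=1.0, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    mu1, mu2 = y.min(), y.max()           # step 1: crude initial values
    chain = []
    for _ in range(n_iter):
        # step 2(a): sample each Delta_i from its responsibility gamma_i
        p1 = pi * np.exp(-0.5 * ((y - mu2) / sigma) ** 2)
        p0 = (1 - pi) * np.exp(-0.5 * ((y - mu1) / sigma) ** 2)
        delta = rng.random(y.size) < p1 / (p0 + p1)
        # step 2(b): sample mu1, mu2 from normals centred at the cluster means
        n1, n2 = (~delta).sum(), delta.sum()
        if n1:
            mu1 = rng.normal(y[~delta].mean(), sigma / np.sqrt(n1))
        if n2:
            mu2 = rng.normal(y[delta].mean(), sigma / np.sqrt(n2))
        chain.append((mu1, mu2))
    return np.array(chain)

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
chain = gibbs_mixture(y)
print(chain[500:].mean(axis=0))   # posterior means near (-2, 3)
```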

  9. Figure 8.8: Gibbs Sampling from Mixtures Simplified case with fixed variances and mixing proportion

  10. Outline • 8.6 MCMC for Sampling from the Posterior • 8.7 Bagging • 8.7.1 Examples: Trees with Simulated Data • 8.8 Model Averaging and Stacking • 8.9 Stochastic Search: Bumping

  11. 8.7 Bagging • Uses the bootstrap to improve the estimate itself • The bootstrap mean is approximately a posterior average • Consider a regression problem: fit f̂(x) to training data Z = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} • Bagging averages the estimates over bootstrap samples Z*b to produce f̂_bag(x) = (1/B) Σ_(b=1)^B f̂*b(x)

  12. Bagging, cont'd • The point is to reduce the variance of the estimate while leaving the bias unchanged • f̂_bag(x) is a Monte-Carlo estimate of the "true" bagging estimate, approaching it as B → ∞ • The bagged estimate will differ from the original estimate only when the latter is an adaptive or non-linear function of the data
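The Monte-Carlo bagging estimate can be sketched directly from the formula f̂_bag(x) ≈ (1/B) Σ_b f̂*b(x). The base learner below, a nearest-neighbour mean (a non-linear function of the data), is an illustrative assumption; any adaptive fitter could be substituted.

```python
import numpy as np

def knn_fit_predict(x_train, y_train, x, k=5):
    # illustrative non-linear base estimate: mean of the k nearest y values
    idx = np.argsort(np.abs(x_train - x))[:k]
    return y_train[idx].mean()

def bagged_predict(x_train, y_train, x, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x_train)
    preds = []
    for _ in range(B):                 # B bootstrap samples Z*b
        b = rng.integers(0, n, n)      # draw n indices with replacement
        preds.append(knn_fit_predict(x_train[b], y_train[b], x))
    return np.mean(preds)              # average of the bootstrap fits

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 100)
y_train = np.sin(x_train) + rng.normal(0, 0.3, 100)
print(bagged_predict(x_train, y_train, 0.0))   # close to sin(0) = 0
```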

  13. Bagging B-Spline Example • Bagging would average the curves in the lower left-hand corner at each x value.

  14. Quick Tree Intro • Can’t do. • Recursive subdivision. • Tree. • f-hat.

  15. Spam Example

  16. Bagging Trees • Each run produces a different tree • Each tree may have different terminal nodes • The bagged estimate is the average prediction at x from the B trees. The prediction can be a 0/1 indicator function, in which case bagging gives p̂_k, the proportion of trees predicting class k at x.
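The 0/1-vote version can be sketched as follows. The "tree" is simplified by assumption to a single best axis-oriented split (a stump), and the returned value is the proportion of bootstrap fits voting class 1 at x.

```python
import numpy as np

def fit_stump(X, y):
    # exhaustive search for the best single-feature split (illustrative "tree")
    best = (0, 0.0, 0, 1, 1.0)  # (feature, threshold, left_class, right_class, err)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            for cl in (0, 1):
                err = np.mean(np.where(left, cl, 1 - cl) != y)
                if err < best[4]:
                    best = (j, t, cl, 1 - cl, err)
    return best

def stump_predict(stump, x):
    j, t, cl, cr, _ = stump
    return cl if x[j] <= t else cr

def bagged_vote(X, y, x, B=25, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(B):                       # one stump per bootstrap sample
        b = rng.integers(0, len(y), len(y))
        votes.append(stump_predict(fit_stump(X[b], y[b]), x))
    return np.mean(votes)                    # p_1: share of trees voting class 1

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)                # class determined by feature 0
print(bagged_vote(X, y, np.array([2.0, 0.0])))  # prints 1.0
```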

  17. 8.7.1: Example Trees with Simulated Data • Original tree and 5 trees grown on bootstrap samples • Two classes, five Gaussian features • Y generated with Pr(Y = 1 | x_1 ≤ 0.5) = 0.2 and Pr(Y = 1 | x_1 > 0.5) = 0.8 • Bayes error 0.2 • Trees fit to 200 bootstrap samples

  18. Example Performance • High variance among the trees because the features have pairwise correlation 0.95 • Bagging successfully smooths out the variance and reduces test error.

  19. Where Bagging Doesn't Help • The classifier is a single axis-oriented split • The split is chosen along either x_1 or x_2 so as to minimize training error • Boosting is shown on the right.

  20. Outline • 8.6 MCMC for Sampling from the Posterior • 8.7 Bagging • 8.7.1 Examples: Trees with Simulated Data • 8.8 Model Averaging and Stacking • 8.9 Stochastic Search: Bumping

  21. Model Averaging and Stacking • Bayesian model averaging is more general • Given candidate models M_m, m = 1, …, M, a training set Z, and a quantity of interest ζ, the posterior of ζ is Pr(ζ | Z) = Σ_(m=1)^M Pr(ζ | M_m, Z) Pr(M_m | Z) • The Bayesian prediction is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model

  22. Other Averaging Strategies • Simple unweighted average of the predictions (treats each model as equally likely a posteriori) • BIC: use it to estimate the posterior model probabilities, weighting each model according to its fit and how many parameters it uses • Full Bayesian strategy: Pr(M_m | Z) ∝ Pr(M_m) Pr(Z | M_m) = Pr(M_m) ∫ Pr(Z | θ_m, M_m) Pr(θ_m | M_m) dθ_m
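The BIC weighting scheme can be sketched numerically: with equal model priors, the posterior model probabilities are approximated by w_m ∝ exp(−BIC_m/2), where BIC_m = −2·loglik_m + d_m·log N. The log-likelihoods below are hypothetical numbers, used only to show how a better raw fit can still lose once its extra parameters are penalised.

```python
import numpy as np

def bic_weights(log_liks, n_params, n):
    # BIC_m = -2 * loglik_m + d_m * log(n)
    bic = -2 * np.asarray(log_liks, dtype=float) + np.asarray(n_params) * np.log(n)
    a = -0.5 * (bic - bic.min())   # subtract the min for numerical stability
    w = np.exp(a)
    return w / w.sum()             # normalized posterior model probabilities

# hypothetical fits: model 2 fits slightly better but uses many more parameters
print(bic_weights(log_liks=[-520.0, -515.0], n_params=[3, 12], n=200))
```

Here the simpler model receives nearly all the weight: its worse log-likelihood is more than offset by the parameter penalty.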

  23. Frequentist Viewpoint of Averaging • Given predictions f̂_1(x), …, f̂_M(x) from M models, we seek the optimal weights w: ŵ = argmin_w E_P [Y − Σ_m w_m f̂_m(x)]^2 • The input x is fixed, and the N observations in Z are distributed according to P • The solution is the population linear regression of Y on the vector of model predictions F̂(x)^T = (f̂_1(x), …, f̂_M(x)): ŵ = E_P[F̂(x) F̂(x)^T]^(−1) E_P[F̂(x) Y]
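The sample analogue of this population regression is an ordinary least-squares fit of y on the N × M matrix of stacked model predictions. The two candidate "models" below, f_1(x) = x and f_2(x) = x², are illustrative assumptions.

```python
import numpy as np

def stacking_weights(F, y):
    # least-squares solution of min_w ||y - F w||^2 (no intercept)
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return w

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 2 * x + rng.normal(0, 0.1, 200)          # true regression function is 2x
F = np.column_stack([x, x ** 2])             # stacked predictions of two models
w = stacking_weights(F, y)
print(np.round(w, 2))                        # weights near (2, 0)
```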

  24. Notes on the Frequentist Viewpoint • At the population level, adding models with arbitrary weights can only help • But the population is, of course, not available • Regression over the training set can be used instead, but this may not be ideal: model complexity is not taken into account…

  25. Stacked Generalization, Stacking • Using cross-validated predictions avoids giving unfairly high weight to models with high complexity • If w is restricted to vectors with one unit weight and the rest zero, stacking reduces to choosing the model with the smallest leave-one-out cross-validation error • In practice we use the combined models with the optimal weights: better prediction, but less interpretability

  26. Outline • 8.6 MCMC for Sampling from the Posterior • 8.7 Bagging • 8.7.1 Examples: Trees with Simulated Data • 8.8 Model Averaging and Stacking • 8.9 Stochastic Search: Bumping

  27. Stochastic Search: Bumping • Rather than averaging models, try to find a single better model • Good for avoiding local minima in the fitting method • As in bagging, draw bootstrap samples and fit the model to each, but then choose the model that best fits the original training data

  28. Stochastic Search: Bumping • Given B bootstrap samples Z*1, …, Z*B, fitting the model to each yields predictions f̂*b(x), b = 1, …, B • For squared error, choose the model from bootstrap sample b̂ = argmin_b Σ_(i=1)^N [y_i − f̂*b(x_i)]^2; the original sample is included among the candidates • Bumping tries to move around the model space by perturbing the data.
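A sketch of the bumping loop, under illustrative assumptions: the model is a one-split regression stump, each bootstrap fit is scored by squared error on the original training data, and the fit to the original sample itself is kept as one of the candidates.

```python
import numpy as np

def fit_stump(x, y):
    # choose the threshold minimizing squared error on the fitted sample
    best_t, best_err = x[0], np.inf
    for t in x:
        left = x <= t
        if left.all() or not left.any():
            continue
        pred = np.where(left, y[left].mean(), y[~left].mean())
        err = ((y - pred) ** 2).sum()
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def stump_predict(x_train, y_train, t, x):
    # leaf means are re-derived from the data being scored (a simplification)
    left_mean = y_train[x_train <= t].mean()
    right_mean = y_train[x_train > t].mean()
    return np.where(x <= t, left_mean, right_mean)

def bump(x, y, B=20, seed=0):
    rng = np.random.default_rng(seed)
    candidates = [fit_stump(x, y)]              # original sample is a candidate
    for _ in range(B):
        b = rng.integers(0, len(x), len(x))     # fit to each bootstrap sample
        candidates.append(fit_stump(x[b], y[b]))
    # score every candidate on the ORIGINAL training data, keep the best
    errs = [((y - stump_predict(x, y, t, x)) ** 2).sum() for t in candidates]
    return candidates[int(np.argmin(errs))]

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = (x > 0).astype(float) + rng.normal(0, 0.1, 100)
print(bump(x, y))   # chosen split close to the true step at 0
```

In this toy case the fit to the original sample already wins; bumping pays off when the fitter is greedy or unstable, as in the contrived example on the next slide.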

  29. A contrived case where bumping helps • The greedy tree-based algorithm tries to split on each dimension separately, first one and then the other • Bumping stumbles upon the right answer.
