Model generalization

Presentation Transcript


  1. Model generalization
     - Brief summary of methods
     - Bias, variance and complexity
     - Brief introduction to Stacking

  2. Brief summary
     A spectrum of classifiers, from those that need expert input to select representations to those with the capability to learn highly abstract representations:
     - Shallow classifiers (linear machines): linear partition
     - Parametric nonlinear models: nonlinear partition
     - Kernel machines: flexible nonlinear partition; "do not generalize well far from the training examples"
     - Deep neural networks: extract higher-order information

  3. [Diagram: training data and testing data are each fed through the model, giving the training error rate and the testing error rate.] Good performance on testing data, which is independent of the training data, is the most important property of a model; it serves as the basis for model selection.

  4. Test error
     To evaluate prediction accuracy, loss functions are needed.
     Continuous Y: squared error L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2 or absolute error L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|.
     In classification, with categorical G (class label): 0-1 loss L(G, \hat{G}(X)) = I(G \neq \hat{G}(X)), or the deviance
     L(G, \hat{p}(X)) = -2 \sum_{k=1}^{K} I(G = k) \log \hat{p}_k(X) = -2 \log \hat{p}_G(X),
     where \hat{p}_k(X) = \widehat{\Pr}(G = k \mid X).
     Except in rare cases (e.g. 1-nearest-neighbor), the trained classifier always gives a probabilistic outcome.
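
Not from the slides: a minimal numpy sketch of these losses (squared error, absolute error, 0-1 loss, and the -2 log-likelihood deviance) on made-up toy arrays; all variable names and values are illustrative only.

```python
import numpy as np

# Toy data (made up for illustration): continuous targets and predictions
y = np.array([2.0, 1.5, 3.0])
f_hat = np.array([1.8, 1.9, 2.5])

squared_error = (y - f_hat) ** 2        # L(Y, f_hat(X)) = (Y - f_hat(X))^2
absolute_error = np.abs(y - f_hat)      # L(Y, f_hat(X)) = |Y - f_hat(X)|

# Toy classification: true labels in {0, 1, 2} and predicted class probabilities
g = np.array([0, 2, 2])
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

g_hat = p_hat.argmax(axis=1)                            # plug-in classifier
zero_one_loss = (g != g_hat).astype(float)              # I(G != G_hat(X))
deviance = -2.0 * np.log(p_hat[np.arange(len(g)), g])   # -2 log p_hat_G(X)

print(squared_error.mean(), absolute_error.mean(),
      zero_one_loss.mean(), deviance.mean())
```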

  5. Test error
     The log-likelihood can be used as a loss function for general response densities, such as the Poisson, gamma, exponential, log-normal and others. If \Pr_{\theta(X)}(Y) is the density of Y, indexed by a parameter \theta(X) that depends on the predictor X, then L(Y, \theta(X)) = -2 \log \Pr_{\theta(X)}(Y). The factor of 2 makes the log-likelihood loss for the Gaussian distribution match squared-error loss.
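
An illustration (not from the slides) of this loss for a Poisson response, using scipy's log-pmf; the toy counts and the fitted rates theta_hat are made up, and in practice theta(X) would come from whatever model is being evaluated.

```python
import numpy as np
from scipy.stats import poisson

def poisson_loss(y, theta_hat):
    """-2 log Pr_theta(X)(Y) for a Poisson response with estimated rate theta_hat."""
    return -2.0 * poisson.logpmf(y, mu=theta_hat)

# Toy counts and fitted rates (made up for illustration)
y = np.array([0, 3, 1, 4])
theta_hat = np.array([0.5, 2.0, 1.5, 3.0])
print(poisson_loss(y, theta_hat).mean())
```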

  6. Test error
     Test error is the expected loss over an INDEPENDENT test set, Err = E[L(Y, \hat{f}(X))], where the expectation is taken with regard to everything that is random: both the training set and the test set. In practice it is more feasible to estimate the test error given a particular training set T: Err_T = E[L(Y, \hat{f}(X)) \mid T]. Training error is the average loss over just the training set: \overline{err} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i)).
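
A minimal simulation sketch (not from the slides): fit a model on a training set, compute the training error on that same set, and estimate Err_T on a large independent test set. The sine truth, noise level, cubic polynomial fit and sample sizes are all arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

def simulate(n):
    # Assumed toy truth: f(x) = sin(2x) plus Gaussian noise
    x = rng.uniform(0, 3, size=(n, 1))
    y = np.sin(2 * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = simulate(100)    # training set T
x_test, y_test = simulate(10_000)   # large independent test set

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x_train, y_train)

training_error = np.mean((y_train - model.predict(x_train)) ** 2)  # average loss over T
test_error = np.mean((y_test - model.predict(x_test)) ** 2)        # estimates Err_T

print(f"training error {training_error:.3f}  test error {test_error:.3f}")
```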

  7. Test error

  8. Test error

  9. Test error
     For a categorical outcome, test error is Err_T = E[L(G, \hat{G}(X)) \mid T]. Training error is the sample analogue, e.g. \overline{err} = -\frac{2}{N} \sum_{i=1}^{N} \log \hat{p}_{g_i}(x_i) for the deviance loss, or \frac{1}{N} \sum_{i=1}^{N} I(g_i \neq \hat{G}(x_i)) for 0-1 loss.

  10. Goals in model building
      (1) Model selection: estimate the performance of different models and choose the best one.
      (2) Model assessment: estimate the prediction error of the chosen model on new data.

  11. Goals in model building
      Ideally, we would have enough data to divide into three sets:
      - Training set: used to fit the models
      - Validation set: used to estimate the prediction error of each model, for the purpose of model selection
      - Test set: used to assess the generalization error of the final chosen model
      A typical split: 50% for training and 25% each for validation and testing.
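
A sketch of carving out the three sets with two calls to scikit-learn's train_test_split, assuming the 50/25/25 split above; the toy X and y stand in for whatever data set is at hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy placeholder data (a real data set would be used here instead)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)

# Peel off 50% for training, then split the remainder 50/50 into
# validation and test, giving a 50/25/25 split overall.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 100 50 50
```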

  12. Goals in model building What’s the difference between the validation set and the test set? The validation set is used repeatedly on all models. The model selection can chase the randomness in this set. Our selection of the model is based on this set. In a sense, there is over-fitting in terms of this set, and the error rate is under-estimated. The test set should be protected and used only once to obtain an unbiased error rate.

  13. Goals in model building
      In reality, there is usually not enough data. How do people deal with this issue?
      - Eliminate the separate validation set: draw the validation data from the training set, or approximate the generalization error and perform model selection analytically or by sample re-use (AIC, BIC, cross-validation, ...); see the cross-validation sketch below.
      - Sometimes even the test set and the final estimate of prediction error are omitted; the result is published and testing is left to later studies.
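
The cross-validation sketch referenced above (not from the slides): selecting k for a k-nearest-neighbor classifier by 5-fold cross-validation inside the training set alone; the candidate k values and the toy data are arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (made up): the class depends only on the first predictor
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(100, 5))
y_train = (X_train[:, 0] > 0.5).astype(int)

# Model selection by 5-fold cross-validation drawn from the training set:
# pick the k with the best cross-validated accuracy.
candidate_k = [1, 3, 5, 7, 15, 25]
cv_accuracy = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                       X_train, y_train, cv=5).mean()
    for k in candidate_k
}
best_k = max(cv_accuracy, key=cv_accuracy.get)
print(cv_accuracy, "-> chosen k =", best_k)
```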

  14. Bias-variance trade-off
      In the continuous-outcome case, assume Y = f(X) + \varepsilon, with E(\varepsilon) = 0 and Var(\varepsilon) = \sigma^2_\varepsilon. The expected prediction error of a fit \hat{f} at an input point X = x_0 in regression is
      Err(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0]
               = \sigma^2_\varepsilon + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2
               = irreducible error + bias^2 + variance.

  15. Bias-variance trade-off

  16. Bias-variance trade-off
      [Figure: kernel smoother. Green curve: truth. Red curves: estimates based on random samples.]

  17. Bias-variance trade-off
      For the k-nearest-neighbor fit (with the x's assumed fixed, so the randomness is only in y),
      Err(x_0) = E[(Y - \hat{f}_k(x_0))^2 \mid X = x_0] = \sigma^2_\varepsilon + [f(x_0) - \frac{1}{k} \sum_{\ell=1}^{k} f(x_{(\ell)})]^2 + \frac{\sigma^2_\varepsilon}{k},
      where the x_{(\ell)} are the k nearest neighbors of x_0. The larger k is, the lower the model complexity: the estimate becomes more global and the space is partitioned into larger patches. Increasing k decreases the variance term \sigma^2_\varepsilon / k and increases the (squared) bias term.

  18. Bias-variance trade-off
      For a linear model \hat{f}_p(x) = x^T \hat{\beta} with p coefficients fit by least squares,
      Err(x_0) = \sigma^2_\varepsilon + [f(x_0) - E\hat{f}_p(x_0)]^2 + \|h(x_0)\|^2 \sigma^2_\varepsilon,
      where h(x_0) = X(X^T X)^{-1} x_0 is the N-vector of linear weights producing the fit \hat{f}_p(x_0) = h(x_0)^T y. The average of \|h(x_0)\|^2 over the sample values x_i is p/N, so model complexity is directly associated with p.
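
A quick numerical check (not in the slides) of the p/N statement: with an arbitrary design matrix, the average of \|h(x_i)\|^2 over the sample equals p/N, because the sum over the sample is the trace of the hat matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 8
X = rng.normal(size=(N, p))          # arbitrary design matrix for illustration

XtX_inv = np.linalg.inv(X.T @ X)
# h(x0) = X (X^T X)^{-1} x0; collect ||h(x_i)||^2 over the training inputs x_i
h_sq = np.array([np.sum((X @ XtX_inv @ x0) ** 2) for x0 in X])

print(h_sq.mean(), p / N)            # the two numbers agree up to floating point
```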

  19. Bias-variance trade-off

  20. Bias-variance trade-off
      An example: 50 observations, 20 predictors, uniformly distributed in the hypercube [0, 1]^{20}. Y is 0 if X_1 <= 1/2 and 1 if X_1 > 1/2, and k-nearest neighbors is applied. In the figure, red is the prediction error, green the squared bias and blue the variance.
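
A simulation sketch in the spirit of this example (not a reproduction of the original figure): squared-error loss at one fixed test point, with bias and variance estimated over repeated training sets. The test point, the number of repetitions and the use of k-NN regression on the 0/1 response are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n, d, n_reps = 50, 20, 500
x0 = np.full((1, d), 0.25)   # fixed test point (assumed); truth f(x0) = 0 since X1 <= 1/2
f_x0 = 0.0

for k in [1, 3, 7, 15, 30]:
    preds = np.empty(n_reps)
    for r in range(n_reps):
        X = rng.uniform(size=(n, d))           # 50 obs, 20 predictors in [0, 1]^20
        y = (X[:, 0] > 0.5).astype(float)      # Y = 1 if X1 > 1/2, else 0
        preds[r] = KNeighborsRegressor(n_neighbors=k).fit(X, y).predict(x0)[0]
    bias_sq = (preds.mean() - f_x0) ** 2
    variance = preds.var()
    # Y is noiseless here, so prediction error = bias^2 + variance
    print(f"k={k:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}  "
          f"err={bias_sq + variance:.3f}")
```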

  21. Bias-variance trade-off
      An example: 50 observations, 20 predictors, uniformly distributed in the hypercube [0, 1]^{20}. Red: prediction error; green: squared bias; blue: variance.

  22. Stacking
      Increase the model space through ensemble learning:
      - Combining weak learners: bagging, boosting, random forest
      - Combining strong learners: stacking

  23. Stacking
      Strong learner ensembles ("Stacking" and beyond): Current Bioinformatics, 5(4):296-308, 2010.

  24. Stacking
      - Uses cross-validation to assess the individual performance of the candidate prediction algorithms
      - Combines the algorithms to produce an asymptotically optimal combination
      Procedure: for each algorithm, predict each observation in a V-fold cross-validation; find a weight vector \hat{\alpha} = \arg\min_{\alpha} \sum_{i} L(y_i, \sum_k \alpha_k \hat{z}_{ik}) (typically constrained so that \alpha_k \ge 0 and \sum_k \alpha_k = 1), where \hat{z}_{ik} is the cross-validated prediction of algorithm k for observation i; then combine the predictions from the individual algorithms using these weights. See the sketch below. Stat in Med. 34:106–117.
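
The sketch referenced above is an illustration under assumptions, not the exact procedure of the cited papers: V-fold cross-validated predictions from two base regressors, combined with non-negative least-squares weights rescaled to sum to one. The base learners, V = 5 and the toy data are arbitrary choices.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Toy data (made up): a nonlinear signal with noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

learners = [LinearRegression(),
            RandomForestRegressor(n_estimators=100, random_state=0)]

# Step 1: V-fold cross-validated ("out-of-fold") predictions for each learner
Z = np.zeros((len(y), len(learners)))
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    for k, learner in enumerate(learners):
        learner.fit(X[train_idx], y[train_idx])
        Z[val_idx, k] = learner.predict(X[val_idx])

# Step 2: non-negative weights minimizing the squared error of the combination,
# rescaled to sum to one
alpha, _ = nnls(Z, y)
alpha /= alpha.sum()

# Step 3: refit each learner on all the data and combine with the weights
full_preds = np.column_stack([lr.fit(X, y).predict(X) for lr in learners])
ensemble_pred = full_preds @ alpha
print("weights:", alpha)
```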

  25. Stacking Lancet Respir Med. 3(1):42-52
