
Regularization

Jia-Bin Huang, Virginia Tech. ECE-5424G / CS-5824, Spring 2019.


Presentation Transcript


  1. Regularization Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019

  2. Administrative • Women in Data Science Blacksburg • Location: Holtzman Alumni Center • Welcome, 3:30 - 3:40, Assembly hall • Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall • Career Panel, 4:05 - 5:00, Assembly hall • Break, 5:00 - 5:20, Grand hall • Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall • Dinner with breakout discussion groups, 5:45 - 7:00, Museum • Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall • Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room

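  3. k-NN (Classification/Regression) • Model: the training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$ • Cost function: none • Learning: do nothing • Inference: $\hat{y} = h(x^{\text{test}}) = y^{(k)}$, where $k = \arg\min_i D(x^{\text{test}}, x^{(i)})$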

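  4. Linear regression (Regression) • Model: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^\top x$ • Cost function: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ • Learning: 1) Gradient descent: Repeat { $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ } 2) Solving the normal equation: $\theta = (X^\top X)^{-1}X^\top y$ • Inference: $\hat{y} = h_\theta(x^{\text{test}}) = \theta^\top x^{\text{test}}$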

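  5. Naïve Bayes (Classification) • Model: $h_\theta(x) = P(Y \mid X_1, \dots, X_n) \propto P(Y)\prod_i P(X_i \mid Y)$ • Cost function: maximum likelihood estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)$; maximum a posteriori estimation: $J(\theta) = -\log\left[P(\text{Data} \mid \theta)P(\theta)\right]$ • Learning: $\pi_k = P(Y = y_k)$; (discrete $X_i$) $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$; (continuous $X_i$) mean $\mu_{ik}$, variance $\sigma_{ik}^2$, $P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik}^2)$ • Inference: $\hat{Y} = \arg\max_{y_k} P(Y = y_k)\prod_i P(X_i^{\text{test}} \mid Y = y_k)$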

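  6. Logistic regression (Classification) • Model: $h_\theta(x) = P(Y = 1 \mid x) = \frac{1}{1 + e^{-\theta^\top x}}$ • Cost function: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$ • Learning: Gradient descent: Repeat { $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ } • Inference: $\hat{y} = h_\theta(x^{\text{test}}) = \frac{1}{1 + e^{-\theta^\top x^{\text{test}}}}$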

  7. Logistic Regression • Hypothesis representation • Cost function • Logistic regression with gradient descent • Regularization • Multi-class classification

  8. How about MAP? • Maximum conditional likelihood estimate (MCLE) • Maximum conditional a posteriori estimate (MCAP)

  9. Prior • Common choice of $P(\theta)$: • Normal distribution, zero mean, identity covariance • “Pushes” parameters towards zero • Corresponds to regularization • Helps avoid very large weights and overfitting Slide credit: Tom Mitchell
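
The link between a zero-mean Gaussian prior and L2 regularization can be made explicit. The derivation below is a sketch in standard notation: the training pairs $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$, and the prior variance $\sigma^2$ are assumptions not spelled out on the slide (the slide's identity covariance is the case $\sigma^2 = 1$).

```latex
\begin{align*}
\theta_{\text{MCLE}} &= \arg\max_\theta \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta) \\
\theta_{\text{MCAP}} &= \arg\max_\theta P(\theta) \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta) \\
P(\theta) &= \mathcal{N}(0, \sigma^2 I) \propto \exp\left(-\frac{\lVert\theta\rVert_2^2}{2\sigma^2}\right) \\
\theta_{\text{MCAP}} &= \arg\min_\theta \left[ -\sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta) + \frac{1}{2\sigma^2}\lVert\theta\rVert_2^2 \right]
\end{align*}
```

Taking the negative log turns the prior into a squared-norm penalty, so a smaller prior variance corresponds to stronger regularization.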

  10. MLE vs. MAP • Maximum conditional likelihood estimate (MCLE) • Maximum conditional a posteriori estimate (MCAP)

  11. Logistic Regression • Hypothesis representation • Cost function • Logistic regression with gradient descent • Regularization • Multi-class classification

  12. Multi-class classification • Email foldering/tagging: Work, Friends, Family, Hobby • Medical diagnosis: Not ill, Cold, Flu • Weather: Sunny, Cloudy, Rain, Snow Slide credit: Andrew Ng

  13. Figure panels: Binary classification • Multiclass classification

  14. One-vs-all (one-vs-rest) • Figure: the multiclass problem split into one binary problem per class (Class 1, Class 2, Class 3). Slide credit: Andrew Ng

  15. One-vs-all • Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$ • Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$ Slide credit: Andrew Ng
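
The prediction step of one-vs-all can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function names and the parameter values in `thetas` are hypothetical, and the per-class classifiers are assumed to be already trained.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def class_probabilities(thetas, x):
    """Return the output of every per-class classifier for input x.

    thetas: one parameter vector per class (assumed already trained).
    x: feature vector whose first entry is the constant 1 (intercept).
    """
    return [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for theta in thetas]

def predict_one_vs_all(thetas, x):
    """Pick the class index whose classifier reports the highest probability."""
    probs = class_probabilities(thetas, x)
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical parameters for three classes over x = [1, x1, x2].
thetas = [
    [0.0, 2.0, -1.0],   # class 0 prefers large x1
    [0.0, -1.0, 2.0],   # class 1 prefers large x2
    [1.0, -2.0, -2.0],  # class 2 prefers small x1 and x2
]
print(predict_one_vs_all(thetas, [1.0, 3.0, 0.5]))
print(predict_one_vs_all(thetas, [1.0, 0.0, 0.0]))
```

Note that the per-class outputs need not sum to 1, since each classifier is trained independently; only their ranking matters for the prediction.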

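  16. Discriminative approach (e.g., logistic regression): estimate $P(Y \mid X)$ directly (or a discriminant function, e.g., SVM) • Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$ • Generative approach (e.g., Naïve Bayes): estimate $P(Y)$ and $P(X \mid Y)$ • Prediction: $\hat{y} = \arg\max_y P(Y = y)P(X = x \mid Y = y)$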

  17. Further readings • Tom M. Mitchell, "Generative and discriminative classifiers: Naïve Bayes and Logistic Regression," http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf • Andrew Ng and Michael Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf

  18. Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression

  19. Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression

  20. Example: Linear regression • Figure: three fits of price ($ in 1000’s) against size (feet^2), labeled Underfitting, Just right, and Overfitting. Slide credit: Andrew Ng

  21. Overfitting • If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well but fail to generalize to new examples (e.g., predicting prices for new houses). Slide credit: Andrew Ng

  22. Example: Linear regression • Figure: three fits of price ($ in 1000’s) against size (feet^2), labeled Underfitting (high bias), Just right, and Overfitting (high variance). Slide credit: Andrew Ng

  23. Bias-Variance Tradeoff • Bias: difference between what you expect to learn and truth • Measures how well you expect to represent true solution • Decreases with more complex model • Variance: difference between what you expect to learn and what you learn from a particular dataset • Measures how sensitive learner is to specific dataset • Increases with more complex model

  24. Figure: the four combinations of low/high bias and low/high variance

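  25. Bias–variance decomposition • Training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$, with $y = f(x) + \varepsilon$ • We want $\hat{f}(x)$ that minimizes $E\left[(y - \hat{f}(x))^2\right]$ • https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff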
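
For reference, the decomposition derived in the linked article can be written as follows. This is a sketch in that article's notation, which is an assumption here: $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the model learned from a random training set, and expectations are taken over training sets and noise.

```latex
\begin{align*}
\mathbb{E}\left[(y - \hat{f}(x))^2\right] &= \mathrm{Bias}\left[\hat{f}(x)\right]^2 + \mathrm{Var}\left[\hat{f}(x)\right] + \sigma^2 \\
\mathrm{Bias}\left[\hat{f}(x)\right] &= \mathbb{E}\left[\hat{f}(x)\right] - f(x) \\
\mathrm{Var}\left[\hat{f}(x)\right] &= \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]
\end{align*}
```

The $\sigma^2$ term is irreducible noise; model choice only trades off the first two terms.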

  26. Overfitting • Figure: classification examples on Tumor Size and Age axes, with panels including Underfitting and Overfitting. Slide credit: Andrew Ng

  27. Addressing overfitting • size of house • no. of bedrooms • no. of floors • age of house • average income in neighborhood • kitchen size • Figure: price ($ in 1000’s) against size (feet^2). Slide credit: Andrew Ng

  28. Addressing overfitting • 1. Reduce number of features. • Manually select which features to keep. • Model selection algorithm (later in course). • 2. Regularization. • Keep all the features, but reduce magnitude/values of parameters $\theta_j$. • Works well when we have a lot of features, each of which contributes a bit to predicting $y$. Slide credit: Andrew Ng

  29. Overfitting Thriller • https://www.youtube.com/watch?v=DQWI1kvmwRg

  30. Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression

  31. Intuition • Suppose we penalize some parameters $\theta_j$ and make them really small. • Figure: two fits of price ($ in 1000’s) against size (feet^2). Slide credit: Andrew Ng

  32. Regularization • Small values for parameters $\theta_0, \theta_1, \dots, \theta_n$ • “Simpler” hypothesis • Less prone to overfitting • Housing: • Features: $x_1, x_2, \dots, x_n$ • Parameters: $\theta_0, \theta_1, \dots, \theta_n$ Slide credit: Andrew Ng

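  33. Regularization • $J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$ • $\lambda$: regularization parameter • Figure: price ($ in 1000’s) against size (feet^2). Slide credit: Andrew Ng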

  34. Question: What if $\lambda$ is set to an extremely large value? • Algorithm works fine; setting $\lambda$ to be very large can’t hurt it. • Algorithm fails to eliminate overfitting. • Algorithm results in underfitting (fails to fit even the training data well). • Gradient descent will fail to converge. Slide credit: Andrew Ng

  35. Question: What if $\lambda$ is set to an extremely large value? • Figure: price ($ in 1000’s) against size (feet^2). Slide credit: Andrew Ng
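
One way to explore this question numerically is with a single feature, where the regularized least-squares solution has a closed form. The sketch below is an illustration under stated assumptions: the data are made up, the function names are hypothetical, `1e10` is an arbitrary "extremely large" value, and the cost is the usual squared error with an L2 penalty on the slope only.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of
        J = 1/(2m) * [ sum_i (theta0 + theta1*x_i - y_i)^2 + lam * theta1^2 ]
    for a single feature; theta0 (the intercept) is not penalized.
    """
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / (sxx + lam)
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

def training_cost(xs, ys, theta0, theta1):
    """Unregularized squared-error cost, 1/(2m) * sum of squared residuals."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

for lam in (0.0, 1e10):
    t0, t1 = fit_ridge_1d(xs, ys, lam)
    print(lam, t0, t1, training_cost(xs, ys, t0, t1))
```

With the huge penalty the slope is driven to nearly zero and the fit collapses to a flat line near the mean of `ys`, so the training error rises even on data that a line fits perfectly. That is the underfitting option in the question.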

  36. Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression

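  37. Regularized linear regression • $J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$ • $\min_\theta J(\theta)$ • $n$: number of features • $\theta_0$ is not penalized Slide credit: Andrew Ng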

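  38. Gradient descent (Previously) • Repeat { $\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$; $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ for $j = 1, \dots, n$ } Slide credit: Andrew Ng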

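  39. Gradient descent (Regularized) • Repeat { $\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$; $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$ for $j = 1, \dots, n$ } Slide credit: Andrew Ng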

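  40. Comparison • Regularized linear regression: $\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ • Un-regularized linear regression: $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ • $1 - \alpha\frac{\lambda}{m} < 1$: weight decay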
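
The weight-decay form of the update can be sketched for a single feature plus an intercept. This is an illustrative implementation, not the lecture's code: the data, the function name, and the step size and iteration count are assumptions chosen so that the loop converges on this small example.

```python
def ridge_gradient_descent(xs, ys, lam, alpha=0.05, iters=5000):
    """Batch gradient descent for one feature plus intercept.

    theta0 is updated without a penalty; theta1 is first shrunk by the
    factor (1 - alpha * lam / m) and then moved along the data gradient.
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0 = theta0 - alpha * grad0
        theta1 = theta1 * (1 - alpha * lam / m) - alpha * grad1
    return theta0, theta1

# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
plain = ridge_gradient_descent(xs, ys, lam=0.0)
decayed = ridge_gradient_descent(xs, ys, lam=5.0)
print(plain, decayed)
```

Setting `lam=0.0` recovers the un-regularized update, so the two calls differ only in the shrink factor applied to `theta1` at every step.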

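  41. Normal equation • $\theta = \left(X^\top X + \lambda M\right)^{-1}X^\top y$, where $M$ is the $(n+1)\times(n+1)$ identity matrix with its top-left entry set to 0 (so $\theta_0$ is not penalized) Slide credit: Andrew Ng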

  42. Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression

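  43. Regularized logistic regression • Cost function: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ • Figure: decision boundary on Tumor Size and Age axes. Slide credit: Andrew Ng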

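  44. Gradient descent (Regularized) • Repeat { $\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$; $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$ for $j = 1, \dots, n$ }, with $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$ Slide credit: Andrew Ng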
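
A minimal sketch of regularized gradient descent for logistic regression follows. The function names, the made-up one-dimensional data, and the step size and iteration count are assumptions for illustration; the intercept `theta[0]` is left unpenalized, as in regularized linear regression.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lam, alpha=0.1, iters=2000):
    """Regularized logistic regression by batch gradient descent.

    X: list of feature vectors, each starting with the constant 1.
    y: list of 0/1 labels.
    lam: regularization strength; theta[0] is not penalized.
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        preds = [sigmoid(sum(t * xi for t, xi in zip(theta, row))) for row in X]
        new_theta = []
        for j in range(n):
            grad = sum((p - yi) * row[j] for p, yi, row in zip(preds, y, X)) / m
            if j > 0:
                grad += lam / m * theta[j]
            new_theta.append(theta[j] - alpha * grad)
        theta = new_theta  # simultaneous update of all parameters
    return theta

# Made-up, linearly separable 1-D data: label 1 when the feature is positive.
X = [[1.0, -2.0], [1.0, -1.5], [1.0, -1.0], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]
weak = fit_logistic(X, y, lam=0.0)
strong = fit_logistic(X, y, lam=10.0)
print(weak, strong)
```

On separable data the un-regularized weight keeps growing with more iterations, while the penalized run settles at a smaller weight; both still separate this training set.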

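  45. L1: Lasso regularization • $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}|\theta_j|$ • LASSO: Least Absolute Shrinkage and Selection Operator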

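  46. Single predictor: soft thresholding • $\min_\theta \frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right)^2 + \lambda|\theta|$ • For a standardized predictor ($\frac{1}{m}\sum_{i=1}^{m}(x^{(i)})^2 = 1$): $\hat{\theta} = S_\lambda\left(\frac{1}{m}\langle x, y\rangle\right)$ • Soft thresholding operator: $S_\lambda(z) = \mathrm{sign}(z)\left(|z| - \lambda\right)_+$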
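
The soft thresholding operator is short enough to write out directly. The sketch below assumes the standard definition, $S_\lambda(z) = \mathrm{sign}(z)\max(|z| - \lambda, 0)$ with $\lambda \ge 0$; the function name is hypothetical.

```python
def soft_threshold(z, lam):
    """Soft thresholding operator S_lam(z) = sign(z) * max(|z| - lam, 0).

    Assumes lam >= 0. Inputs within [-lam, lam] map to exactly zero,
    which is what produces sparse solutions under an L1 penalty.
    """
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(z, soft_threshold(z, lam=1.0))
```

Values outside the threshold are shrunk toward zero by `lam`, and values inside it are set to zero rather than merely made small.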

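  47. Multiple predictors: cyclic coordinate descent • For each $j$, update $\theta_j$ with $\theta_j \leftarrow S_\lambda\left(\frac{1}{m}\langle x_j, r^{(j)}\rangle\right)$ (standardized predictors), where $r_i^{(j)} = y^{(i)} - \sum_{k \neq j} x_k^{(i)}\theta_k$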
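
Cyclic coordinate descent for the lasso can be sketched as repeated soft-thresholded updates, one coordinate at a time. This is an illustration under stated assumptions: the objective is taken to be $\frac{1}{2m}\sum_i (y_i - \sum_j x_{ij}\theta_j)^2 + \lambda\sum_j|\theta_j|$ with no intercept, the data and function names are made up, and each update divides by the column's mean square so that predictors need not be standardized.

```python
def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0), for lam >= 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Minimize 1/(2m) * sum_i (y_i - sum_j X[i][j]*theta[j])^2 + lam * sum_j |theta[j]|.

    No intercept term is fitted; columns need not be standardized because
    each update divides by the column's mean square.
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(sweeps):
        for j in range(n):
            # Partial residual: leave feature j out of the current prediction.
            rho = 0.0
            for i in range(m):
                pred_without_j = sum(X[i][k] * theta[k] for k in range(n) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= m
            col_mean_sq = sum(X[i][j] ** 2 for i in range(m)) / m
            theta[j] = soft_threshold(rho, lam) / col_mean_sq
    return theta

# Made-up data: y depends on the first feature only (y = 2 * x1).
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
print(lasso_coordinate_descent(X, y, lam=0.0))
print(lasso_coordinate_descent(X, y, lam=0.5))
print(lasso_coordinate_descent(X, y, lam=3.0))
```

As `lam` grows, the coefficient on the informative feature shrinks and eventually reaches exactly zero, while the irrelevant feature stays at zero throughout.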

  48. L1 and L2 balls Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf

  49. Terminology

  50. Things to remember • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression
