Regularization Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019
Administrative
• Women in Data Science Blacksburg
• Location: Holtzman Alumni Center
• Welcome, 3:30 - 3:40, Assembly hall
• Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall
• Career Panel, 4:05 - 5:00, Assembly hall
• Break, 5:00 - 5:20, Grand hall
• Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall
• Dinner with breakout discussion groups, 5:45 - 7:00, Museum
• Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall
• Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room
k-NN (Classification/Regression)
• Model: non-parametric; the stored training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ is the model
• Cost function: None
• Learning: Do nothing
• Inference: $\hat{y} = y^{(i^*)}$, where $i^* = \arg\min_i \|x - x^{(i)}\|$ (for $k > 1$, average or take a majority vote over the $k$ nearest neighbors)
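As a rough illustration (not part of the slides), here is a minimal NumPy sketch of k-NN inference; the function name `knn_predict` and the choice of Euclidean distance are assumptions made for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """k-NN inference: no learning step; just look up the k nearest
    training points and average (regression) or vote (classification)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest neighbors
    return y_train[nearest].mean()                # average label; use a majority vote for classes
```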
Linear regression (Regression)
• Model: $h_\theta(x) = \theta^\top x$
• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Learning:
  1) Gradient descent: Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
  2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
• Inference: $\hat{y} = h_\theta(x) = \theta^\top x$
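A minimal NumPy sketch of the two learning options above, batch gradient descent and the normal equation; the function names and the hyperparameters `alpha` and `iters` are illustrative choices, not part of the slides.

```python
import numpy as np

def fit_linear_regression(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent on the squared-error cost J(theta)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m    # (1/m) * sum of (h(x) - y) * x
        theta -= alpha * grad
    return theta

def fit_normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```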
Naïve Bayes (Classification)
• Model: $P(y \mid x) \propto P(y) \prod_j P(x_j \mid y)$ (features assumed conditionally independent given the class)
• Cost function: Maximum likelihood estimation: $\hat{\theta} = \arg\max_\theta P(\mathcal{D} \mid \theta)$; Maximum a posteriori estimation: $\hat{\theta} = \arg\max_\theta P(\theta \mid \mathcal{D})$
• Learning: (Discrete $x_j$) estimate $P(x_j \mid y)$ from counts; (Continuous $x_j$) estimate per-class mean $\mu_{jy}$ and variance $\sigma_{jy}^2$
• Inference: $\hat{y} = \arg\max_y P(y) \prod_j P(x_j \mid y)$
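A hedged sketch of the continuous (Gaussian) case: MLE learning of class priors plus per-class means and variances, and arg-max inference in log space. The helper names and the small variance floor are assumptions made for the example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """MLE learning: class priors plus per-class, per-feature mean and variance."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means  = {c: X[y == c].mean(axis=0) for c in classes}
    varis  = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}  # floor avoids divide-by-zero
    return classes, priors, means, varis

def predict_gaussian_nb(model, x):
    """Inference: pick the class maximizing log P(y) + sum_j log P(x_j | y)."""
    classes, priors, means, varis = model
    def log_post(c):
        ll = -0.5 * np.sum(np.log(2 * np.pi * varis[c]) + (x - means[c]) ** 2 / varis[c])
        return np.log(priors[c]) + ll
    return max(classes, key=log_post)
```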
Logistic regression (Classification)
• Model: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
• Learning: Gradient descent: Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Inference: predict $y = 1$ if $h_\theta(x) \geq 0.5$, else $y = 0$
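A minimal NumPy sketch of logistic regression with gradient descent; note the update has the same form as linear regression, only the hypothesis changes. Function names and the 0.5 decision threshold are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, iters=1000):
    """Gradient descent on the cross-entropy cost; y is a 0/1 vector."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m  # same form as linear regression, different h
        theta -= alpha * grad
    return theta

def predict(theta, x):
    return int(sigmoid(x @ theta) >= 0.5)          # inference: threshold P(y=1|x) at 0.5
```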
Logistic Regression • Hypothesis representation • Cost function • Logistic regression with gradient descent • Regularization • Multi-class classification
How about MAP?
• Maximum conditional likelihood estimate (MCLE): $\hat{\theta} = \arg\max_\theta \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \theta)$
• Maximum conditional a posteriori estimate (MCAP): $\hat{\theta} = \arg\max_\theta P(\theta) \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \theta)$
Prior
• Common choice of $P(\theta)$:
• Normal distribution, zero mean, identity covariance
• “Pushes” parameters towards zero
• Corresponds to $\ell_2$ regularization
• Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE): $\hat{\theta} = \arg\max_\theta \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \theta)$
• Maximum conditional a posteriori estimate (MCAP): $\hat{\theta} = \arg\max_\theta P(\theta) \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \theta)$
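To make the MCLE/MCAP contrast concrete, here is a sketch of logistic regression trained with a zero-mean Gaussian prior on $\theta$: the log-prior turns into an $\ell_2$ penalty, so the only change from plain MCLE is the extra `(lam / m) * theta` term in the gradient. Penalizing every component (including the intercept) is a simplification made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_map(X, y, lam=1.0, alpha=0.1, iters=1000):
    """MCAP for logistic regression: gradient of the negative log-posterior
    = cross-entropy gradient + (lam/m) * theta from the Gaussian prior."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m + (lam / m) * theta
        theta -= alpha * grad
    return theta
```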
Logistic Regression • Hypothesis representation • Cost function • Logistic regression with gradient descent • Regularization • Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
Slide credit: Andrew Ng
[Figure: binary classification (two classes) vs. multiclass classification (several classes)]
One-vs-all (one-vs-rest)
• Class 1: $h_\theta^{(1)}(x)$
• Class 2: $h_\theta^{(2)}(x)$
• Class 3: $h_\theta^{(3)}(x)$
$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$, $i = 1, 2, 3$
Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
• Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$
Slide credit: Andrew Ng
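A small sketch of the one-vs-all recipe, written generically so any binary trainer can be plugged in; `train_binary` and `score` are placeholders for, e.g., the `fit_logistic_regression` and `sigmoid`-based scorer sketched earlier.

```python
import numpy as np

def one_vs_all_train(X, y, classes, train_binary):
    """Train one binary classifier per class: labels are 1 for that class, 0 otherwise."""
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def one_vs_all_predict(models, x, score):
    """Inference: pick the class whose classifier gives the highest score P(y=c | x)."""
    return max(models, key=lambda c: score(models[c], x))

# Example usage (illustrative): score(theta, x) = sigmoid(x @ theta)
```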
Discriminative Approach
Ex: Logistic regression
Estimate $P(y \mid x)$ directly (or a discriminant function: e.g., SVM)
Prediction: $\hat{y} = \arg\max_y P(y \mid x)$
Generative Approach
Ex: Naïve Bayes
Estimate $P(x \mid y)$ and $P(y)$
Prediction: $\hat{y} = \arg\max_y P(x \mid y)\, P(y)$
Further readings
• Tom M. Mitchell, Generative and discriminative classifiers: Naïve Bayes and Logistic Regression. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng, Michael Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression
Example: Linear regression (housing prices)
[Figure: three fits of price ($, in 1000's) vs. size in feet²: underfitting, just right, overfitting]
Slide credit: Andrew Ng
Overfitting
• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well but fail to generalize to new examples (e.g., predicting prices for new houses).
Slide credit: Andrew Ng
Example: Linear regression (continued)
[Figure: the same three fits; underfitting corresponds to high bias, overfitting to high variance]
Slide credit: Andrew Ng
Bias-Variance Tradeoff
• Bias: difference between what you expect to learn and the truth
• Measures how well you expect to represent the true solution
• Decreases with a more complex model
• Variance: difference between what you expect to learn and what you learn from a particular dataset
• Measures how sensitive the learner is to a specific dataset
• Increases with a more complex model
[Figure: illustration of the four combinations of low/high bias and low/high variance]
Bias–variance decomposition
• Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, with $y = f(x) + \varepsilon$ and noise variance $\sigma^2$
• We want $\hat{f}(x)$ that minimizes the expected squared error $E\left[\left(y - \hat{f}(x)\right)^2\right]$
• $E\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\left(E[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
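An illustrative simulation (not from the slides) that estimates bias² and variance at a single test point by refitting a high-degree polynomial to many fresh noisy training sets; the true function, noise level, sample size, and degree are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # assumed "true" function
x_test, n, sigma, trials, degree = 0.3, 20, 0.3, 500, 9

preds = []
for _ in range(trials):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)       # fresh noisy training set each trial
    coef = np.polyfit(x, y, degree)          # fit a flexible (degree-9) polynomial
    preds.append(np.polyval(coef, x_test))   # its prediction at a fixed test point

preds = np.array(preds)
bias_sq = (preds.mean() - f(x_test)) ** 2    # (E[f_hat(x)] - f(x))^2
variance = preds.var()                       # E[(f_hat(x) - E[f_hat(x)])^2]
print(bias_sq, variance)                     # flexible model: low bias, high variance
```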
Overfitting (classification example)
[Figure: decision boundaries on age vs. tumor size illustrating underfitting and overfitting]
Slide credit: Andrew Ng
Addressing overfitting
• Features: size of house, no. of bedrooms, no. of floors, age of house, average income in neighborhood, kitchen size
[Figure: price ($, in 1000's) vs. size in feet²]
Slide credit: Andrew Ng
Addressing overfitting
• 1. Reduce the number of features.
• Manually select which features to keep.
• Model selection algorithm (later in the course).
• 2. Regularization.
• Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
• Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Slide credit: Andrew Ng
Overfitting Thriller • https://www.youtube.com/watch?v=DQWI1kvmwRg
Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression
Intuition
• Suppose we penalize $\theta_3$ and $\theta_4$ and make them really small: $\min_\theta \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$
[Figure: quadratic vs. high-order polynomial fit of price ($, in 1000's) vs. size in feet²]
Slide credit: Andrew Ng
Regularization
• Small values for parameters $\theta_0, \theta_1, \ldots, \theta_n$
• “Simpler” hypothesis
• Less prone to overfitting
• Housing:
• Features: $x_1, x_2, \ldots, x_{100}$
• Parameters: $\theta_0, \theta_1, \ldots, \theta_{100}$
Slide credit: Andrew Ng
Regularization
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$\lambda$: Regularization parameter
[Figure: regularized fit of price ($, in 1000's) vs. size in feet²]
Slide credit: Andrew Ng
Question
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?
• Algorithm works fine; setting $\lambda$ to be very large can’t hurt it
• Algorithm fails to eliminate overfitting.
• Algorithm results in underfitting. (Fails to fit even the training data well.)
• Gradient descent will fail to converge.
Slide credit: Andrew Ng
Question
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)? The penalty drives all $\theta_j$ (for $j \geq 1$) toward zero, leaving a nearly flat hypothesis $h_\theta(x) \approx \theta_0$, i.e., underfitting.
[Figure: price ($, in 1000's) vs. size in feet² with a flat, underfit line]
Slide credit: Andrew Ng
Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression
Regularized linear regression
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$n$: Number of features; $\theta_0$ is not penalized
Slide credit: Andrew Ng
Gradient descent (Previously)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$, for $j = 1, 2, \ldots, n$
}
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$, for $j = 1, 2, \ldots, n$
}
Slide credit: Andrew Ng
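A minimal NumPy sketch of the regularized update above; it assumes column 0 of `X` is the all-ones intercept feature, and the function name and hyperparameters are illustrative.

```python
import numpy as np

def fit_ridge_gd(X, y, lam=1.0, alpha=0.01, iters=1000):
    """Regularized linear regression by gradient descent.
    theta_0 (the intercept, column 0 of X) is not penalized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lam / m) * theta[1:]   # L2 penalty term, skipping theta_0
        theta -= alpha * grad               # equivalent to shrinking theta_j by (1 - alpha*lam/m)
    return theta
```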
Comparison
Regularized linear regression: $\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Un-regularized linear regression: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
$\left( 1 - \alpha \frac{\lambda}{m} \right)$: Weight decay (shrinks $\theta_j$ slightly on every step)
Normal equation
$\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^\top y$
(the $(n+1) \times (n+1)$ matrix is the identity with the entry for $\theta_0$ set to zero, so the intercept is not penalized)
Slide credit: Andrew Ng
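A small sketch of the regularized normal equation; it again assumes column 0 of `X` is the intercept column, and the function name is illustrative.

```python
import numpy as np

def fit_ridge_normal_eq(X, y, lam=1.0):
    """Regularized normal equation: theta = (X^T X + lam * D)^{-1} X^T y,
    where D is the identity with its (0,0) entry zeroed so theta_0 is not penalized."""
    n = X.shape[1]
    D = np.eye(n)
    D[0, 0] = 0.0                  # assumes column 0 of X is the all-ones intercept
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)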
Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression
Regularized logistic regression
• Cost function: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
[Figure: decision boundary on age vs. tumor size]
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$, for $j = 1, 2, \ldots, n$, now with $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
}
Slide credit: Andrew Ng
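The same sketch adapted to logistic regression: identical update, sigmoid hypothesis. As before, the intercept column and the function name are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_regularized_logistic(X, y, lam=1.0, alpha=0.1, iters=1000):
    """Regularized logistic regression by gradient descent;
    theta_0 (column 0 of X) is not penalized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += (lam / m) * theta[1:]
        theta -= alpha * grad
    return theta
```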
$\ell_1$ (Lasso) regularization: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$
LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding
With a single standardized predictor, the lasso solution is $\hat{\beta} = \mathcal{S}_\lambda\!\left( \frac{1}{N} \langle \mathbf{x}, \mathbf{y} \rangle \right)$
Soft Thresholding operator: $\mathcal{S}_\lambda(x) = \operatorname{sign}(x)\,(|x| - \lambda)_+$
Multiple predictors: Cyclic Coordinate Descent
For each $j$, update $\beta_j \leftarrow \mathcal{S}_\lambda\!\left( \frac{1}{N} \langle \mathbf{x}_j, \mathbf{r}^{(j)} \rangle \right)$, where $r_i^{(j)} = y_i - \sum_{k \neq j} x_{ik} \beta_k$ is the partial residual that leaves out predictor $j$
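A minimal NumPy sketch of cyclic coordinate descent for the lasso, using the soft-thresholding update above; it assumes the columns of `X` are standardized (mean 0, unit variance, so $\frac{1}{N}\|\mathbf{x}_j\|^2 = 1$), and the function names and iteration count are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, iters=100):
    """Cyclic coordinate descent for the lasso with standardized predictors."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]            # partial residual, leaving out x_j
            beta[j] = soft_threshold(X[:, j] @ r_j / N, lam)  # univariate lasso solution
    return beta
```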
L1 and L2 balls Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Things to remember • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression