Regularization Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019
Administrative
• Women in Data Science Blacksburg
• Location: Holtzman Alumni Center
• Welcome, 3:30 - 3:40, Assembly hall
• Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall
• Career Panel, 4:05 - 5:00, Assembly hall
• Break, 5:00 - 5:20, Grand hall
• Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall
• Dinner with breakout discussion groups, 5:45 - 7:00, Museum
• Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall
• Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room
k-NN (Classification/Regression)
• Model: non-parametric; the stored training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ is the model
• Cost function: None
• Learning: Do nothing
• Inference: $\hat{y} = y^{(i^*)}$, where $i^* = \arg\min_i \|x - x^{(i)}\|$ (for $k > 1$, average or take a majority vote over the $k$ nearest neighbors)
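As a rough illustration (not part of the slides), here is a minimal NumPy sketch of k-NN inference; the function name `knn_predict` and the choice of Euclidean distance are assumptions made for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """k-NN inference: no learning step; just look up the k nearest
    training points and average (regression) or vote (classification)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest neighbors
    return y_train[nearest].mean()                # average label; use a majority vote for classes
```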
Linear regression (Regression)
• Model: $h_\theta(x) = \theta^\top x$
• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Learning:
  1) Gradient descent: Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
  2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
• Inference: $\hat{y} = h_\theta(x) = \theta^\top x$
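A minimal NumPy sketch of the two learning options above, batch gradient descent and the normal equation; the function names and the hyperparameters `alpha` and `iters` are illustrative choices, not part of the slides.

```python
import numpy as np

def fit_linear_regression(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent on the squared-error cost J(theta)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m    # (1/m) * sum of (h(x) - y) * x
        theta -= alpha * grad
    return theta

def fit_normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```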
Naïve Bayes (Classification)
• Model: $P(y \mid x) \propto P(y) \prod_j P(x_j \mid y)$ (features assumed conditionally independent given the class)
• Cost function: Maximum likelihood estimation: $\hat{\theta} = \arg\max_\theta P(\mathcal{D} \mid \theta)$; Maximum a posteriori estimation: $\hat{\theta} = \arg\max_\theta P(\theta \mid \mathcal{D})$
• Learning: (Discrete $x_j$) estimate $P(x_j \mid y)$ from counts; (Continuous $x_j$) estimate per-class mean $\mu_{jy}$ and variance $\sigma_{jy}^2$
• Inference: $\hat{y} = \arg\max_y P(y) \prod_j P(x_j \mid y)$
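A hedged sketch of the continuous (Gaussian) case: MLE learning of class priors plus per-class means and variances, and arg-max inference in log space. The helper names and the small variance floor are assumptions made for the example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """MLE learning: class priors plus per-class, per-feature mean and variance."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means  = {c: X[y == c].mean(axis=0) for c in classes}
    varis  = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}  # floor avoids divide-by-zero
    return classes, priors, means, varis

def predict_gaussian_nb(model, x):
    """Inference: pick the class maximizing log P(y) + sum_j log P(x_j | y)."""
    classes, priors, means, varis = model
    def log_post(c):
        ll = -0.5 * np.sum(np.log(2 * np.pi * varis[c]) + (x - means[c]) ** 2 / varis[c])
        return np.log(priors[c]) + ll
    return max(classes, key=log_post)
```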
Logistic regression (Classification)
• Model: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
• Learning: Gradient descent: Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Inference: predict $y = 1$ if $h_\theta(x) \geq 0.5$, else $y = 0$
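A minimal NumPy sketch of logistic regression with gradient descent; note the update has the same form as linear regression, only the hypothesis changes. Function names and the 0.5 decision threshold are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, iters=1000):
    """Gradient descent on the cross-entropy cost; y is a 0/1 vector."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m  # same form as linear regression, different h
        theta -= alpha * grad
    return theta

def predict(theta, x):
    return int(sigmoid(x @ theta) >= 0.5)          # inference: threshold P(y=1|x) at 0.5
```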
Logistic Regression • Hypothesis representation • Cost function • Logistic regression with gradient descent • Regularization • Multi-class classification
How about MAP?
• Maximum conditional likelihood estimate (MCLE): $\hat{\theta} = \arg\max_\theta \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \theta)$
• Maximum conditional a posteriori estimate (MCAP): $\hat{\theta} = \arg\max_\theta P(\theta) \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \theta)$
Prior
• Common choice of $P(\theta)$:
• Normal distribution, zero mean, identity covariance
• “Pushes” parameters towards zero
• Corresponds to $\ell_2$ regularization
• Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE): $\hat{\theta} = \arg\max_\theta \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \theta)$
• Maximum conditional a posteriori estimate (MCAP): $\hat{\theta} = \arg\max_\theta P(\theta) \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \theta)$
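To make the MCLE/MCAP contrast concrete, here is a sketch of logistic regression trained with a zero-mean Gaussian prior on $\theta$: the log-prior turns into an $\ell_2$ penalty, so the only change from plain MCLE is the extra `(lam / m) * theta` term in the gradient. Penalizing every component (including the intercept) is a simplification made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_map(X, y, lam=1.0, alpha=0.1, iters=1000):
    """MCAP for logistic regression: gradient of the negative log-posterior
    = cross-entropy gradient + (lam/m) * theta from the Gaussian prior."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m + (lam / m) * theta
        theta -= alpha * grad
    return theta
```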
Logistic Regression • Hypothesis representation • Cost function • Logistic regression with gradient descent • Regularization • Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
Slide credit: Andrew Ng
[Figure: binary classification (two classes) vs. multiclass classification (several classes)]
One-vs-all (one-vs-rest)
• Class 1: $h_\theta^{(1)}(x)$
• Class 2: $h_\theta^{(2)}(x)$
• Class 3: $h_\theta^{(3)}(x)$
$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$, $i = 1, 2, 3$
Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
• Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$
Slide credit: Andrew Ng
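A small sketch of the one-vs-all recipe, written generically so any binary trainer can be plugged in; `train_binary` and `score` are placeholders for, e.g., the `fit_logistic_regression` and `sigmoid`-based scorer sketched earlier.

```python
import numpy as np

def one_vs_all_train(X, y, classes, train_binary):
    """Train one binary classifier per class: labels are 1 for that class, 0 otherwise."""
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def one_vs_all_predict(models, x, score):
    """Inference: pick the class whose classifier gives the highest score P(y=c | x)."""
    return max(models, key=lambda c: score(models[c], x))

# Example usage (illustrative): score(theta, x) = sigmoid(x @ theta)
```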
Discriminative Approach
Ex: Logistic regression
Estimate $P(y \mid x)$ directly (or a discriminant function: e.g., SVM)
Prediction: $\hat{y} = \arg\max_y P(y \mid x)$
Generative Approach
Ex: Naïve Bayes
Estimate $P(x \mid y)$ and $P(y)$
Prediction: $\hat{y} = \arg\max_y P(x \mid y)\, P(y)$
Further readings
• Tom M. Mitchell, Generative and discriminative classifiers: Naïve Bayes and Logistic Regression. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng, Michael Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression
Example: Linear regression (housing prices)
[Figure: three fits of price ($, in 1000's) vs. size in feet²: underfitting, just right, overfitting]
Slide credit: Andrew Ng
Overfitting
• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well but fail to generalize to new examples (e.g., predicting prices for new houses).
Slide credit: Andrew Ng
Example: Linear regression (continued)
[Figure: the same three fits; underfitting corresponds to high bias, overfitting to high variance]
Slide credit: Andrew Ng
Bias-Variance Tradeoff
• Bias: difference between what you expect to learn and the truth
• Measures how well you expect to represent the true solution
• Decreases with a more complex model
• Variance: difference between what you expect to learn and what you learn from a particular dataset
• Measures how sensitive the learner is to a specific dataset
• Increases with a more complex model
[Figure: illustration of the four combinations of low/high bias and low/high variance]
Bias–variance decomposition
• Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, with $y = f(x) + \varepsilon$ and noise variance $\sigma^2$
• We want $\hat{f}(x)$ that minimizes the expected squared error $E\left[\left(y - \hat{f}(x)\right)^2\right]$
• $E\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\left(E[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
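An illustrative simulation (not from the slides) that estimates bias² and variance at a single test point by refitting a high-degree polynomial to many fresh noisy training sets; the true function, noise level, sample size, and degree are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # assumed "true" function
x_test, n, sigma, trials, degree = 0.3, 20, 0.3, 500, 9

preds = []
for _ in range(trials):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)       # fresh noisy training set each trial
    coef = np.polyfit(x, y, degree)          # fit a flexible (degree-9) polynomial
    preds.append(np.polyval(coef, x_test))   # its prediction at a fixed test point

preds = np.array(preds)
bias_sq = (preds.mean() - f(x_test)) ** 2    # (E[f_hat(x)] - f(x))^2
variance = preds.var()                       # E[(f_hat(x) - E[f_hat(x)])^2]
print(bias_sq, variance)                     # flexible model: low bias, high variance
```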
Overfitting (classification example)
[Figure: decision boundaries on age vs. tumor size illustrating underfitting and overfitting]
Slide credit: Andrew Ng
Addressing overfitting
• Features: size of house, no. of bedrooms, no. of floors, age of house, average income in neighborhood, kitchen size
[Figure: price ($, in 1000's) vs. size in feet²]
Slide credit: Andrew Ng
Addressing overfitting
• 1. Reduce the number of features.
• Manually select which features to keep.
• Model selection algorithm (later in the course).
• 2. Regularization.
• Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
• Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Slide credit: Andrew Ng
Overfitting Thriller • https://www.youtube.com/watch?v=DQWI1kvmwRg
Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression
Intuition
• Suppose we penalize $\theta_3$ and $\theta_4$ and make them really small: $\min_\theta \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$
[Figure: quadratic vs. high-order polynomial fit of price ($, in 1000's) vs. size in feet²]
Slide credit: Andrew Ng
Regularization
• Small values for parameters $\theta_0, \theta_1, \ldots, \theta_n$
• “Simpler” hypothesis
• Less prone to overfitting
• Housing:
• Features: $x_1, x_2, \ldots, x_{100}$
• Parameters: $\theta_0, \theta_1, \ldots, \theta_{100}$
Slide credit: Andrew Ng
Regularization
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$\lambda$: Regularization parameter
[Figure: regularized fit of price ($, in 1000's) vs. size in feet²]
Slide credit: Andrew Ng
Question
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?
• Algorithm works fine; setting $\lambda$ to be very large can’t hurt it
• Algorithm fails to eliminate overfitting.
• Algorithm results in underfitting. (Fails to fit even the training data well.)
• Gradient descent will fail to converge.
Slide credit: Andrew Ng
Question
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)? The penalty drives all $\theta_j$ (for $j \geq 1$) toward zero, leaving a nearly flat hypothesis $h_\theta(x) \approx \theta_0$, i.e., underfitting.
[Figure: price ($, in 1000's) vs. size in feet² with a flat, underfit line]
Slide credit: Andrew Ng
Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression
Regularized linear regression
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$n$: Number of features; $\theta_0$ is not penalized
Slide credit: Andrew Ng
Gradient descent (Previously)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$, for $j = 1, 2, \ldots, n$
}
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$, for $j = 1, 2, \ldots, n$
}
Slide credit: Andrew Ng
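A minimal NumPy sketch of the regularized update above; it assumes column 0 of `X` is the all-ones intercept feature, and the function name and hyperparameters are illustrative.

```python
import numpy as np

def fit_ridge_gd(X, y, lam=1.0, alpha=0.01, iters=1000):
    """Regularized linear regression by gradient descent.
    theta_0 (the intercept, column 0 of X) is not penalized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lam / m) * theta[1:]   # L2 penalty term, skipping theta_0
        theta -= alpha * grad               # equivalent to shrinking theta_j by (1 - alpha*lam/m)
    return theta
```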
Comparison
Regularized linear regression: $\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Un-regularized linear regression: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
$\left( 1 - \alpha \frac{\lambda}{m} \right)$: Weight decay (shrinks $\theta_j$ slightly on every step)
Normal equation
$\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^\top y$
(the $(n+1) \times (n+1)$ matrix is the identity with the entry for $\theta_0$ set to zero, so the intercept is not penalized)
Slide credit: Andrew Ng
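A small sketch of the regularized normal equation; it again assumes column 0 of `X` is the intercept column, and the function name is illustrative.

```python
import numpy as np

def fit_ridge_normal_eq(X, y, lam=1.0):
    """Regularized normal equation: theta = (X^T X + lam * D)^{-1} X^T y,
    where D is the identity with its (0,0) entry zeroed so theta_0 is not penalized."""
    n = X.shape[1]
    D = np.eye(n)
    D[0, 0] = 0.0                  # assumes column 0 of X is the all-ones intercept
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)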
Regularization • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression
Regularized logistic regression
• Cost function: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
[Figure: decision boundary on age vs. tumor size]
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$, for $j = 1, 2, \ldots, n$, now with $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
}
Slide credit: Andrew Ng
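The same sketch adapted to logistic regression: identical update, sigmoid hypothesis. As before, the intercept column and the function name are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_regularized_logistic(X, y, lam=1.0, alpha=0.1, iters=1000):
    """Regularized logistic regression by gradient descent;
    theta_0 (column 0 of X) is not penalized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += (lam / m) * theta[1:]
        theta -= alpha * grad
    return theta
```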
$\ell_1$ (Lasso) regularization: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$
LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding
With a single standardized predictor, the lasso solution is $\hat{\beta} = \mathcal{S}_\lambda\!\left( \frac{1}{N} \langle \mathbf{x}, \mathbf{y} \rangle \right)$
Soft Thresholding operator: $\mathcal{S}_\lambda(x) = \operatorname{sign}(x)\,(|x| - \lambda)_+$
Multiple predictors: Cyclic Coordinate Descent
For each $j$, update $\beta_j \leftarrow \mathcal{S}_\lambda\!\left( \frac{1}{N} \langle \mathbf{x}_j, \mathbf{r}^{(j)} \rangle \right)$, where $r_i^{(j)} = y_i - \sum_{k \neq j} x_{ik} \beta_k$ is the partial residual that leaves out predictor $j$
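A minimal NumPy sketch of cyclic coordinate descent for the lasso, using the soft-thresholding update above; it assumes the columns of `X` are standardized (mean 0, unit variance, so $\frac{1}{N}\|\mathbf{x}_j\|^2 = 1$), and the function names and iteration count are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, iters=100):
    """Cyclic coordinate descent for the lasso with standardized predictors."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]            # partial residual, leaving out x_j
            beta[j] = soft_threshold(X[:, j] @ r_j / N, lam)  # univariate lasso solution
    return beta
```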
L1 and L2 balls Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Things to remember • Overfitting • Cost function • Regularized linear regression • Regularized logistic regression