Linear Regression Fall 2014 The University of Iowa Tianbao Yang
Content • Linear Regression with one variable • Probability Interpretation • Linear Basis Function Models • Optimization • Multiple Outputs • Regularization and Lasso • Bias and Variance Tradeoff • Model Selection
Linear Regression with One Variable • Example: predict house price • Training Data: a set of examples (x_i, y_i), i = 1, ..., n • input (feature) x: size of house • output (target) y: house price • first-order linear regression model: h(x) = w_0 + w_1 x [Figure: house price vs. size, with a fitted line]
Linear Regression with One Variable • How to estimate the model parameters w_0 and w_1? [Figure: price vs. size data with candidate regression lines]
Linear Regression with One Variable • How to estimate the model parameters w_0 and w_1? • Criterion: minimize the error on training data • the loss function ℓ(h(x_i), y_i) measures the error on example i; it is a function of the parameters [Figure: vertical residuals between the data points and the fitted line]
Linear Regression with One Variable • To estimate the model parameters • Criterion: minimize the error on training data • the loss function measures the error • square loss: ℓ(h(x), y) = (h(x) − y)^2 • minimize the sum of all losses: min over (w_0, w_1) of Σ_i (w_0 + w_1 x_i − y_i)^2 • this is Least Square Regression
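A minimal numpy sketch of this least-squares fit; the sizes and prices below are made-up illustrative numbers, not data from the lecture.

```python
# Least-squares fit of h(x) = w_0 + w_1 x, closed form for one variable.
import numpy as np

sizes = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])  # sq. ft. (made up)
prices = np.array([200.0, 280.0, 370.0, 450.0, 540.0])      # $1000s (made up)

# Minimizer of sum_i (w_0 + w_1 x_i - y_i)^2
w1 = np.sum((sizes - sizes.mean()) * (prices - prices.mean())) \
     / np.sum((sizes - sizes.mean()) ** 2)
w0 = prices.mean() - w1 * sizes.mean()
print(f"price ~ {w0:.2f} + {w1:.4f} * size")
```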
Content • Linear Regression with one variable • Probability Interpretation • Linear Basis Function Models • Optimization • Multiple Outputs • Regularization and Lasso • Bias and Variance Tradeoff • Model Selection
Supervised Learning • Training examples: (x_i, y_i), i = 1, ..., n • independent and identically distributed (i.i.d.) assumption • a critical assumption for machine learning theory
Probability Interpretation • Training Data: a set of examples • input (feature) x: size of house • output (target) y: house price • both x and y are random variables • model: y = w_0 + w_1 x + ε, where ε is Gaussian noise, ε ~ N(0, σ^2) [Figure: price vs. size with the noise distribution around the regression line]
Data Likelihood • Training Data: a set of examples (x_i, y_i), i = 1, ..., n • under the Gaussian noise model, each y_i is Gaussian with mean w_0 + w_1 x_i and variance σ^2: p(y_i | x_i) = N(y_i; w_0 + w_1 x_i, σ^2) • by the i.i.d. assumption, the data likelihood factorizes: p(y_1, ..., y_n | x_1, ..., x_n) = Π_i p(y_i | x_i)
Maximum Likelihood Estimation (MLE) • Estimate the model parameters by maximizing the data likelihood • Maximum Likelihood Estimation: (w_0, w_1) = argmax Π_i p(y_i | x_i) = argmax Σ_i log p(y_i | x_i)
MLE is Equivalent to Least Square Regression • the log-likelihood is Σ_i log p(y_i | x_i) = −(n/2) log(2π σ^2) − (1/(2σ^2)) Σ_i (y_i − w_0 − w_1 x_i)^2 • maximizing the likelihood therefore minimizes Σ_i (w_0 + w_1 x_i − y_i)^2 • Least Square Regression IS Maximum Likelihood Estimation (under the Gaussian noise model)
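A small numeric check of the equivalence, on synthetic data with an assumed noise variance σ^2 = 1: the log-likelihood is an affine, decreasing function of the sum of squared errors, so the least-squares fit attains the highest likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(size=50)  # assumed noise std = 1

def log_likelihood(w0, w1, sigma2=1.0):
    # log prod_i N(y_i; w0 + w1 x_i, sigma2)
    sse = np.sum((y - (w0 + w1 * x)) ** 2)
    return -len(x) / 2 * np.log(2 * np.pi * sigma2) - sse / (2 * sigma2)

# Least-squares fit
w1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0_hat = y.mean() - w1_hat * x.mean()

print(log_likelihood(w0_hat, w1_hat))        # highest value
print(log_likelihood(w0_hat + 0.5, w1_hat))  # strictly lower
```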
Content • Linear Regression with one variable • Probability Interpretation • Linear Basis Function Models • Optimization • Multiple Outputs • Regularization and Lasso • Bias and Variance Tradeoff • Model Selection
Linear Basis Function Models • Example: Polynomial Curve Fitting • fit y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M, which is linear in the parameters w even though it is nonlinear in x
Linear Basis Function Models • generally y(x, w) = Σ_{j=0..M−1} w_j φ_j(x) = w^T φ(x) • where the φ_j(x) are known as basis functions • typically φ_0(x) = 1, so that w_0 acts as a bias
Linear Basis Function Models • Polynomial basis functions: φ_j(x) = x^j • these are global; a small change in x affects all basis functions
Linear Basis Function Models • Gaussian basis functions: φ_j(x) = exp(−(x − μ_j)^2 / (2 s^2)) • these are local; a small change in x affects only nearby basis functions • μ_j controls the location and s the scale (width)
Linear Basis Function Models • Sigmoidal basis functions: φ_j(x) = σ((x − μ_j) / s), where σ(a) = 1 / (1 + exp(−a)) • these are local; a small change in x affects only nearby basis functions
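Hypothetical helper functions building the design matrix Φ for the three basis families above; the centers mu and scale s used in the demo are arbitrary illustrative choices, not values from the lecture.

```python
import numpy as np

def polynomial_features(x, degree):
    # Columns x^0, ..., x^degree; the x^0 = 1 column is the bias phi_0.
    return np.vander(x, degree + 1, increasing=True)

def gaussian_features(x, mu, s):
    # One local bump exp(-(x - mu_j)^2 / (2 s^2)) per center, plus a bias column.
    bumps = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), bumps])

def sigmoidal_features(x, mu, s):
    # One local step sigma((x - mu_j) / s) per center, plus a bias column.
    sig = 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))
    return np.hstack([np.ones((len(x), 1)), sig])

x = np.linspace(0, 1, 5)
print(polynomial_features(x, 3).shape)                        # (5, 4)
print(gaussian_features(x, np.linspace(0, 1, 4), 0.2).shape)  # (5, 5)
```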
Linear Regression with Multi-Variables • Example: predict house price • Training Data: a set of examples (x_i, y_i) • input (features) x: size of house, year of house, etc. • output (target) y: house price • model: h(x) = w^T x, with a constant feature x_0 = 1 so that w_0 acts as the bias
Least Square Regression • Minimize Sum of Square Loss: min_w Σ_i (w^T x_i − y_i)^2 = min_w ‖Xw − y‖^2, where the rows of X are the x_i^T and y = (y_1, ..., y_n)^T
Content • Linear Regression with one variable • Probability Interpretation • Linear Basis Function Models • Optimization • Multiple Outputs • Regularization and Lasso • Bias and Variance Tradeoff • Model Selection
Procedures of Machine Learning • A three-step view of machine learning • data collection (and pre-processing) • model building (and analysis) • optimization [Diagram: Data → Model → Optimization]
Optimization • Minimize Sum of Square Loss L(w) = ‖Xw − y‖^2 • Unconstrained Convex Optimization • 1. compute the gradient with respect to (w.r.t.) w: ∇L(w) = 2 X^T (Xw − y)
Optimization • Unconstrained Convex Optimization • 2. set the gradient to zero: X^T X w = X^T y, giving w = (X^T X)^{−1} X^T y (the normal equations)
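A minimal sketch of the closed form on synthetic data; in practice one solves the linear system directly rather than forming the inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])  # bias column + 3 features
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Solve X^T X w = X^T y (normal equations)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```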
Geometry of Least Square • Minimize Sum of Square Loss ‖Xw − y‖^2 • Xw lies in the subspace spanned by the columns of X • the minimizer makes Xw the orthogonal projection of y onto that subspace, i.e. it minimizes the distance between y and the subspace
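A quick numeric check of the projection view, on synthetic data: at the least-squares solution the residual y − Xw is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residual = y - X @ w_hat
print(X.T @ residual)  # ~0 in every coordinate: residual is orthogonal to col(X)
```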
Large-scale Regression • expensive computation: evaluating w = (X^T X)^{−1} X^T y is costly when the number of training data points n and the dimensionality d are both large (roughly O(n d^2) to form X^T X and O(d^3) to solve) • too many features, too many data points
Gradient Descent • Gradient Descent: repeat w ← w − η ∇L(w) = w − 2η X^T (Xw − y) until convergence • η is the step size • each iteration costs O(nd)
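A minimal gradient-descent sketch for the sum-of-squares loss; the step size eta and the iteration count are illustrative, not tuned values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(5)
eta = 0.01
for _ in range(1000):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # average gradient of the square loss
    w -= eta * grad
print(w)  # approaches the least-squares solution
```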
Stochastic Gradient Descent • Stochastic Gradient Descent: repeat, pick an example i uniformly at random and update w ← w − η_t ∇ℓ_i(w) = w − 2 η_t (w^T x_i − y_i) x_i • η_t is the step size, typically decreasing with the iteration t • each iteration costs O(d)
Stochastic Gradient Descent • Stochastic Gradient Descent VS Gradient Descent • GD uses all n examples per update: exact gradient, O(nd) per iteration • SGD uses a single example per update: noisy gradient, O(d) per iteration, so it scales to large data sets but typically needs more iterations and a decaying step size
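A matching SGD sketch: each update touches one random example, so a step costs O(d) instead of O(nd). The 0.1/sqrt(t) step-size schedule is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)                     # pick one example uniformly at random
    grad_i = 2 * (X[i] @ w - y[i]) * X[i]   # gradient of the i-th square loss
    w -= (0.1 / np.sqrt(t)) * grad_i        # decaying step size
print(w)  # noisy but close to the least-squares solution
```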
Content • Linear Regression with one variable • Probability Interpretation • Linear Basis Function Models • Optimization • Multiple Outputs • Regularization and Lasso • Bias and Variance Tradeoff • Model Selection
Multi-task Learning • Predict multiple outputs • Example: predict the current house price and the house price two years from now
Multi-task Learning • predict multiple outputs from the same features • stack the K targets into a matrix Y (one column per task) and the weights into a matrix W: minimize ‖XW − Y‖_F^2 • the problem decouples across tasks, with solution W = (X^T X)^{−1} X^T Y
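A minimal multi-output sketch on synthetic data: np.linalg.lstsq accepts a matrix of targets, so all tasks are solved in one call.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))       # shared features
W_true = rng.normal(size=(4, 2))    # 2 tasks, e.g. price now and price in 2 years
Y = X @ W_true + 0.1 * rng.normal(size=(100, 2))

W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # one column of W per task
print(np.allclose(W_hat, W_true, atol=0.1))    # True
```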
Content • Linear Regression with one variable • Probability Interpretation • Linear Basis Function Models • Optimization • Multiple Outputs • Regularization and Lasso • Bias and Variance Tradeoff • Model Selection
Over-fitting • Root-Mean-Square (RMS) Error: E_RMS = sqrt((1/N) Σ_i (h(x_i) − y_i)^2) • a flexible model can drive the training error toward zero while the test error grows [Figure: training vs. test RMS error as the polynomial degree increases]
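A sketch of the over-fitting pattern on synthetic data (a noisy sine curve, an assumption for illustration): training RMS error keeps falling with the polynomial degree, while test RMS error eventually rises.

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, 10)
y_train = f(x_train) + 0.2 * rng.normal(size=10)
x_test = rng.uniform(0, 1, 100)
y_test = f(x_test) + 0.2 * rng.normal(size=100)

def rms(x, y, coeffs):
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(degree, rms(x_train, y_train, coeffs), rms(x_test, y_test, coeffs))
```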
Avoid Over-fitting: Regularization • Consider the error function: loss term + regularization term, with regularization parameter λ • with the sum-of-squares error function and a quadratic regularizer, we get E(w) = (1/2) Σ_i (w^T x_i − y_i)^2 + (λ/2) ‖w‖^2 • which is minimized by w = (λI + X^T X)^{−1} X^T y • this is Ridge Regression
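A minimal ridge-regression sketch on synthetic data; adding λI shrinks the weights and also keeps the system solvable even when X^T X is ill-conditioned.

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    # Closed form w = (lam*I + X^T X)^{-1} X^T y, via a linear solve
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 10))
y = X[:, 0] + 0.1 * rng.normal(size=30)  # only the first feature matters

print(np.linalg.norm(ridge_fit(X, y, 0.0)))   # unregularized weight norm
print(np.linalg.norm(ridge_fit(X, y, 10.0)))  # shrunk toward zero
```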
Analytical Explanation • See homework
Probability Interpretation • Maximum a Posteriori (MAP) Estimation • Bayes' Theorem: posterior of model ∝ data likelihood × prior of model, i.e. p(w | D) ∝ p(D | w) p(w) • Ridge regression maximizes a posterior distribution
Probability Interpretation • Maximum a Posteriori (MAP) Estimation • the prior distribution is a Gaussian distribution, p(w) = N(0, τ^2 I) • maximizing log p(D | w) + log p(w) then amounts to minimizing Σ_i (w^T x_i − y_i)^2 + (σ^2/τ^2) ‖w‖^2, i.e. ridge regression with λ = σ^2/τ^2
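A numeric check of the MAP-ridge correspondence under the stated Gaussian assumptions: numerically maximizing the log posterior recovers the ridge closed form with λ = σ^2/τ^2. The variance constants and the use of scipy's general-purpose optimizer are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

sigma2, tau2 = 0.25, 1.0   # noise and prior variances (assumed)
lam = sigma2 / tau2        # implied ridge parameter

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + np.sqrt(sigma2) * rng.normal(size=40)

# Negative log posterior (up to constants): SSE/(2 sigma^2) + ||w||^2/(2 tau^2)
neg_log_post = lambda w: (np.sum((y - X @ w) ** 2) / (2 * sigma2)
                          + np.sum(w ** 2) / (2 * tau2))
w_map = minimize(neg_log_post, np.zeros(3)).x

# Ridge closed form with lambda = sigma^2 / tau^2
w_ridge = np.linalg.solve(lam * np.eye(3) + X.T @ X, X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-3))  # True
```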