Machine Learning and Data Mining - Linear regression (adapted from) Prof. Alexander Ihler
Supervised learning • Notation: features x, targets y, predictions ŷ, parameters θ • Training data (examples): features and feedback / target values • Program ("learner"): characterized by some parameters θ; a procedure (using θ) that outputs a prediction • Score performance with a "cost function" • Learning algorithm: change θ to improve performance
Linear regression • "Predictor": evaluate the line, return r • Define the form of the function f(x) explicitly • Find a good f(x) within that family [Plot: target y versus feature x, with a fitted line through the data]
More dimensions? [Plot: 3-D view of target y as a planar function of two features x1 and x2]
Notation • Define "feature" x0 = 1 (constant) • Then the predictor is ŷ = θ0 x0 + θ1 x1 + … + θn xn = θ xᵀ, writing θ as a row vector and x as the feature vector (including x0)
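As a minimal Matlab sketch of this convention (the variable names Xraw, th, and yhat are illustrative, not from the slides):

% Xraw : m x (n-1) matrix of raw features, one example per row
% th   : 1 x n row vector of parameters; th(1) multiplies the constant x0 = 1
m = size(Xraw, 1);
X = [ones(m, 1), Xraw];   % prepend the constant feature x0 = 1
yhat = X * th';           % predictions: yhat(i) = th * X(i,:)'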
Measuring error • The error or "residual" is the difference between the observation y(i) and the prediction ŷ(x(i)) [Plot: data points and fitted line, with the vertical gap between one observation and its prediction marked as the residual]
Mean squared error • How can we quantify the error? • Could choose something else, of course… • Computationally convenient (more on this later) • Measures the variance of the residuals • Corresponds to the likelihood under a Gaussian model of the "noise"
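Written out (consistent with the Matlab expression on the next slide, for m training examples), the MSE cost is:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta \, x^{(i)\top} \right)^2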
MSE cost function • Rewrite using matrix form (Matlab):
>> e = y' - th*X';    % residuals, 1 x m row vector
>> J = e*e'/m;        % mean squared error
Visualizing the cost function [Plot: surface / contours of J(θ) over the parameters θ0 and θ1]
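One way to draw such a surface in Matlab (a rough sketch: the grid ranges are arbitrary, and it assumes a single feature plus the constant x0 = 1, so X is m x 2):

t0 = linspace(-10, 10, 50);           % candidate values for theta0
t1 = linspace(-2, 2, 50);             % candidate values for theta1
J = zeros(numel(t0), numel(t1));
for i = 1:numel(t0)
  for j = 1:numel(t1)
    e = y' - [t0(i), t1(j)] * X';     % residuals at this (theta0, theta1)
    J(i, j) = e * e' / length(y);     % MSE at this point on the surface
  end
end
surf(t1, t0, J); xlabel('\theta_1'); ylabel('\theta_0'); zlabel('J(\theta)');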
Supervised learning (recap) • Notation: features x, targets y, predictions ŷ, parameters θ • Training data (examples): features and feedback / target values • Program ("learner"): characterized by some parameters θ; a procedure (using θ) that outputs a prediction • Score performance with a "cost function" • Learning algorithm: change θ to improve performance
Finding good parameters • Want to find parameters θ which minimize our error • Think of a cost "surface": the error residual for each value of θ
Machine Learning and Data Mining - Linear regression: direct minimization (adapted from) Prof. Alexander Ihler
MSE minimum • Consider a simple problem: one feature, two data points • Two unknowns: θ0, θ1 • Two equations, one per data point • Can solve this system directly • However, most of the time m > n: there may be no linear function that hits all the data exactly • Instead, solve directly for the minimum of the MSE function
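For concreteness, a worked version of the two-equation system, using generic points (x⁽¹⁾, y⁽¹⁾) and (x⁽²⁾, y⁽²⁾):

y^{(1)} = \theta_0 + \theta_1 x^{(1)}, \qquad y^{(2)} = \theta_0 + \theta_1 x^{(2)}
\;\;\Rightarrow\;\;
\theta_1 = \frac{y^{(2)} - y^{(1)}}{x^{(2)} - x^{(1)}}, \qquad \theta_0 = y^{(1)} - \theta_1 x^{(1)}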
SSE minimum • Setting the gradient of the MSE to zero and reordering, we have θ = yᵀ X (Xᵀ X)⁻¹ • X (Xᵀ X)⁻¹ is called the "pseudo-inverse" (of Xᵀ) • If Xᵀ is square and full rank, this is exactly the inverse • If m > n: the system is overdetermined; this gives the minimum-MSE fit
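In the row-vector convention used in the Matlab code below (θ is 1 x n, X has one example per row), the derivation is:

\nabla_\theta J = -\frac{2}{m}\,(y^\top - \theta X^\top)\,X = 0
\;\;\Rightarrow\;\;
\theta\, X^\top X = y^\top X
\;\;\Rightarrow\;\;
\theta = y^\top X \,(X^\top X)^{-1}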
Matlab SSE • This is easy to solve in Matlab…
% y = [y1 ; … ; ym]                       (m x 1 target vector)
% X = [x1_0 … x1_n ; x2_0 … x2_n ; …]     (m x n, one example per row, first column x0 = 1)
% Solution 1: "manual"
th = y' * X * inv(X' * X);
% Solution 2: "mrdivide"
th = y' / X';    % solves th*X' = y'  =>  th = y'/X'
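A small end-to-end sketch (the data values here are made up purely for illustration):

x  = [1; 2; 3; 4];                 % raw feature, one example per row
y  = [2.1; 3.9; 6.2; 8.1];         % targets
X  = [ones(size(x)), x];           % add the constant feature x0 = 1
th = y' / X';                      % least-squares fit:  th*X' ~ y'
yhat = (th * X')';                 % predictions at the training points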
Effects of MSE choice • Sensitivity to outliers • Heavy penalty for large errors (e.g. 16² cost for this one datum) [Plots: a fit pulled by a single outlying datum, and the quadratic cost as a function of the residual]
L1 error [Plot: fits compared on the same data; legend: L2 on the original data, L1 on the original data, L1 on the data with an outlier]
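A quick numerical illustration of the difference (values invented for illustration): a single outlying residual dominates the squared-error cost but not the absolute-error cost.

r = [0.5, 1, 2, 15];     % residuals; the last one is an outlier
sum(r.^2)                % squared (L2) cost: the outlier contributes 225 of ~230
sum(abs(r))              % absolute (L1) cost: the outlier contributes only 15 of 18.5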
Something else entirely… • Cost functions for regression: (1/m) Σ (y − ŷ)² (MSE), (1/m) Σ |y − ŷ| (MAE), or something else entirely (???) • "Arbitrary" cost functions generally can't be minimized in closed form; use gradient descent instead
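A basic gradient descent loop for the MSE cost in Matlab (a sketch; the step size and iteration count are arbitrary, and other differentiable costs just swap in their own gradient):

alpha = 0.01;                        % step size (learning rate)
th = zeros(1, size(X, 2));           % initialize parameters
for iter = 1:1000
  e = y' - th * X';                  % residuals, 1 x m
  grad = -(2 / length(y)) * e * X;   % gradient of the MSE w.r.t. th, 1 x n
  th = th - alpha * grad;            % take a step downhill
end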
Machine Learning and Data Mining - Linear regression: nonlinear features (adapted from) Prof. Alexander Ihler
Nonlinear functions • What if our hypotheses are not lines? • Ex: higher-order polynomials
Nonlinear functions • Single feature x, predict target y • Sometimes useful to think of a "feature transform": add features Φ(x) = [1, x, x², x³, …], then do linear regression in the new features
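A sketch of the transform for a cubic fit (degree chosen arbitrarily), reusing the least-squares solution from before:

Phi  = [ones(size(x)), x, x.^2, x.^3];   % transformed features [1, x, x^2, x^3]
th   = y' / Phi';                        % linear regression in the new features
yhat = (th * Phi')';                     % still linear in the parameters th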
Higher-order polynomials • Fit in the same way • More “features”
Features • In general, we can use any features we think are useful • Other information about the problem: square footage, location, age, … • Polynomial functions: features [1, x, x², x³, …] • Other functions: 1/x, sqrt(x), x1 * x2, … • "Linear regression" = linear in the parameters; the features themselves can be as complex as we want
Higher-order polynomials • Are more features better? • "Nested" hypotheses: 2nd order is more general than 1st, 3rd order more general than 2nd, … • A higher-order model always fits the observed (training) data at least as well
Overfitting and complexity • More complex models will always fit the training data better • But they may "overfit" the training data, learning complex relationships that are not really present [Plots: a simple model and a complex model fit to the same (x, y) data]
Test data • After training the model, go out and get more data from the world: new observations (x, y) • How well does our model perform on the new, "test" data compared with the training data?
Training versus test error • Plot MSE as a function of model complexity (polynomial order) • Training error decreases: a more complex function fits the training data better • What about new data? • 0th to 1st order: error decreases (underfitting) • Higher order: test error increases (overfitting) [Plot: mean squared error versus polynomial order for the training data and for new "test" data]
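A sketch of how such a curve can be produced (assuming training data x, y and separate test data xt, yt are already available as column vectors; uses Matlab's implicit expansion to build the polynomial features):

maxd = 6;                                  % highest polynomial order to try
trainErr = zeros(1, maxd);  testErr = zeros(1, maxd);
for d = 1:maxd
  Phi  = x(:)  .^ (0:d);                   % training features [1, x, ..., x^d]
  Phit = xt(:) .^ (0:d);                   % same transform applied to test data
  th = y' / Phi';                          % fit on the training data only
  trainErr(d) = mean((y'  - th * Phi' ).^2);
  testErr(d)  = mean((yt' - th * Phit').^2);
end
plot(1:maxd, trainErr, 1:maxd, testErr);  legend('training error', 'test error');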