1 / 27

Machine Learning and Data Mining Linear regression

+. Machine Learning and Data Mining Linear regression. (adapted from) Prof. Alexander Ihler. TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A. Supervised learning. Notation Features x Targets y Predictions ŷ Parameters q.

Download Presentation

Machine Learning and Data Mining Linear regression

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. + Machine Learning and Data MiningLinear regression (adapted from) Prof. Alexander Ihler TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAA

  2. Supervised learning • Notation • Features x • Targets y • Predictions ŷ • Parameters q Learning algorithm Change θ Improve performance Program (“Learner”) Characterized by some “parameters”θ Procedure (using θ) that outputs a prediction Training data (examples) Features Feedback / Target values Score performance (“cost function”)

  3. Linear regression “Predictor”: Evaluate line: return r • Define form of function f(x) explicitly • Find a good f(x) within that family 40 Target y 20 0 0 10 20 Feature x (c) Alexander Ihler

  4. 26 26 24 24 y 22 22 20 20 30 30 40 40 20 20 30 30 x1 20 20 x2 10 10 10 10 0 0 0 0 More dimensions? y x1 x2 (c) Alexander Ihler

  5. Notation Define “feature” x0 = 1 (constant) Then (c) Alexander Ihler

  6. Measuring error Error or “residual” Observation Prediction 0 0 20 (c) Alexander Ihler

  7. Mean squared error • How can we quantify the error? • Could choose something else, of course… • Computationally convenient (more later) • Measures the variance of the residuals • Corresponds to likelihood under Gaussian model of “noise” (c) Alexander Ihler

  8. MSE cost function • Rewrite using matrix form (Matlab) >> e = y’ – th*X’; J = e*e’/m; (c) Alexander Ihler

  9. Visualizing the cost function J(θ) θ 0 θ1 (c) Alexander Ihler

  10. Supervised learning • Notation • Features x • Targets y • Predictions ŷ • Parameters q Learning algorithm Change θ Improve performance Program (“Learner”) Characterized by some “parameters”θ Procedure (using θ) that outputs a prediction Training data (examples) Features Feedback / Target values Score performance (“cost function”)

  11. Finding good parameters • Want to find parameters which minimize our error… • Think of a cost “surface”: error residual for that θ… (c) Alexander Ihler

  12. + Machine Learning and Data MiningLinear regression: direct minimization (adapted from) Prof. Alexander Ihler

  13. MSE Minimum • Consider a simple problem • One feature, two data points • Two unknowns: µ0, µ1 • Two equations: • Can solve this system directly: • However, most of the time, m > n • There may be no linear function that hits all the data exactly • Instead, solve directly for minimum of MSE function (c) Alexander Ihler

  14. SSE Minimum • Reordering, we have • X (XT X)-1 is called the “pseudo-inverse” • If XT is square and independent, this is the inverse • If m > n: overdetermined; gives minimum MSE fit (c) Alexander Ihler

  15. Matlab SSE • This is easy to solve in Matlab… % y = [y1 ; … ; ym] % X = [x1_0 … x1_m ; x2_0 … x2_m ; …] % Solution 1: “manual” th = y’ * X * inv(X’ * X); % Solution 2: “mrdivide” th = y’ / X’; % th*X’ = y => th = y/X’ (c) Alexander Ihler

  16. 5 4 18 3 16 2 14 1 0 162 cost for this one datum Heavy penalty for large errors 12 5 -20 -15 -10 -5 0 10 8 6 4 2 0 0 2 4 6 8 10 12 14 16 18 20 Effects of MSE choice • Sensitivity to outliers (c) Alexander Ihler

  17. L1 error L2, original data 18 L1, original data 16 L1, outlier data 14 12 10 8 6 4 2 0 0 2 4 6 8 10 12 14 16 18 20 (c) Alexander Ihler

  18. Something else entirely… Cost functions for regression (MSE) (MAE) (???) “Arbitrary” functions can’t be solved in closed form… - use gradient descent (c) Alexander Ihler

  19. + Machine Learning and Data MiningLinear regression: nonlinear features (adapted from) Prof. Alexander Ihler

  20. Nonlinear functions • What if our hypotheses are not lines? • Ex: higher-order polynomials (c) Alexander Ihler

  21. Nonlinear functions • Single feature x, predict target y: • Sometimes useful to think of “feature transform” Add features: Linear regression in new features (c) Alexander Ihler

  22. Higher-order polynomials • Fit in the same way • More “features”

  23. Features • In general, can use any features we think are useful • Other information about the problem • Sq. footage, location, age, … • Polynomial functions • Features [1, x, x2, x3, …] • Other functions • 1/x, sqrt(x), x1 * x2, … • “Linear regression” = linear in the parameters • Features we can make as complex as we want! (c) Alexander Ihler

  24. Higher-order polynomials • Are more features better? • “Nested” hypotheses • 2nd order more general than 1st, • 3rd order ““ than 2nd, … • Fits the observed data better

  25. Complex model Overfitting and complexity • More complex models will always fit the training data better • But they may “overfit” the training data, learning complex relationships that are not really present Simple model Y Y X X (c) Alexander Ihler

  26. Test data • After training the model • Go out and get more data from the world • New observations (x,y) • How well does our model perform? Training data New, “test” data (c) Alexander Ihler

  27. Plot MSE as a function of model complexity Polynomial order Decreases More complex function fits training data better What about new data? 30 Training data 25 New, “test” data 20 15 • 0th to 1st order • Error decreases • Underfitting • Higher order • Error increases • Overfitting 10 5 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 Training versus test error Mean squared error Polynomial order (c) Alexander Ihler

More Related