Regression
AMCS/CS 340: Data Mining
Xiangliang Zhang
King Abdullah University of Science and Technology
Outline
• What is regression?
• Minimizing sum-of-squares error
• Linear regression
• Nonlinear regression
• Statistical models for regression
• Overfitting and cross validation for regression
Classification (reminder): X → Y
• X can be anything:
  • continuous (ℝ, ℝ^d, …)
  • discrete ({0,1}, {1,…,k}, …)
  • structured (tree, string, …)
  • …
• Y is discrete:
  • {0,1}: binary
  • {1,…,k}: multi-class
  • tree, etc.: structured
Regression: X → Y
• X can be anything:
  • continuous (ℝ, ℝ^d, …)
  • discrete ({0,1}, {1,…,k}, …)
  • structured (tree, string, …)
  • …
• Y is continuous: ℝ, ℝ^d
Examples
• Data → Prediction or forecast:
  • Processes, memory → Power consumption
  • Protein structure → Energy
  • Heart-beat rate, age, speed, duration → Fat
  • Oil supply, consumption, etc. → Oil price
  • …
Linear regression
[3-D scatter plot: temperature as a function of two input variables]
Given examples (x_i, y_i), i = 1, …, n, and a new point x, predict its value y.
Linear regression
[3-D plot: plane fitted to the temperature data, with predictions shown for new points]
Outline
• What is regression?
• Minimizing sum-of-squares error
• Linear regression
• Nonlinear regression
• Statistical models for regression
• Overfitting and cross validation for regression
Sum-of-Squares Error Function
Error or "residual": the difference between the observation y_n and the prediction f(x_n, w).
Fit the model by minimizing the sum-of-squares error
E(w) = (1/2) Σ_{n=1}^{N} ( f(x_n, w) − y_n )²
Sum-of-Squares Error Function
Minimize E(w) to find w*:
• E(w) is a quadratic function of w
• so the derivative of E(w) w.r.t. w is linear in w
• setting the derivative to zero therefore gives a unique solution minimizing E(w)
Least Squares
[plot: data points, fitted line, and the vertical residuals between them]
Error or "residual": the gap between observation y_n and prediction f(x_n, w).
Estimate the unknown parameter w by minimizing the sum squared error E(w) = (1/2) Σ_{n=1}^{N} ( f(x_n, w) − y_n )².
Minimize the sum squared error
For the linear model f(x, w) = w^T x, setting the derivative of the sum squared error to zero,
∂E/∂w = Σ_{n=1}^{N} (w^T x_n − y_n) x_n = 0,
gives the closed-form solution w* = (X^T X)^{−1} X^T y, where X stacks the x_n as rows.
Predict: ŷ = w*^T x.
http://www.lri.fr/~xlzhang/KAUST/CS340_slides/linear_regression_demo1.m
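The linked file is a MATLAB demo; as a rough illustration of the same closed-form fit, here is a minimal Python sketch (the synthetic data and variable names are my own, not taken from the demo):

```python
import numpy as np

# Synthetic 1-D data: y = 2x + 1 plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)

# Design matrix with a constant column so w[0] acts as the bias
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution w* = (X^T X)^{-1} X^T y
# (np.linalg.lstsq solves the same normal equations, but more stably)
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict at a new point
x_new = 4.0
y_hat = w[0] + w[1] * x_new
print(f"w = {w}, prediction at x = {x_new}: {y_hat:.2f}")
```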
Linear regression models
Generally,
f(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x),
where the φ_j(x) are known as basis functions. Typically φ_0(x) = 1, so that w_0 acts as a bias.
e.g. φ(x) = [1, x, x², x³]^T
Linear regression models
Example: polynomial curve fitting,
f(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M
Basis Functions
• Polynomial basis functions: φ_j(x) = x^j. These are global; a small change in x affects all basis functions.
• Gaussian basis functions: φ_j(x) = exp( −(x − μ_j)² / (2s²) )
• Sigmoidal basis functions: φ_j(x) = σ( (x − μ_j) / s ), where σ(a) = 1 / (1 + e^{−a})
Gaussian and sigmoidal basis functions are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width or slope).
Minimize the sum squared error
With basis functions, f(x, w) = w^T φ(x), and setting ∂E/∂w = 0 gives
w* = (Φ^T Φ)^{−1} Φ^T y,
where Φ is the design matrix whose n-th row is φ(x_n)^T.
Predict: ŷ = w*^T φ(x).
http://www.lri.fr/~xlzhang/KAUST/CS340_slides/linear_regression_demo2.m
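Again the linked demo is MATLAB; a minimal Python sketch of a basis-function fit follows (the polynomial basis, degree, and noisy-sine data are illustrative choices, not necessarily what the demo uses):

```python
import numpy as np

def design_matrix(x, degree):
    """Polynomial basis: columns are x^0, x^1, ..., x^degree."""
    return np.column_stack([x**j for j in range(degree + 1)])

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, size=30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)  # noisy sine

Phi = design_matrix(x, degree=3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # w* = (Phi^T Phi)^{-1} Phi^T y

# Predict at a new point by evaluating the same basis functions there
y_hat = design_matrix(np.array([0.25]), 3) @ w
print("prediction at x = 0.25:", y_hat[0])
```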
Batch gradient descent algorithm (1)
If the derivative of E(w) w.r.t. w is not linear in w, there is no closed-form solution; instead, minimize E(w) iteratively with batch gradient descent.
Batch gradient descent algorithm (2)
Batch gradient descent example: update the value of w according to the gradient, stepping downhill,
w ← w − η ∇E(w),
where η is the learning rate.
Batch gradient descent algorithm (3)
Batch gradient descent example: the updates iteratively approach the minimum of the error function.
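A minimal sketch of batch gradient descent on the sum-of-squares error (the learning rate, iteration count, and data are illustrative choices, not values from the lecture):

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_iters=5000):
    """Minimize E(w) = 0.5 * sum((X @ w - y)^2) by batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)  # gradient of E over the whole batch
        w -= lr * grad            # step against the gradient
    return w

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=100)
print(batch_gradient_descent(X, y))  # converges to roughly [1.0, 2.0]
```

For a linear model this of course recovers the same answer as the closed form; gradient descent earns its keep when ∇E(w) is nonlinear in w.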
Statistical models for regression
• Machine learning paradigms
  • Regression tree (CART)
  • Neural networks
  • Support vector machine
• Strength
  • flexibility within a nonparametric, assumption-free framework that accommodates big data
• Weakness
  • difficulty in interpreting the effects of each covariate
Regression Tree
[figure: a decision tree that splits on input attributes; each leaf predicts a target value]
http://chem-eng.utoronto.ca/~datamining/dmc/decision_tree_reg.htm
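As an illustration (not from the slides), a regression tree can be fit with scikit-learn's DecisionTreeRegressor; the data and depth here are arbitrary:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3)  # shallow tree: piecewise-constant fit
tree.fit(X, y)
print(tree.predict([[2.5]]))  # mean target value of the leaf containing x = 2.5
```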
SVM Regression
Given training data (x_1, y_1), …, (x_n, y_n).
Find f(x) = w^T x + b that optimally describes the data: minimize
(1/2) ‖w‖² + C Σ_{i=1}^{n} (ξ_i + ξ_i*)   (flatness + sum of errors)
Subject to:
y_i − (w^T x_i + b) ≤ ε + ξ_i
(w^T x_i + b) − y_i ≤ ε + ξ_i*
ξ_i, ξ_i* ≥ 0
[figure: ε-tube around the fitted function; the points on or outside the tube are the "support vectors"]
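A hedged sketch of SVM regression using scikit-learn's SVR (kernel, C, and ε here are illustrative defaults, not the lecture's settings):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

# epsilon sets the width of the error-free tube; C trades flatness vs. errors
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X, y)

print("number of support vectors:", len(svr.support_))
print("prediction at x = 2.5:", svr.predict([[2.5]]))
```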
NN Regression
Train the network weights w to minimize the sum-of-squares error
E(w) = (1/2) Σ_{n=1}^{N} ( f(x_n, w) − y_n )²,
where f(x_n, w) is the network's output for input x_n.
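A minimal sketch with scikit-learn's MLPRegressor, whose default loss is the squared error above (architecture and data are my own illustrative choices):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 5, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# One small hidden layer, trained by gradient-based minimization of E(w)
nn = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=0)
nn.fit(X, y)
print("prediction at x = 2.5:", nn.predict([[2.5]]))
```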
Outline
• What is regression?
• Minimizing sum-of-squares error
• Overfitting and cross validation for regression
Regression overfitting
[figure: several candidate fits of increasing complexity to the same data]
• Which one is the best?
• The one with the best fit to the data?
• How well is it going to predict future data drawn from the same distribution?
Outline
• What is regression?
• Minimizing sum-of-squares error
• Overfitting and cross validation for regression
  • Test set method
  • Leave-one-out
  • K-fold cross validation
The test set method
1. Randomly choose about 30% of the data to be the test set.
2. The remainder is the training set.
3. Fit the regression model on the training set only.
4. Estimate future performance by the error (e.g. MSE) on the test set.
The test set method
• Pros
  • very simple
  • can then simply choose the method with the best test-set score
• Cons
  • wastes data: we get an estimate of the best method to apply to 30% less data
  • if we don't have much data, our test set might just be lucky or unlucky ("the test-set estimator of performance has high variance")
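A minimal Python sketch of the test set method (the 70/30 split follows the slide; the data and linear model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=100)

# Randomly hold out 30% of the points as a test set
idx = rng.permutation(100)
train, test = idx[:70], idx[70:]

# Fit a line on the training set only
Phi = lambda x: np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(Phi(x[train]), y[train], rcond=None)

# Estimate future performance by MSE on the held-out test set
mse_test = np.mean((Phi(x[test]) @ w - y[test]) ** 2)
print(f"test-set MSE: {mse_test:.3f}")
```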
Outline
• What is regression?
• Minimizing sum-of-squares error
• Overfitting and cross validation for regression
  • Test set method
  • Leave-one-out
  • K-fold cross validation
LOOCV (Leave-one-out Cross Validation)
[figure, built up over the next slides: for each data point in turn, hold that point out, fit on the remaining points, and measure the error at the held-out point]
LOOCV (Leave-one-out Cross Validation)
When you have done all points, report the mean error.
LOOCV for Linear Regression
For k = 1 to N:
1. Let (x_k, y_k) be the k-th record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining N−1 data points.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.
MSE_LOOCV = 2.12
LOOCV for Quadratic Regression
For k = 1 to N:
1. Let (x_k, y_k) be the k-th record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining N−1 data points.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.
MSE_LOOCV = 0.962
LOOCV for Join the Dots
For k = 1 to N:
1. Let (x_k, y_k) be the k-th record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining N−1 data points.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.
MSE_LOOCV = 3.33
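A sketch of the LOOCV loop above for linear regression (the data here are illustrative; the MSE values on the slides come from the lecture's own dataset):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=20)
y = 2 * x + rng.normal(scale=0.3, size=20)
N = len(x)

Phi = lambda x: np.column_stack([np.ones_like(x), x])

errors = []
for k in range(N):
    mask = np.arange(N) != k                      # hold out the k-th record
    w, *_ = np.linalg.lstsq(Phi(x[mask]), y[mask], rcond=None)
    y_hat = w[0] + w[1] * x[k]                    # predict the held-out point
    errors.append((y_hat - y[k]) ** 2)

print(f"MSE_LOOCV = {np.mean(errors):.3f}")       # mean error over all N folds
```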
Which validation method should we use?
k-fold cross validation gets the best of both worlds.
Outline
• What is regression?
• Minimizing sum-of-squares error
• Overfitting and cross validation for regression
  • Test set method
  • Leave-one-out
  • K-fold cross validation
k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored Red, Green and Blue).
For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points.
k-fold Cross Validation
Then report the mean error:
• Linear Regression: MSE_3FOLD = 2.05
• Quadratic Regression: MSE_3FOLD = 1.11
• Join the Dots: MSE_3FOLD = 2.93
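A minimal k-fold sketch matching the procedure above, with k = 3 as in the example (data and model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 1, size=30)
y = 2 * x + rng.normal(scale=0.3, size=30)

Phi = lambda x: np.column_stack([np.ones_like(x), x])

k = 3
folds = np.array_split(rng.permutation(len(x)), k)  # random partition into k folds

errors = []
for test_idx in folds:
    train_mask = ~np.isin(np.arange(len(x)), test_idx)  # train on the other folds
    w, *_ = np.linalg.lstsq(Phi(x[train_mask]), y[train_mask], rcond=None)
    errors.extend((Phi(x[test_idx]) @ w - y[test_idx]) ** 2)

print(f"MSE_3FOLD = {np.mean(errors):.3f}")  # mean error over all held-out points
```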
CV-based Regression Algorithm Choice
• Choosing which regression algorithm to use:
• Step 1: compute the 10-fold-CV error for six different models.
• Step 2: whichever algorithm gave the best CV score: train it with all the data, and that's the predictive model you'll use.
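A sketch of this selection procedure with scikit-learn (the three candidate models here are my own illustrative picks, not the six from the lecture):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=3),
    "svr": SVR(kernel='rbf'),
}

# Step 1: 10-fold-CV error for each candidate model
scores = {name: -cross_val_score(m, X, y, cv=10,
                                 scoring='neg_mean_squared_error').mean()
          for name, m in candidates.items()}
print(scores)

# Step 2: refit the best-scoring model on ALL the data
best = min(scores, key=scores.get)
final_model = candidates[best].fit(X, y)
print("chosen model:", best)
```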