Summer Course: Data Mining
Regression Analysis
Presenter: Georgi Nalbantov
August 2009
Structure
• Regression analysis: definition and examples
• Classical Linear Regression
• LASSO and Ridge Regression (linear and nonlinear)
• Nonparametric (local) regression estimation: kNN for regression, Decision trees, Smoothers
• Support Vector Regression (linear and nonlinear)
• Variable/feature selection (AIC, BIC, R^2-adjusted)
Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
Common Data Mining tasks
[Three scatter plots illustrating clustering, classification (+/- classes), and regression, with axes X1 and X2]
• Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA
• Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS
• Regression: Classical Linear Regression, Ridge Regression, NN, CART
The Regression task
• Given data on n explanatory variables and 1 explained variable, where the explained variable takes real values in $\mathbb{R}$, find a function that gives the "best" fit:
• Given: $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^n \times \mathbb{R}$
• Find: $f: \mathbb{R}^n \to \mathbb{R}$
• "Best function" = the function whose expected error on unseen data $(x_{m+1}, y_{m+1}), \ldots, (x_{m+k}, y_{m+k})$ is minimal
Classical Linear Regression (OLS)
• Explanatory and response variables are numeric
• The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (a straight line)
• Model: $y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon$ is a random error term
• $\beta_1 > 0$: positive association
• $\beta_1 < 0$: negative association
• $\beta_1 = 0$: no association
Classical Linear Regression (OLS)
• $\beta_0$: mean response when x = 0 (y-intercept)
• $\beta_1$: change in mean response when x increases by 1 unit (slope)
• $\beta_0, \beta_1$ are unknown population parameters (like $\mu$)
• $\beta_0 + \beta_1 x$: mean response when the explanatory variable takes on the value x
• Task: minimize the sum of squared errors: $\min_{\beta_0, \beta_1} \sum_{i=1}^{m} (y_i - \beta_0 - \beta_1 x_i)^2$
Classical Linear Regression (OLS)
• Parameter: slope in the population model ($\beta_1$)
• Estimator: least squares estimate: $\hat{\beta}_1 = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2}$
• Estimated standard error: $\widehat{se}(\hat{\beta}_1) = \frac{s}{\sqrt{\sum_{i=1}^{m}(x_i - \bar{x})^2}}$, where $s^2 = \frac{\sum_{i=1}^{m} \hat{e}_i^2}{m-2}$
• Methods of making inference regarding the population:
• Hypothesis tests (2-sided or 1-sided)
• Confidence intervals
Classical Linear Regression (OLS)
• Coefficient of determination ($r^2$): proportion of variation in y "explained" by the regression on x:
$r^2 = 1 - \frac{SSE}{SST}$, where $SST = \sum_{i=1}^{m}(y_i - \bar{y})^2$ and $SSE = \sum_{i=1}^{m}(y_i - \hat{y}_i)^2$
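As a minimal illustration of these formulas (not from the original slides), the least-squares estimates, the slope's standard error, and $r^2$ can be computed directly with numpy; the toy data are made up:

```python
# Minimal numpy sketch of the OLS formulas above; the data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
m = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Slope and intercept from the least-squares formulas
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

residuals = y - (b0 + b1 * x)
s2 = np.sum(residuals ** 2) / (m - 2)            # error variance estimate
se_b1 = np.sqrt(s2 / np.sum((x - x_bar) ** 2))   # standard error of the slope

sse = np.sum(residuals ** 2)
sst = np.sum((y - y_bar) ** 2)
r2 = 1 - sse / sst                               # coefficient of determination

print(f"beta0={b0:.3f}, beta1={b1:.3f}, se(beta1)={se_b1:.3f}, r^2={r2:.3f}")
```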
Classical Linear Regression (OLS): Multiple regression
• Numeric response variable (y)
• p numeric predictor variables
• Model: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$
• Partial regression coefficients: $\beta_i$ is the effect (on the mean response) of increasing the i-th predictor variable by 1 unit, holding all other predictors constant
Classical Linear Regression (OLS): Ordinary Least Squares estimation
• Population model for the mean response: $E[y \mid x_1, \ldots, x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
• Least squares fitted (predicted) equation, minimizing SSE: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$, with $SSE = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$
Classical Linear Regression (OLS): Ordinary Least Squares estimation
• Model: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$
• OLS estimation: $\min_{\beta} \sum_{i=1}^{m} \bigl(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\bigr)^2$
• LASSO estimation: minimize the same sum of squares subject to $\sum_{j=1}^{p} |\beta_j| \le s$
• Ridge regression estimation: minimize the same sum of squares subject to $\sum_{j=1}^{p} \beta_j^2 \le s$
LASSO and Ridge estimation of model coefficients
[Coefficient-path plots: the estimated coefficients traced against the constraint bound sum(|beta_j|)]
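As an illustrative sketch (not from the slides), the two estimators can be compared with scikit-learn, which implements the equivalent penalised forms with a weight alpha on the penalty rather than an explicit bound s; the data here are synthetic:

```python
# Sketch: LASSO vs. ridge coefficients as the penalty weight alpha varies.
# scikit-learn and the synthetic data are assumptions, not from the slides.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])    # two irrelevant predictors
y = X @ beta_true + rng.normal(scale=0.5, size=100)

for alpha in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)            # penalises sum |beta_j|
    ridge = Ridge(alpha=alpha).fit(X, y)            # penalises sum beta_j^2
    print(f"alpha={alpha}:")
    print("  lasso:", np.round(lasso.coef_, 2))     # some coefs hit exactly 0
    print("  ridge:", np.round(ridge.coef_, 2))     # coefs shrink but stay nonzero
```

Note how the LASSO sets some coefficients exactly to zero (implicit variable selection), while ridge only shrinks them toward zero.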
Nonparametric (local) regression estimation: k-NN, Decision trees, smoothers
Nonparametric (local) regression estimation: How to choose k or h?
• When k or h is small, single instances matter; bias is small, variance is large (undersmoothing): high complexity
• As k or h increases, we average over more instances; variance decreases but bias increases (oversmoothing): low complexity
• Cross-validation is used to fine-tune k or h, as in the sketch below
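A minimal sketch of this recipe for k-NN regression, assuming scikit-learn and synthetic data (neither is prescribed by the slides):

```python
# Sketch: choose k for k-NN regression by 5-fold cross-validation.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)  # noisy sine curve

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 31))},  # candidate values of k
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
# Small k -> low bias / high variance; large k -> high bias / low variance.
```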
Linear Support Vector Regression
[Three scatter plots of Expenditures vs. Age with fitted "tubes" of decreasing width: "Suspiciously smart case" (overfitting, smallest area), "Compromise case", SVR (good generalisation, middle-sized area), and "Lazy case" (underfitting, biggest area); the points on the tube's boundary are the "support vectors"]
• The thinner the "tube", the more complex the model
Nonlinear Support Vector Regression
• Map the data into a higher-dimensional space:
[Scatter plot of Expenditures vs. Age with a nonlinear fitted curve]
Nonlinear Support Vector Regression: Technicalities
• The SVR function: $f(x) = w^\top \phi(x) + b$
• To find the unknown parameters of the SVR function, solve: $\min_{w, b, \xi, \xi^*} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$
Subject to: $y_i - w^\top \phi(x_i) - b \le \varepsilon + \xi_i$, $\ w^\top \phi(x_i) + b - y_i \le \varepsilon + \xi_i^*$, $\ \xi_i, \xi_i^* \ge 0$
• How to choose the mapping $\phi$? Use a kernel, e.g. the RBF kernel: $k(x, x') = \phi(x)^\top \phi(x') = \exp(-\gamma \|x - x'\|^2)$
• Find $C$, $\varepsilon$, and $\gamma$ from a cross-validation procedure
SVR Technicalities: Model Selection
• Do 5-fold cross-validation to find $C$ and $\gamma$ for several fixed values of $\varepsilon$, as sketched below
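A hedged sketch of this procedure with scikit-learn's SVR; the library, grids, and data are illustrative assumptions, not taken from the slides:

```python
# Sketch: for several fixed epsilon values, 5-fold cross-validate over C and gamma.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(150, 1))
y = np.sinc(X).ravel() + rng.normal(scale=0.1, size=150)  # noisy nonlinear target

for eps in [0.01, 0.1, 0.5]:                      # fixed values of epsilon
    search = GridSearchCV(
        SVR(kernel="rbf", epsilon=eps),
        param_grid={"C": [0.1, 1, 10, 100],       # regularisation weight
                    "gamma": [0.01, 0.1, 1, 10]}, # RBF kernel width
        cv=5,
        scoring="neg_mean_squared_error",
    )
    search.fit(X, y)
    print(f"epsilon={eps}: best {search.best_params_}, "
          f"CV MSE {-search.best_score_:.4f}")
```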
SVR Study: Model Training, Selection and Prediction
[Plots of the cross-validation MSE at the selected hyperparameter values, CVMSE(IR*, HR*, CR*), and of the true returns (red) against the raw predictions (blue)]
SVR Technicalities: SVR vs. OLS
• Performance on the test set: SVR MSE = 0.04, OLS MSE = 0.23
Technical Note: Number of Training Errors vs. Model Complexity
[Schematic plot of training errors and test errors against model complexity, with functions ordered in increasing complexity: the minimum number of training errors falls as complexity grows, test error is U-shaped, and the best trade-off sits at the minimum of the test error]
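A small numerical sketch of this picture, with polynomial degree standing in for model complexity (the setup and data are illustrative, not from the slides):

```python
# Sketch: training error falls with complexity, test error is U-shaped.
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

for degree in [1, 2, 3, 5, 9]:                    # increasing model complexity
    coefs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```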
Variable selection for regression
• Akaike Information Criterion (AIC). Final prediction error: $\text{AIC} = -2 \log L + 2p$, where $\log L$ is the maximized log-likelihood and p the number of parameters; for least squares this is equivalent (up to a constant) to $m \log(SSE/m) + 2p$
Variable selection for regression
• Bayesian Information Criterion (BIC), also known as the Schwarz criterion. Final prediction error: $\text{BIC} = -2 \log L + p \log m$
• BIC tends to choose simpler models than AIC, since its penalty $\log m$ exceeds 2 once m > 7
Variable selection for regression
• R^2-adjusted: $\bar{R}^2 = 1 - (1 - R^2)\,\frac{m - 1}{m - p - 1}$, which, unlike $R^2$, does not automatically increase when predictors are added
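A sketch of comparing nested candidate models on these three criteria, assuming statsmodels (which reports AIC, BIC, and adjusted R^2 for an OLS fit) and synthetic data; neither is prescribed by the slides:

```python
# Sketch: compare candidate predictor subsets by AIC, BIC and adjusted R^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)  # X[:, 2] irrelevant

for cols in ([0], [0, 1], [0, 1, 2]):     # nested candidate models
    design = sm.add_constant(X[:, cols])  # add the intercept column
    fit = sm.OLS(y, design).fit()
    print(f"predictors {cols}: AIC={fit.aic:.1f}, BIC={fit.bic:.1f}, "
          f"adj R^2={fit.rsquared_adj:.3f}")
# BIC penalises extra parameters more heavily than AIC (log m vs. 2).
```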
Conclusion / Summary / References
• Classical Linear Regression (any introductory statistics/econometrics book)
• LASSO and Ridge Regression (linear and nonlinear): http://www-stat.stanford.edu/~tibs/lasso.html ; Bishop, 2006
• Nonparametric (local) regression estimation: kNN for regression, Decision trees, Smoothers: Alpaydin, 2004; Hastie et al., 2001
• Support Vector Regression (linear and nonlinear): Smola and Schoelkopf, 2003
• Variable/feature selection (AIC, BIC, R^2-adjusted): Hastie et al., 2001 (any statistics/econometrics book)