Chapter 4: Regression Topics
Credits: Hastie, Tibshirani, Friedman, Chapter 3; Padhraic Smyth notes
Regression Review
• Linear regression models a numeric outcome as a linear function of several predictors.
• It is the king of all statistical and data mining models:
  • ease of interpretation
  • mathematically concise
  • tends to perform well for prediction, even under violations of assumptions
Regression Review
• We will focus on regression as a predictive task.
• Characteristics:
  • numeric response - ideally real valued
  • numeric predictors - but not necessarily
• Goals of regression analysis for data mining:
  • explanation - which variables are most important and which are not needed
  • prediction
  • inference (significance and C.I.s of predictors) - not a focus
  • interactions of variables
Examples of Regression tasks
• credit scoring
• gas mileage of cars
• how much money will a customer spend?
• what factors are important for high cholesterol?
• predicting yields of a crop
• what strategies result in high scores in baseball (or cricket!)
Example: Prostate Cancer
• Data set 'prostate.txt'
• Predicting the prostate-specific antigen
• log cancer volume (lcavol) is the response variable
• predictors:
  • prostate weight (weight)
  • age
  • benign prostatic hyperplasia (lbph)
  • capsular penetration (lcp)
  • Gleason score (gleason)
  • percent Gleason of 4 or 5 (pgg45)
Prostate Data
• using summary() and cor() to look at the data
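A minimal R sketch of this first look, assuming prostate.txt is a whitespace-delimited file with a header row (the file name comes from the slide; the delimiter is an assumption):

    prostate <- read.table("prostate.txt", header = TRUE)
    summary(prostate)   # per-variable summaries: min, quartiles, mean, max
    cor(prostate)       # pairwise correlation matrix (assumes all columns numeric)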
Linear Regression Model
• Basic model: E[y | x] = a_0 + a_1 x_1 + ... + a_p x_p
• You are not modelling y itself, but the mean of y for a given x!
• Simple regression - one x
  • easy to describe, good for the mathematics, but not used often in data mining
• Multiple regression - many x
  • the response surface is a plane... harder to conceptualize
Linear Regression Model
• Assumptions:
  • linearity
  • constant variance
  • normality of errors: errors ~ Normal(0, sigma^2)
• Assumptions must be checked,
  • but if inference is not the goal, you can accept some deviation from the assumptions (don't tell the statisticians I said that!)
• Multicollinearity is also an issue
  • creates unstable estimates
Fitting the Model
• We can look at regression as a matrix problem.
• We want a score function to minimize over the parameter vector a:
    S(a) = (y - Xa)^T (y - Xa)
  which is minimized by
    \hat{a} = (X^T X)^{-1} X^T y
• Prediction then follows easily:
    \hat{y} = X \hat{a}
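To make the matrix algebra concrete, here is a hedged R sketch of least squares via the normal equations, on simulated data so it is self-contained (in practice lm() does this more stably, via a QR decomposition):

    set.seed(1)
    n <- 100
    X <- cbind(1, matrix(rnorm(n * 2), ncol = 2))   # intercept column of 1s + two predictors
    y <- X %*% c(2, 1, -0.5) + rnorm(n)             # true a = (2, 1, -0.5)
    a.hat <- solve(t(X) %*% X, t(X) %*% y)          # a-hat = (X'X)^{-1} X'y
    y.hat <- X %*% a.hat                            # predictions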
Comments on Multivariate Linear Regression
• Prediction is a linear function of the parameters.
• Model structure is simple:
  • a p-1 dimensional hyperplane in p dimensions
  • linear weights => interpretability
• Useful as a baseline model
  • to compare more complex models to
Limitations of Linear Regression
• The true relationship of X and Y might be non-linear
  • suggests generalizations to non-linear models
• Correlation/collinearity among the X variables
  • can cause numerical instability
  • problems in interpretability (identifiability)
• Includes all variables in the model...
  • but what if p = 100 and only 3 variables are related to Y?
Regression fit to Prostate data
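A sketch of how this fit could be produced in R; the formula follows the variable list on the earlier slide, so the column names are assumptions about prostate.txt:

    reg <- lm(lcavol ~ weight + age + lbph + lcp + gleason + pgg45,
              data = prostate)
    summary(reg)   # coefficient estimates, standard errors, R-squared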
Diagnostic Plots
Checking assumptions
• linearity
  • look to see if transformations make relationships 'more' linear
• normality of errors
  • diagnostic plots help show patterns of 'opening' variance or other strange behavior
• influence
  • highly 'influential' cases have undue impact on the analysis
Simplest way to check assumptions
• Plot of residuals vs. fits:
  • a scatter plot with residuals on the y axis and fitted values on the x axis
  • helps to identify non-linearity, outliers, and non-constant variance
A well-behaved residuals vs. fits plot
• The residuals "bounce randomly" around the 0 line. (Linearity is reasonable.)
• No one residual "stands out" from the basic random pattern of residuals. (No outliers.)
• The residuals roughly form a "horizontal band" around the 0 line. (Constant variance.)
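A minimal R sketch of this plot, assuming reg is a fitted lm object such as the prostate fit above:

    plot(fitted(reg), resid(reg),
         xlab = "Fitted values", ylab = "Residuals",
         main = "Residuals vs. fits")
    abline(h = 0, lty = 2)   # well-behaved residuals bounce randomly around this line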
Detecting Violations of Linearity
How a non-linear function shows up on a residuals vs. fits plot
• The residuals depart from 0 in some systematic manner:
  • such as being positive for small x values, negative for medium x values, and positive again for large x values
Corrections for Linearity Violations
• Finding the right correction is often not obvious.
• We need a curvilinear relationship between x and y.
  • Problem: there are many different possible curvilinear relationships
  • polynomials, exponential, logarithmic, sinusoidal, ...
• Approaches: trial-and-error, gut feeling, experience, domain knowledge, etc.
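A trial-and-error sketch in R: when the residual plot bends, try a polynomial term (data simulated here so the example is self-contained):

    set.seed(3)
    x <- runif(100, 0, 10)
    y <- 1 + 0.5 * x^2 + rnorm(100, sd = 2)
    lin  <- lm(y ~ x)             # a straight line misses the curvature
    quad <- lm(y ~ x + I(x^2))    # I() keeps the ^ as arithmetic inside a formula
    anova(lin, quad)              # F-test: the quadratic term helps on these data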
Detecting Violations of Non-constant Variance
How non-constant error variance shows up on a residuals vs. fits plot
• The plot has a "fanning" effect:
  • residuals are close to 0 for small x values and are more spread out for large x values
• Or, the spread of the residuals can vary in some more complex fashion.
Corrections for Non-constant Variance
• Transformation of the dependent variable y
  • Problem: which transformation to use?
• Logarithmic and square-root transformations often alleviate the "fanning" effect:
  • they make large values much smaller and leave small values relatively unchanged
• To find the right transformation: trial-and-error, gut feeling, experience, domain knowledge, etc.
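A minimal R sketch of the log correction, on simulated data with multiplicative noise so the fanning effect is real; it assumes y > 0:

    set.seed(4)
    x <- runif(200, 1, 10)
    y <- exp(1 + 0.3 * x + rnorm(200, sd = 0.4))   # multiplicative noise => fanning
    reg.log <- lm(log(y) ~ x)                      # errors are additive on the log scale
    plot(fitted(reg.log), resid(reg.log))          # the fanning is largely gone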
Detecting Violations of Independence
Residuals vs. order plot
• Helps assess serial correlation (a form of non-independence) of the error terms.
• If the data are obtained in a time (or space) sequence, this plot helps to see if there is any correlation between error terms that are near each other in the sequence.
• It's only appropriate if you know the order in which the data were collected!
Normal random noise
A time trend
Positive serial correlation
• Residuals tend to be followed, in time, by residuals of the same sign and about the same magnitude.
Negative serial correlation
• Residuals of one sign tend to be followed, in time, by residuals of the opposite sign.
Corrections for Independence Violations
• Model the autocorrelation either explicitly or implicitly.
• Simple way:
  • if the autocorrelation is simple lag-1 or seasonal, remove the main effect and model the residuals
• More complex way:
  • directly model the autocorrelation through a time series model
  • ARIMA models can tell you if there are periodic effects and model them
• R functions: arima(), acf()
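A hedged R sketch of the simple way: remove the main (trend) effect and then model the residuals, here on simulated time-ordered data with AR(1) errors:

    set.seed(5)
    t <- 1:200
    e <- arima.sim(model = list(ar = 0.7), n = 200)   # AR(1) errors
    y <- 5 + 0.1 * t + as.numeric(e)
    reg <- lm(y ~ t)                          # remove the main (trend) effect
    acf(resid(reg))                           # spikes outside the bands: serial correlation
    arima(resid(reg), order = c(1, 0, 0))     # then model the residuals, e.g. as AR(1)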
Detecting Violations of Normality
Normal (probability) plot of residuals
• Helps assess normality of the error terms.
• If the data follow a normal distribution with mean μ and variance σ², then a plot of the percentiles of the normal distribution versus the sample percentiles should be approximately linear.
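In R, assuming reg is a fitted lm object:

    qqnorm(resid(reg))   # approximately a straight line if the errors are normal
    qqline(resid(reg))   # reference line through the first and third quartiles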
Another example: Normal residuals
Another example: Normal residuals but with one outlier
Another example: Skewed (positive) residuals
Another example: Heavy-tailed residuals
Corrections for Normality Violations
• Corrections for skewness and heavy tails:
  • transformations of the response variable
  • Box-Cox transformation, logarithm, etc.
  • the goal is to make the distribution shaped almost like a bell
• Corrections for bi-modality:
  • break the data up into two (or more) clusters
  • alternatively, use mixture models for the analysis
• Corrections for outliers:
  • remove the outlier, if it is an invalid point (erroneous data entry, wrong population, etc.)
  • fit a different functional relationship, if it is a valid point
  • an outlier may suggest a curvilinear relationship, rather than a linear one
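A hedged sketch of the Box-Cox correction using the MASS package (shipped with R); y, x, and dat are placeholders rather than slide variables, and the response must be strictly positive:

    library(MASS)
    bc <- boxcox(y ~ x, data = dat)        # profile log-likelihood over lambda
    lambda <- bc$x[which.max(bc$y)]        # pick the lambda maximizing the likelihood
    reg.bc <- lm((y^lambda - 1) / lambda ~ x, data = dat)  # lambda near 0 means use log(y)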
Checking assumptions
• Influence
  • H is called the hat matrix: H = X (X^T X)^{-1} X^T, so that \hat{y} = H y
  • the ith diagonal element of H, h_i, is the leverage of observation i
  • the leverage h_i quantifies the influence that the observed response y_i has on its predicted value \hat{y}_i
  • it measures the distance between the X values for the ith case and the means of the X values for all n cases
  • the leverage h_i is a number between 0 and 1 inclusive
• Let's see a case where it goes wrong…
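In R the leverages come straight from hatvalues(); reg is again an assumed lm fit:

    h <- hatvalues(reg)       # h_i = ith diagonal element of X (X'X)^{-1} X'
    sum(h)                    # equals the number of fitted parameters
    which(h > 2 * mean(h))    # a common rule of thumb for flagging high leverage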
Scottish running data

       distance  climb   time
    1       2.5    650  16.05
    2       6.0   2500  48.21
    3       6.0    900  33.39
    4       7.5    800  45.36
    5       8.0   3070  62.16
    6       8.0   2866  73.13
    …
Residual plots in R
• "hist" to create histograms
• "qqnorm" (base R) or "qqPlot" (car package) to create normal probability plots
• "plot" to create regular scatterplots
• Getting the residuals and fitted values:
  • fit the regression model in the form reg = lm(y ~ x)
  • call the residuals via reg$res
  • call the fitted values via reg$fit
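Base R also offers a one-call shortcut for the standard diagnostic plots. The sketch below uses the hills data from MASS, which appears to be the Scottish running data from the earlier slide (its column names, dist, climb, time, are an assumption to verify):

    library(MASS)
    reg <- lm(time ~ dist + climb, data = hills)
    par(mfrow = c(2, 2))   # 2x2 panel
    plot(reg)              # residuals vs fits, QQ plot, scale-location, leverage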
Interpretation of Results
• Parameter estimates:
  • if the jth predictor variable x_j is increased by one unit, while all the other predictor variables are kept fixed, then the response variable y will increase by a_j
  • this is the conditional effect of each predictor, holding all the others constant
  • the size of the effect (not its significance) depends on the units
• Multiple correlation coefficient, R^2:
  • measures the ratio between the regression sum of squares (how much of the variance the regression explains) and the total sum of squares (how much variation there is altogether): R^2 = SS_reg / SS_tot
  • if it is close to 1, your fit is good. But be careful!
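Both quantities can be read off a fitted lm object reg (an assumed fit, e.g. the prostate model):

    coef(reg)                # a_j: change in y per unit change in x_j, others fixed
    summary(reg)$r.squared   # R^2 = SS_reg / SS_tot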
Model selection: finding the best k variables
• If noisy variables are included in the model, they can affect the overall performance.
• It is best to remove any predictors which have no effect, lest random patterns look significant.
• But how do we search over the 2^p possible models? Heuristic search is used to search over the model space:
  • forward search (greedy)
  • backward search (greedy)
  • generalizations (add or delete) - think of operators in a search space
  • branch and bound techniques (package 'leaps')
• The score function has to penalize for complexity, or you can use cross-validation.
• This type of variable selection problem is common to many data mining algorithms:
  • an outer loop that searches over variable combinations
  • an inner loop that evaluates each combination
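A hedged R sketch of both search styles, reusing the assumed prostate column names from earlier; step() scores models by AIC, which penalizes complexity:

    full <- lm(lcavol ~ ., data = prostate)
    step(full, direction = "backward")       # greedy backward elimination by AIC
    library(leaps)
    subsets <- regsubsets(lcavol ~ ., data = prostate)   # branch and bound
    summary(subsets)$which                   # best variable subset of each size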
Stepwise Cautions
• Stepwise tends to be conservative, but can still remove good variables due to its greedy search.
• It deals somewhat arbitrarily with multicollinearity:
  • the interpretation of variable A changes when variable B disappears
• Elaborate techniques tend to overfit the data:
  • you can help this by using cross-validation
Generalizing Linear Regression
Complexity versus Goodness of Fit
[figure: training data, y plotted against x]
Complexity versus Goodness of Fit
[figures: the training data, and a fitted line that may be too simple; y plotted against x in each panel]