Regression: Review and Extension
The Formula for a Straight Line
• Only one possible straight line can be drawn once the slope and Y intercept are specified
• The formula for a straight line is: Y = bX + a
  • Y = the calculated value for the variable on the vertical axis
  • a = the intercept
  • b = the slope of the line
  • X = a value for the variable on the horizontal axis
• Once this line is specified, we can calculate the corresponding value of Y for any value of X entered (see the sketch below)
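A minimal sketch of plugging values into Y = bX + a; the slope and intercept used here are made up purely for illustration.

```python
# Hypothetical slope and intercept, chosen only to illustrate Y = bX + a.
a = 2.0   # intercept: the value of Y when X = 0
b = 0.5   # slope: the change in Y for a one-unit change in X

def predict(x):
    """Return the Y value the line implies for a given X."""
    return b * x + a

for x in (0, 1, 10):
    print(f"X = {x:>2}  ->  Y = {predict(x):.1f}")
```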
The Line of Best Fit
• Real data do not conform perfectly to a straight line
• The best-fit straight line is the one that minimizes the amount of variation of the data points from the line
• Note that this is a key idea: you get to choose how you want to minimize some estimate of variability about a regression line
• The typical approach is the least squares method
• The equation for this line can be used to predict or estimate an individual's score on Y on the basis of his or her score on X
Least Squares Modeling
• When the relation between variables is expressed in this manner, we call the relevant equation(s) mathematical models
• The intercept and weight values are called the parameters of the model
• We'll assume that our models are causal models, such that the variable on the left-hand side of the equation is being caused by the variable(s) on the right side
Terminology
• The values of Y in these models are often called predicted values, sometimes abbreviated as Y-hat or Ŷ
• They are the values of Y that are implied or predicted by the specific parameters of the model
Parameter Estimation
• In estimating the parameters of our model, we are trying to find a set of parameters that minimizes the error variance. In other words, we want the sum of the squared residuals to be as small as it possibly can be.
• The process of finding this minimum value is called least-squares estimation.
Least-squares estimation
• The relevant equations: the residual for each case is Yi − Ŷi, and least squares chooses a and b to minimize the sum of the squared residuals, Σ(Yi − Ŷi)² = Σ(Yi − (bXi + a))²
Estimates of a and b
• Estimating the slope (the regression coefficient): b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)², equivalently b = r(sY / sX)
• Estimating the Y intercept: a = Ȳ − bX̄
• These calculations ensure that the regression line passes through the point on the scatterplot defined by the means of X and Y (see the sketch below)
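A small numpy sketch of these estimates on made-up data; it also checks that the fitted line passes through the point of means.

```python
import numpy as np

# Toy data for illustration only (not from the slides' dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept chosen so the line passes through (x_bar, y_bar): a = y_bar - b * x_bar
a = y_bar - b * x_bar

print(f"slope b = {b:.3f}, intercept a = {a:.3f}")
print(np.isclose(b * x_bar + a, y_bar))  # True: the line passes through the means
```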
Standardized regression coefficient
• The standardized slope is often given in output, and will have added usefulness within multiple regression
• When normally distributed scores are changed into Z scores, the mean is 0 and the standard deviation is 1. Referring to our previous formula, b = r(sY / sX), with sX = sY = 1 the slope reduces to r
• So r would be equal to the slope, and is interpreted as: a 1 SD unit change in X leads to a b SD unit change in Y (see the sketch below)
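A quick check, on the same made-up data, that the slope of the z-scored variables equals r.

```python
import numpy as np

# Hypothetical data, just to show that the slope for z-scored variables equals r.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# z-score both variables (mean 0, sd 1)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Slope of the regression of zy on zx
beta = np.sum(zx * zy) / np.sum(zx ** 2)

r = np.corrcoef(x, y)[0, 1]
print(f"standardized slope = {beta:.4f}, r = {r:.4f}")  # the two match
```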
What can the model explain?
• Total variability in the dependent variable (observed − mean) comes from two sources
  • Variability predicted by the model, i.e. what variability in the dependent variable is due to the independent variable
    • How far off our predicted values are from the mean of Y
  • Error or residual variability, i.e. variability not explained by the independent variable
    • The difference between the predicted values and the observed values
• s²Y = s²Ŷ + s²(Y − Ŷ): total variance = predicted variance + error variance
R-squared - the coefficient of determination
• The square of the correlation, r², is the fraction of the variation in the values of Y that is explained by the regression of Y on X
• Conceptually: R² = (variance of the predicted values ŷ) / (variance of the observed values y) (see the sketch below)
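Continuing with the same toy data, a sketch verifying the variance decomposition from the previous slide and that R² equals the variance of the predicted values divided by the variance of the observed values, which in simple regression is also r².

```python
import numpy as np

# Toy data, as in the earlier sketches.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = b * x + a
resid = y - y_hat

var_total = np.var(y, ddof=1)       # total variance of observed Y
var_pred  = np.var(y_hat, ddof=1)   # variance of the predicted values
var_error = np.var(resid, ddof=1)   # variance of the residuals

print(np.isclose(var_total, var_pred + var_error))                      # True
print(np.isclose(var_pred / var_total, np.corrcoef(x, y)[0, 1] ** 2))   # True: R^2 = r^2
```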
R²
• The shaded portion shared by the two circles represents the proportion of shared variance: the larger the area of overlap, the greater the strength of the association between the two variables
• [Figure: a Venn diagram showing r² as the proportion of variability shared by two variables (X and Y)]
Interpreting regression summary
• Intercept
  • Value of Y if X is 0
  • Often not meaningful, particularly if it's practically impossible to have an X of 0 (e.g. weight)
• Slope
  • Amount of change in Y seen with a 1 unit change in X
• Standardized regression coefficient
  • Amount of change in Y, in standard deviation units, seen with a 1 standard deviation unit change in X
  • In simple regression it is equivalent to the r for the two variables
• Standard error of estimate
  • Essentially the standard deviation of the residuals
  • The difference is that it divides by the residual df for the model (SEE) rather than by n − 1 (SD)
  • As R² goes up, it goes down
• Statistical significance of the model
• R²
  • Proportion of variance explained by the model (see the sketch below)
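A hedged sketch of where these summary pieces live in a fitted model, using statsmodels' OLS on simulated data; the variable names and simulated values are illustrative only.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: true intercept 3, true slope 0.5, plus noise.
rng = np.random.default_rng(42)
x = rng.normal(50, 10, size=100)            # e.g. a predictor such as weight
y = 3.0 + 0.5 * x + rng.normal(0, 5, 100)

X = sm.add_constant(x)          # adds the intercept column
fit = sm.OLS(y, X).fit()

print(fit.params)               # intercept and slope
print(fit.rsquared)             # R^2: proportion of variance explained
print(np.sqrt(fit.mse_resid))   # standard error of estimate (residual SD using residual df)
print(fit.f_pvalue)             # statistical significance of the model
```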
The Caution of Causality
• Correlation does not prove causality, but one can't establish causality without correlation
• One thing to remember: even when things look good for your model, other models may be as viable or even better
Assumptions in regression • For starters: • Linear relationship between the independent and dependent variable • Residuals are normally distributed • Residuals are independent
Heteroscedasticity
• We also assume residuals have the same variance about the regression line (homoscedasticity)
• [Figure: an example of heteroscedasticity]
Interval measures and measurement without error
• Ordinal variables are not to be used, as the differences among levels are not constant
  • But we like our Likerts!
  • Most suggest using at least 5 categories to lessen the impact of ordinal differences (7 or more is better)
• Measurement without error
  • We must have reliable measures involved
  • More random error will lead to larger error variance
  • The less reliable the measures, the smaller the R²
Violating assumptions
• In the usual situation, slight problems may not result in much change in Type I error
• However, Type II error will be a major concern with even modest violations
• With multiple violations, Type I error may also suffer
• Additional assumptions will be made when there are multiple independent variables
Outliers
• As outliers can greatly influence r, they will naturally influence any analysis using it
• Detecting and dealing with outliers is part of the process of regression analysis
• One issue is distinguishing univariate vs. multivariate outliers
  • While a data point might be an outlier on a variable, it may not be as far as the model is concerned
  • Conversely, what might be an outlier for the model might not have its individual variable values noted as outliers
Robust Regression • A single unusual point can greatly distort the picture regarding the relationship among variables • Heteroscedasticity, even in ‘normal’ situations, inflates the standard error of estimate and decreases our estimate of R2 • Nonnormality can hamper our ability to come up with useful interval measures for slopes
Robust Regression
• While least squares regression performs well in general if we are conducting hypothesis testing regarding independence, it is poor at detecting associations in less-than-ideal circumstances
• What we would like are methods that perform well in a variety of circumstances, and compete well with least-squares regression under ideal conditions
• To be discussed:
  • Theil-Sen estimator
  • Regression via robust correlation
  • Least absolute value (L1) regression
  • Least trimmed squares
  • Least trimmed absolute value
  • Least median of squares
  • M-estimators
  • Deepest regression line
Theil-Sen Estimator
• For any pair of data points regarding a relationship between two variables, we can plot those 2 points, produce a line connecting them, and note its slope
  • E.g. if we had 4 data points we could calculate 6 slopes
  • X = 1, 2, 3, 4
  • Y = 5, 7, 11, 15
• If each of those slopes is weighted by the squared difference in X values for the corresponding pair of points, the weighted average of all the slopes is the LS slope for the model
  • E.g. create a line for the points (1,5) and (2,7)
  • Slope = 2
  • Weight by (1 − 2)² = 1
• What if, instead of a weighted average, the median of those slopes is chosen as our model slope estimate?
  • That, in essence, is the Theil-Sen estimator (see the sketch below)
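A short sketch using the slide's four points, computing all six pairwise slopes and comparing the weighted mean (the LS slope) with the median (the Theil-Sen slope).

```python
from itertools import combinations
import numpy as np

# The four points from the slide: X = 1, 2, 3, 4 and Y = 5, 7, 11, 15.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 7.0, 11.0, 15.0])

pairs = list(combinations(range(len(x)), 2))                     # 6 pairs of points
slopes  = np.array([(y[j] - y[i]) / (x[j] - x[i]) for i, j in pairs])
weights = np.array([(x[j] - x[i]) ** 2 for i, j in pairs])

ls_slope = np.sum(weights * slopes) / np.sum(weights)  # weighted mean -> least-squares slope
ts_slope = np.median(slopes)                           # median -> Theil-Sen slope

print("pairwise slopes:", np.round(slopes, 2))         # [2. 3. 3.33 4. 4. 4.]
print(f"LS slope (weighted mean) = {ls_slope:.2f}")    # 3.40
print(f"Theil-Sen slope (median) = {ts_slope:.2f}")    # 3.67
```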
Theil-Sen Estimator
• Advantages
  • Competes with LS regression in ideal conditions
  • More resistant
  • Reduced standard error in problematic situations, e.g. heteroscedasticity
• We can also calculate CIs using the percentile bootstrap method (a sketch follows below)
*It has been shown that taking the median of the slopes performs better than approaches that trim less
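A sketch of a percentile-bootstrap confidence interval for the Theil-Sen slope, using scipy's theilslopes on simulated data; the sample size, noise level, and number of bootstrap resamples are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import theilslopes

# Simulated data: true intercept 1.0, true slope 0.8, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=40)
y = 1.0 + 0.8 * x + rng.normal(0, 2, size=40)

n_boot = 2000
boot_slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, len(x), size=len(x))     # resample (x, y) pairs with replacement
    boot_slopes[i] = theilslopes(y[idx], x[idx])[0]

lo, hi = np.percentile(boot_slopes, [2.5, 97.5])   # percentile bootstrap 95% CI
print(f"Theil-Sen slope: {theilslopes(y, x)[0]:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```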
Regression via robust correlation
• We could simply replace our regular r with a more robust estimate
• This is possible, but more work needs to be done to figure out which approaches might be more viable, and it appears bias might be a problem in some cases with this approach (e.g. heteroscedastic situations using a winsorized r)
Least Absolute Value
• Instead of minimizing the sum of the squared residuals, we could choose a method that attempts to minimize the sum of the absolute residuals
  • L1 regression
• Problem: while this protects against outliers on Y, it does not protect against outliers on the predictor (see the sketch below)
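A sketch of L1 regression as a direct minimization of the sum of absolute residuals; the simulated data and the use of a Nelder-Mead search started from the least-squares fit are illustrative choices, not a canonical implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data with a few outliers in Y.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=50)
y[:3] += 20                        # gross outliers in Y

def sum_abs_resid(params):
    """L1 criterion: the sum of absolute residuals for intercept a and slope b."""
    a, b = params
    return np.sum(np.abs(y - (a + b * x)))

# Start from the least-squares fit and minimize the L1 criterion.
ls = np.polyfit(x, y, 1)           # returns [slope, intercept]
res = minimize(sum_abs_resid, x0=[ls[1], ls[0]], method="Nelder-Mead")
print("L1 intercept, slope:", np.round(res.x, 3))
```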
Least Trimmed Squares
• The least trimmed squares approach involves trimming the smallest and largest residuals
• If h is the number of values left after trimming, the goal is to minimize the sum of the squared residuals of the remaining data
• Note again that the optimal trimming amount is about .2 (see the sketch below)
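A rough sketch of the idea, following the slide's description of trimming the smallest and largest residuals and minimizing the squared residuals that remain; the simulated data, the 20% trimming from each tail, and the use of a global optimizer for this non-smooth criterion are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Simulated data with some gross outliers.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=50)
y[:5] -= 30                                     # gross outliers

def trimmed_ss(params, trim=0.2):
    """Sum of squared residuals after trimming the smallest and largest residuals."""
    a, b = params
    resid = np.sort(y - (a + b * x))
    g = int(np.floor(trim * len(x)))            # number trimmed from each tail
    kept = resid[g:len(x) - g]                  # h = n - 2g residuals remain
    return np.sum(kept ** 2)

# The criterion is non-smooth with many local minima, so a global search
# is used here rather than a gradient-based optimizer.
res = differential_evolution(trimmed_ss, bounds=[(-50, 50), (-10, 10)], seed=3)
print("Trimmed-squares intercept, slope:", np.round(res.x, 3))
```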
S-plus menu example
• The first two screenshots show the standard menu availability of least trimmed squares regression
• The last uses the robust library
Least Trimmed Absolute Value • Same approach, but rather than minimize the trimmed squared residuals, we minimize the sum of the absolute residuals remaining after trimming • This may be preferable to LTS in heteroscedastic situations
Least Median of Squares • Find the slope and intercept that minimizes the median of the squared residuals • Doesn’t seem to perform as well generally as other robust approaches
M-estimators
• In general, regression using M-estimators minimizes the sum of some function of the residuals, Σ ξ(ri)
  • Where ξ is a function used to guard against outliers and heteroscedasticity
  • E.g. ξ(r) = r² would give us our regular LS result
• Although there are many M-estimator approaches one might choose from, given the newness of the approach in general and our relative lack of research regarding it, Wilcox suggests the adjusted M-estimator seems to work well in practical situations
  • It first checks for "bad" leverage points and may ignore them in estimating the slope and intercept (a general M-estimation sketch follows below)
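Wilcox's adjusted M-estimator, with its leverage-point check, isn't shown here; as a general illustration of M-estimation, the sketch below fits a Huber-type M-estimator with statsmodels' RLM and compares it to OLS on simulated data containing a few outliers.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: a Huber-type M-estimator downweights large residuals
# instead of squaring them, unlike ordinary least squares.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=60)
y = 0.5 + 1.2 * x + rng.normal(0, 1, size=60)
y[:4] += 25                                          # outliers in Y

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
rlm = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS   intercept, slope:", np.round(ols.params, 3))
print("Huber intercept, slope:", np.round(rlm.params, 3))
```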
Leverage points
• Leverage is one aspect of 'outlierness' that we'll mention here but come back to later
• It is primarily concerned with outliers among the predictors
  • E.g. Mahalanobis distance
• Good leverage points may be extreme with regard to the predictors but are not outliers with regard to the model
  • In LS, they can decrease the standard error
• Bad leverage points are extreme and would not lie close to a line that would fit most of the data well; they have a profound effect on your estimate of the slope
Deepest regression line
• One of the more recent developments, and it may be of practical use as it is researched further
• It is really more about linear fit (i.e. matching parameters to data) as opposed to focusing on the observations/residuals themselves
• Depth is the number of observations that would need to be removed to make the data 'nonfit'
• Appears to have a breakdown point of about 1/3 regardless of the number of predictors
Summary
• In single-predictor situations, alternatives are available that perform well in ideal situations and much better than the LS approach in others
  • Theil-Sen in particular
• While we have kept to the single predictor, that will rarely be our research situation when using regression analysis
• These methods can also be generalized to the multiple-predictor setting, but their breakdown point (i.e. resistance advantage) decreases as more predictors enter the equation
Summary
• Again we call on the Tukey suggestion:
  • "… just which robust/resistant methods you use is not important – what is important is that you use some. It is perfectly proper to use both classical and robust/resistant methods routinely, and only worry when they differ enough to matter. But when they differ, you should think hard."
• A general approach (a sketch of this workflow follows below):
  • Check for linearity, perhaps using a smoother
  • If OK there, use an estimator with a breakdown point of about .2-.3, and compare with the LS output
  • If notable differences between LS and robust exist, figure out why and determine which is more appropriate
  • If the assumptions are tenable and there is little difference between LS and robust, feel comfortable going with the LS output
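A sketch of that general workflow in Python, assuming statsmodels for the smoother and the robust fit; the simulated data and the choice of a Huber M-estimator as the robust comparison are illustrative stand-ins, not the specific estimators discussed above.

```python
import numpy as np
import statsmodels.api as sm

# Simulated, roughly linear data.
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, size=80))
y = 1.0 + 0.7 * x + rng.normal(0, 1, size=80)

# Step 1: eyeball linearity with a smoother (plot these (x, smoothed y) pairs).
smooth = sm.nonparametric.lowess(y, x, frac=0.5)
print(smooth[:3])

# Step 2: compare the least-squares fit with a robust fit.
X = sm.add_constant(x)
ls_fit  = sm.OLS(y, X).fit()
rob_fit = sm.RLM(y, X).fit()        # default Huber M-estimator

print("LS     slope:", round(ls_fit.params[1], 3))
print("Robust slope:", round(rob_fit.params[1], 3))
# If the smoother looks roughly linear and the two slopes agree,
# the least-squares output is probably safe to report; if they
# differ notably, figure out why before choosing.
```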