Lecture 2: Linear Regression
Bivariate Statistics (when two variables are involved)
Basic Univariate Statistics • Mean • Standard deviation • Variance • (you know this stuff)
Reason for doing a correlation • To determine whether two variables are related • Does one increase as the other increases? • Does one decrease as the other increases?
Not causality • Correlation doesn’t distinguish between IV and DV. • Causal relationship could go either way (or both ways, or neither way)
[Scatterplot: both axes 0–120; strong negative relationship, r = −.79]
Another way to visualize correlation • [Venn diagrams: Variance in A and Variance in B drawn as overlapping circles] • Small overlap = small correlation • Large overlap = large correlation
Correlation Coefficient • A measure of degree of relationship • Degree to which large scores go with large scores, and small scores with small scores • Based on covariance • The correlation is the covariance, standardized
What is covariance again? (Review) • Variance of one variable • How spread out the datapoints are • Average (squared) distance between the points and their mean • s² = Σ(X − X̄)² / (N − 1)
Variance vs. covariance • Variance = variation of a set of scores from their mean • Covariance = covariation of two sets of scores from their means • With positive covariance, if someone’s X is a lot higher than the mean of X, their Y tends to be a lot higher than the mean of Y.
Covariance CovXY = Σ(X − X̄)(Y − Ȳ) / (N − 1)
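The covariance formula above can be checked by hand; here is a minimal sketch in Python (the `covariance` helper and the data are made up for illustration, not from the lecture):

```python
def covariance(x, y):
    """Sample covariance: average co-deviation from the means, with N-1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Hypothetical paired scores
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(covariance(x, y))  # → 1.5
```

Positive covariance here reflects that above-average X values tend to pair with above-average Y values.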
Standardize the covariance to get the correlation • What does standardize mean? • Subtract the mean (so the scores are centered around 0) • Divide by the standard deviation (so the standard deviation and variance are rescaled to 1)
Correlation • r = CovXY / (SDX × SDY) • Would come out the same if you standardized X and Y separately and took the covariance of the standardized scores.
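A quick sketch showing that the two routes give the same r: dividing the covariance by the two standard deviations, versus standardizing first and taking the covariance of the z-scores (data and helper names are hypothetical):

```python
import statistics

def pearson_r(x, y):
    """r = covariance of X and Y, divided by the product of their SDs."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r1 = pearson_r(x, y)

# Route 2: standardize each variable, then take the covariance of the z-scores
zx = [(a - statistics.mean(x)) / statistics.stdev(x) for a in x]
zy = [(b - statistics.mean(y)) / statistics.stdev(y) for b in y]
r2 = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
```

Both routes give r ≈ .77 for this data: the correlation really is just the covariance after rescaling each variable to SD 1.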
R2 • R2 is the proportion of variance in Y that is explained by X. • Important for regression, because you’re trying to explain as much of the variance in Y as possible, by combining a lot of Xs.
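For simple regression with one X, R² can be computed directly as explained variance over total variance, and it equals r². A sketch with the same hypothetical data:

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Fit the least squares line (closed form for one predictor)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
pred = [b0 + b1 * a for a in x]

ss_total = sum((b - my) ** 2 for b in y)    # total variability of Y
ss_model = sum((p - my) ** 2 for p in pred) # variability explained by the line
r_squared = ss_model / ss_total             # → 0.6: X explains 60% of Y's variance
```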
[Venn diagram: predictors X1–X5 each overlapping with Y]
Purpose of Regression Analysis • Predict the values of a DV based on values of one or more IVs
Differences between correlation and regression • Correlation just tells you whether there is a linear association between two variables. • Regression actually gives you an equation for the line. • In regression, you can have multiple predictors. • Regression implies (but doesn’t prove) directionality or causality.
Simple Linear Regression • Simple: • One independent (predictor) variable • Linear: • Straight-line relationship between IV and DV
Simple Linear Regression • Evaluates the relationship between two continuous variables. • X is the predictor, explanatory, independent variable. • Y is the response, outcome, dependent variable. • Which is X and which is Y depends on your hypothesis about causation.
Linear Regression Model Y = β0 + β1X + ε • β0 = Y-intercept • β1 = slope • ε = disturbance (variance in Y that can’t be explained by X)
β0 • The Y-intercept • Where the line crosses the Y-axis • The expected value of Y when X=0
β1 • The slope • The expected change in Y for each one-unit change in X
ε (Disturbance) • Variance in Y that cannot be explained by our X’s • Its expected value is 0 • (but individual residuals are usually not 0, because the points don’t sit exactly on the line) • The ε for each observation is independent of the ε’s of the other observations
Example of a line without an error term(A function: Y = 1.8 X + 32) All points lie exactly on the line.
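The slide’s deterministic line, Y = 1.8X + 32, is the Celsius-to-Fahrenheit conversion; a tiny sketch confirming every point lies exactly on the line, with no error term:

```python
def f(celsius):
    """The function from the slide: Y = 1.8 * X + 32 (Celsius to Fahrenheit)."""
    return 1.8 * celsius + 32

# Every input lands exactly on the line — no scatter, no residuals
for c in [0, 20, 100]:
    print(c, f(c))  # → 0 32.0, 20 68.0, 100 212.0
```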
Example of a statistical relationship Points are near the line, but not exactly on it. A relationship with some “trend”, but also with some “scatter.”
Goal of regression • Find the best-fitting line that describes a scatterplot • Make the line as close as possible to all the points • Best-fitting line is the one with the smallest total (squared) residuals • (Residuals are the distances between each point and the line)
Most of the points are close to the line, but not exactly on the line. How do you know if this is the best-fitting line?
Goal • Find the line that makes the total residuals (distance from the points to the line) as small as possible
(Painless) review of calculus • How to find the smallest Something as possible: • Write an equation for it • Take the derivative (slope) • Solve for the value of X where the derivative=0 • That will be a minimum or maximum
Strategy • Write an equation for the distance between each point and the line • Sum it up over all points • Solve for the values of β0 and β1 such that the sum of the distances is a minimum
Use geometry and trigonometry to find the distance between the point and the line • [Diagram: a point’s X-distance and Y-distance from the line]
Equations…. • Y = β0 + β1X + ε • [Diagram: “The Black Box” — how do we get from the data to the equation?]
The least squares regression line • Using calculus, minimize the sum of squared residuals: take the derivative with respect to b0 and b1, set it to 0, and solve for b0 and b1 • This gives the least squares estimates: b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)², and b0 = Ȳ − b1X̄
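A sketch of those closed-form least squares estimates on made-up data: b1 is the co-deviation over the X-deviation squared, and b0 then places the line through the point of means (X̄, Ȳ):

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope: b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
# Intercept: b0 = Ȳ − b1·X̄
b0 = my - b1 * mx
print(b0, b1)  # → 2.2 0.6
```

These are the values of b0 and b1 at which the derivative of the summed squared residuals is zero, i.e. the minimum the calculus slide describes.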
Each point’s Y-value depends on 2 things: • Where the regression equation predicts it will be, given its X-value • How far away it is from its predicted position (the residual)
Our goal is to get the regression line to predict the point’s Y-value as closely as possible. • The unexplained part of the point’s location (residual) should be small in comparison to the explained part of the point’s location (predicted distance above or below the overall mean of Y).
Sums of Squares tell you how much of the point’s Y-value was explained by the regression line, and how much was unexplained.
Sum of Squares SSTotal = SSModel + SSError • Total Sample Variability = Explained Variability + Unexplained Variability
Mean Squares • Divide each type of Sum of Squares by its degrees of freedom • Model df = the number of predictors in the regression equation (k) • Error df = the number of observations minus the parameters estimated (N − k − 1)
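A sketch putting the pieces together on hypothetical data: the sums-of-squares decomposition and the mean squares with their degrees of freedom (k = number of predictors; one predictor here):

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, k = len(x), 1
mx, my = sum(x) / n, sum(y) / n

# Fit the least squares line
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
pred = [b0 + b1 * a for a in x]

ss_model = sum((p - my) ** 2 for p in pred)            # explained variability
ss_error = sum((b - p) ** 2 for b, p in zip(y, pred))  # unexplained (residuals)
ss_total = sum((b - my) ** 2 for b in y)               # = ss_model + ss_error

ms_model = ss_model / k            # model df = number of predictors
ms_error = ss_error / (n - k - 1)  # error df = N − k − 1
```

Note that ss_model + ss_error reproduces ss_total exactly, which is the decomposition on the Sum of Squares slide.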