Correlation & Regression. A correlation and regression analysis involves investigating the relationship between two (or more) quantitative variables of interest.
Correlation & Regression • A correlation and regression analysis involves investigating the relationship between two (or more) quantitative variables of interest. • The goal of such an investigation is typically to estimate (predict) the value of one variable based on the observed value of the other variable (or variables).
Quantitative Variables • Dependent Variable (Y) • the variable being predicted • called the response variable • Independent Variable (X) • the variable used to explain or predict Y • called the explanatory or predictor variable
Correlation & Regression • Correlation • Addresses the questions: “Is there a relationship between X and Y?” “If so, how strong is it?” • Regression • Addresses the question “What is the relationship between X and Y?”
Simple Linear Relationship • A linear (straight line) relationship between Y and a single X. • The form of the equation is Y = b0 + b1 X, where b0 is the y-intercept and b1 is the slope. • A scatter plot of Y versus X is useful for spotting linear relationships and obvious departures from linearity. • Always start with a scatter plot!
Correlation • A correlation exists between two variables when they are related in some way. • Linear Correlation Coefficient (r) • measures the strength of the linear relationship between X and Y • Properties of r • -1 ≤ r ≤ 1 • r = 1 for a perfect positive linear relationship • r = -1 for a perfect negative linear relationship • r = 0 if there is no linear relationship
Sample Correlation Coefficient • A statistic that is useful for estimating the linear correlation coefficient
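As a minimal sketch of how this statistic can be computed, assuming the usual Pearson form $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$ with $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$, $S_{xx} = \sum (x_i - \bar{x})^2$ and $S_{yy} = \sum (y_i - \bar{y})^2$. The function name `sample_correlation` and the toy data are illustrative assumptions, not taken from the slides.

```python
import math

def sample_correlation(x, y):
    """Pearson sample correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares about the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = sample_correlation(x, y)
print(round(r, 4))  # Sxy = 6, Sxx = 10, Syy = 6, so r = 6 / sqrt(60)
```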
Coefficient of Determination • The coefficient of determination is the proportion of variability in Y that can be explained by its linear relationship to X. • Computed by squaring the sample correlation coefficient (r²)
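The "proportion of variability explained" reading can be checked numerically. The sketch below, on hypothetical toy data, computes r² directly and also as 1 − SSE/SST from a least squares fit; for simple linear regression the two agree. The function name and the data are illustrative assumptions.

```python
def coefficient_of_determination(x, y):
    """Return (r squared, 1 - SSE/SST) for a simple linear least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)  # total variability in Y (SST)
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    # Least squares line and the variability it leaves unexplained (SSE)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    explained = 1 - sse / s_yy
    return r_squared, explained

# Hypothetical toy data, for illustration only
r_squared, explained = coefficient_of_determination([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r_squared, 4), round(explained, 4))
```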
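Hypothesis Testing of the Linear Correlation Coefficient • Appropriate Hypotheses: H0: ρ = 0 (no linear relationship) versus one of H1: ρ ≠ 0, H1: ρ < 0, or H1: ρ > 0, where ρ is the population linear correlation coefficient estimated by the sample coefficient r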
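Testing r • Test Statistic: $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, with n − 2 degrees of freedom, where r is the sample correlation coefficient and n is the sample size • Rejection Region (3 cases of H1, with ρ the population linear correlation coefficient) • Two-tailed: For H1: ρ ≠ 0, reject H0 for $|t| \ge t_{\alpha/2}$ • Left-tailed: For H1: ρ < 0, reject H0 for $t \le -t_{\alpha}$ • Right-tailed: For H1: ρ > 0, reject H0 for $t \ge t_{\alpha}$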
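The three rejection rules can be sketched in code, assuming the standard statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with n − 2 degrees of freedom. The function names, the example values r = 0.6 and n = 18, and the critical values (read from a t table for 16 degrees of freedom) are illustrative assumptions.

```python
import math

def correlation_t(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than 2 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def reject_h0(t, t_crit, tail):
    """Apply the rejection rule for the chosen alternative.

    t_crit is t_{alpha/2} for tail="two" and t_{alpha} for "left" or "right".
    """
    if tail == "two":
        return abs(t) >= t_crit
    if tail == "left":
        return t <= -t_crit
    if tail == "right":
        return t >= t_crit
    raise ValueError("tail must be 'two', 'left' or 'right'")

# Hypothetical example: r = 0.6 from n = 18 pairs, alpha = 0.05
t = correlation_t(0.6, 18)            # roughly 3.0
print(reject_h0(t, 2.120, "two"))     # t table: t_{0.025} with 16 df
print(reject_h0(t, 1.746, "right"))   # t table: t_{0.05} with 16 df
```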
Simple Linear Regression • The least squares regression line is our "best" line for explaining the relationship between Y and X. • It minimizes the sum of squared errors (the squared vertical distances between the observed values and the values predicted by the line). • The predicted value of Y for any X can be found by plugging X into the least squares regression line.
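Simple Linear Regression Line • The equation is: $\hat{Y} = \hat{b}_0 + \hat{b}_1 X$, where $\hat{b}_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$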
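A minimal sketch of the least squares fit, assuming the standard closed-form estimates (slope = $S_{xy}/S_{xx}$, intercept = $\bar{y}$ − slope · $\bar{x}$). The names `fit_line` and `predict` and the toy data are illustrative assumptions.

```python
def fit_line(x, y):
    """Return (b0, b1): least squares intercept and slope for y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

def predict(b0, b1, x_new):
    """Plug x_new into the fitted line."""
    return b0 + b1 * x_new

# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(round(b0, 4), round(b1, 4), round(predict(b0, b1, 3), 4))
```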
Proper Use of Correlation & Regression • Correlation does not imply causation. • Simple linear regression is appropriate only if the data cluster about a line. • Do not extrapolate beyond the range of the observed X values. • Do not apply the model to other populations. • For multiple regression, the size of a coefficient does not indicate the importance of its variable.
Effect of Extreme Values • Extreme values can have a very large effect on correlation and regression analysis. • Influential outliers can strongly affect model fit. • Regression Applet by Webster West
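The effect of a single extreme point can be illustrated with a small sketch. The data are hypothetical: five points with a positive association, plus one added point far from the rest. The helper name `pearson_r` is an assumption for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson sample correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_without = pearson_r(x, y)

# Add one extreme point at (10, 0) and recompute
r_with = pearson_r(x + [10], y + [0])

print(round(r_without, 3), round(r_with, 3))
```

In this toy case the single added point is enough to turn a positive correlation into a negative one.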
Model Assumptions for Inference • The difference between the observed and the model-predicted values is called the residual, and is denoted by e: $e = Y - \hat{Y}$ • The residuals are assumed to be independent and identically normally distributed with mean 0 and standard deviation $s_e$. • So for a particular X, the distribution of Y can be described as normal with mean equal to the predicted value of Y for that X, and standard deviation equal to $s_e$.
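A sketch of computing the residuals and estimating their standard deviation, assuming the common estimate $s_e = \sqrt{\mathrm{SSE}/(n-2)}$ (the n − 2 divisor reflects the two estimated line parameters). The function name and toy data are illustrative assumptions.

```python
import math

def residuals_and_se(x, y):
    """Return (residuals, s_e) for a simple linear least squares fit."""
    n = len(x)
    if n <= 2:
        raise ValueError("need more than 2 observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Residual = observed value minus value predicted by the line
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    s_e = math.sqrt(sse / (n - 2))
    return residuals, s_e

# Hypothetical toy data, for illustration only
residuals, s_e = residuals_and_se([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print([round(e, 2) for e in residuals], round(s_e, 4))
```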
Inference about the Simple Linear Regression Model Parameters • Is there a significant relationship between X and Y? H0: b1 = 0 versus H1: b1 ≠ 0 • Test Statistic: t = (estimated slope) / (standard error of the estimated slope), with n − 2 degrees of freedom
Inference about the Simple Linear Regression Model Parameters • Rejection Region (3 cases of H1) • Two-tailed: For H1: b1 ≠ 0, reject H0 for $|t| \ge t_{\alpha/2}$ • Left-tailed: For H1: b1 < 0, reject H0 for $t \le -t_{\alpha}$ • Right-tailed: For H1: b1 > 0, reject H0 for $t \ge t_{\alpha}$
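The slope test can be sketched as follows, assuming the usual standard error of the estimated slope, $s_e / \sqrt{S_{xx}}$, and n − 2 degrees of freedom. The function name, the toy data, and the critical value 3.182 (t table, $t_{0.025}$ with 3 degrees of freedom) are illustrative assumptions.

```python
import math

def slope_t_test(x, y):
    """Return (slope, standard error of slope, t statistic) for H0: slope = 0."""
    n = len(x)
    if n <= 2:
        raise ValueError("need more than 2 observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))     # residual standard deviation
    se_b1 = s_e / math.sqrt(s_xx)      # standard error of the slope
    return b1, se_b1, b1 / se_b1

# Hypothetical toy data, for illustration only
slope, se_slope, t = slope_t_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
reject_two_tailed = abs(t) >= 3.182    # t_{0.025} with 3 df, from a t table
print(round(slope, 4), round(se_slope, 4), round(t, 4), reject_two_tailed)
```

With only five points, the toy data do not give enough evidence to reject H0 at the 5% level, even though the fitted slope is positive.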
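Inference about the Simple Linear Regression Model Parameters • Is there a non-zero y-intercept in the linear relationship between X and Y? H0: b0 = 0 versus H1: b0 ≠ 0 • Test Statistic: t = (estimated intercept) / (standard error of the estimated intercept), with n − 2 degrees of freedom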
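A matching sketch for the intercept test, assuming the usual standard error of the estimated intercept, $s_e \sqrt{1/n + \bar{x}^2 / S_{xx}}$. The function name and toy data are illustrative assumptions.

```python
import math

def intercept_t_test(x, y):
    """Return (intercept, standard error of intercept, t) for H0: intercept = 0."""
    n = len(x)
    if n <= 2:
        raise ValueError("need more than 2 observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    # Standard error of the intercept grows as x_bar moves away from 0
    se_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / s_xx)
    return b0, se_b0, b0 / se_b0

# Hypothetical toy data, for illustration only
intercept, se_intercept, t = intercept_t_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(intercept, 4), round(se_intercept, 4), round(t, 4))
```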
Inference about a Regression Line • E(Y) is the expected value of Y. For a given X, E(Y) is estimated by evaluating the simple linear regression equation at X. A t-distribution is used to construct a confidence interval for the true mean value of Y at that X.
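A sketch of the confidence interval for the mean of Y at a chosen value x0, assuming the standard form $\hat{y} \pm t_{\alpha/2}\, s_e \sqrt{1/n + (x_0 - \bar{x})^2 / S_{xx}}$ with n − 2 degrees of freedom. The function name, the toy data, and the critical value 3.182 (t table, 3 degrees of freedom, 95% confidence) are illustrative assumptions.

```python
import math

def mean_response_ci(x, y, x0, t_crit):
    """Confidence interval (low, high) for the mean of Y at x0."""
    n = len(x)
    if n <= 2:
        raise ValueError("need more than 2 observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x0
    # The margin is smallest at x0 = x_bar and grows as x0 moves away from it
    margin = t_crit * s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    return y_hat - margin, y_hat + margin

# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
ci_low, ci_high = mean_response_ci(x, y, x0=3, t_crit=3.182)
print(round(ci_low, 4), round(ci_high, 4))
```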
Inference about Y for a Given X • The predicted value of a single observation of Y for a given X is equal to the estimate of E(Y). A t-distribution centered on that estimate allows the construction of a prediction interval for a single observation at a particular value of X.
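A sketch of the prediction interval for a single new observation at x0, assuming the standard form $\hat{y} \pm t_{\alpha/2}\, s_e \sqrt{1 + 1/n + (x_0 - \bar{x})^2 / S_{xx}}$. The extra 1 under the square root accounts for the scatter of an individual observation around the mean, so this interval is always wider than the confidence interval for the mean at the same x0. The function name, toy data, and critical value 3.182 (t table, 3 degrees of freedom) are illustrative assumptions.

```python
import math

def prediction_interval(x, y, x0, t_crit):
    """Prediction interval (low, high) for one new observation of Y at x0."""
    n = len(x)
    if n <= 2:
        raise ValueError("need more than 2 observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x0
    # Leading 1: variability of a single observation about the line
    margin = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return y_hat - margin, y_hat + margin

# Hypothetical toy data, for illustration only
pi_low, pi_high = prediction_interval([1, 2, 3, 4, 5], [2, 4, 5, 4, 5],
                                      x0=3, t_crit=3.182)
print(round(pi_low, 4), round(pi_high, 4))
```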
Residual Analysis • Can be useful for checking the model assumptions, which for the linear regression model are: • Independent observations • Residuals have a $N(0, s_e^2)$ distribution • Plots can be useful for spotting model inadequacy
Variable Selection in Multiple Regression • Compare all possible regressions • Backward elimination • Forward Selection • Stepwise Elimination