
Correlation & Regression



Presentation Transcript


1. Correlation & Regression
• A correlation and regression analysis involves investigating the relationship between two (or more) quantitative variables of interest.
• The goal of such an investigation is typically to estimate (predict) the value of one variable based on the observed value of the other variable (or variables).

2. Quantitative Variables
• Dependent Variable (Y)
  • the variable being predicted
  • called the response variable
• Independent Variable (X)
  • the variable used to explain or predict Y
  • called the explanatory or predictor variable

3. Correlation & Regression
• Correlation
  • Addresses the questions: “Is there a relationship between X and Y?” and “If so, how strong is it?”
• Regression
  • Addresses the question: “What is the relationship between X and Y?”

4. Simple Linear Relationship
• A linear (straight line) relationship between Y and a single X.
• The form of the equation is Y = b0 + b1 X, where b0 is the y-intercept and b1 is the slope.
• A scatter plot of X versus Y is useful for spotting linear relationships, and obvious departures from linearity.
• Always start with a scatter plot!!
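The slide's advice is easy to act on in code. Below is a minimal sketch of such a scatter plot; Python with numpy and matplotlib is an assumption here (the slides name no software), and the x and y arrays are hypothetical example data.

```python
# A minimal sketch: always inspect a scatter plot before fitting a line.
# The arrays x and y are hypothetical example data, not from the slides.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

plt.scatter(x, y)
plt.xlabel("X (explanatory variable)")
plt.ylabel("Y (response variable)")
plt.title("Scatter plot of Y versus X")
plt.show()
```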

5. Correlation
• A correlation exists between two variables when they are related in some way.
• Linear Correlation Coefficient (r)
  • measures the strength of the linear relationship between X and Y
• Properties of r
  • -1 ≤ r ≤ 1
  • r = 1 for a perfect positive linear relationship
  • r = -1 for a perfect negative linear relationship
  • r = 0 if there is no linear relationship

6. Sample Correlation Coefficient
• A statistic useful for estimating the linear correlation coefficient:
• r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² · Σ(y − ȳ)² )
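A short sketch of computing r, both from the definition above and via numpy's built-in correlation matrix. The data are the same hypothetical arrays as before.

```python
# Sample correlation coefficient r, computed two ways.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Definition: r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))
xd, yd = x - x.mean(), y - y.mean()
r = np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2))

# Same value from numpy's correlation matrix
r_np = np.corrcoef(x, y)[0, 1]
print(r, r_np)  # both near 1 for this strongly linear example
```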

7. Coefficient of Determination
• The coefficient of determination is the proportion of variability in Y that can be explained by its linear relationship to X.
• Computed by squaring the sample correlation coefficient: r².
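A tiny continuation of the sketch above, showing the squaring step and its interpretation (same hypothetical data).

```python
# Coefficient of determination: r squared.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

r = np.corrcoef(x, y)[0, 1]
r2 = r**2  # proportion of variability in Y explained by the linear fit
print(f"r = {r:.3f}, r^2 = {r2:.3f}")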

8. Hypothesis Testing of the Linear Correlation Coefficient
• Appropriate hypotheses: H0: ρ = 0 versus H1: ρ ≠ 0 (or H1: ρ < 0, or H1: ρ > 0), where ρ is the population linear correlation coefficient.

9. Testing r
• Test Statistic: t = r √(n − 2) / √(1 − r²), with n − 2 degrees of freedom
• Rejection Region (3 cases of H1)
  • Two-tailed: For H1: ρ ≠ 0, reject H0 if |t| ≥ tα/2
  • Left-tailed: For H1: ρ < 0, reject H0 if t ≤ -tα
  • Right-tailed: For H1: ρ > 0, reject H0 if t ≥ tα
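The t test can be sketched directly from the formula above; scipy's pearsonr performs the same two-tailed test and serves as a cross-check. The use of scipy and the data are assumptions for illustration.

```python
# t test for the correlation coefficient:
# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

n = len(x)
r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# Two-tailed p-value from the t distribution with n - 2 df
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, p = {p:.4f}")

# Cross-check: scipy.stats.pearsonr performs the same two-tailed test
r_sp, p_sp = stats.pearsonr(x, y)
print(f"pearsonr: r = {r_sp:.3f}, p = {p_sp:.4f}")
```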

10. Simple Linear Regression
• The least squares regression line is our "best" line for explaining the relationship between Y and X.
• It minimizes the sum of squared errors (the squared vertical distances between the observed values and the values predicted by the line).
• The predicted value of Y for any X can be found by plugging X into the least squares regression line.

11. Simple Linear Regression Line
• The equation is: ŷ = b0 + b1 x, where b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1 x̄.
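A minimal sketch of these least squares formulas, with numpy's polynomial fit as a cross-check (hypothetical data as before).

```python
# Least squares slope and intercept from the slide's formulas.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

xd = x - x.mean()
b1 = np.sum(xd * (y - y.mean())) / np.sum(xd**2)  # slope
b0 = y.mean() - b1 * x.mean()                     # intercept

y_hat = b0 + b1 * x  # predicted values along the fitted line
print(f"y_hat = {b0:.2f} + {b1:.2f} x")

# Cross-check against numpy's polynomial fit (returns [b1, b0])
print(np.polyfit(x, y, 1))
```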

12. Proper Use of Correlation & Regression
• Correlation does not imply causation.
• Simple linear regression is appropriate only if the data cluster about a line.
• Do not extrapolate beyond the range of the observed X values.
• Do not apply the model to other populations.
• For multiple regression, the magnitude of a parameter does not by itself indicate the importance of its variable.

13. Effect of Extreme Values
• Extreme values can have a very large effect on a correlation and regression analysis.
• Influential outliers can substantially distort the fitted model; see the demonstration below.
• Regression Applet by Webster West
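A small demonstration (hypothetical data) of how one extreme point can swing the correlation, in the spirit of the applet the slide cites.

```python
# One influential outlier can drastically change r (it may even flip the sign).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.0, 10.1])
print("r without outlier:", np.corrcoef(x, y)[0, 1])

# Add a single extreme point far from the pattern
x2 = np.append(x, 20.0)
y2 = np.append(y, 2.0)
print("r with outlier:   ", np.corrcoef(x2, y2)[0, 1])
```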

14. Model Assumptions for Inference
• The difference between the observed value and the model-predicted value is called the residual, denoted by e: e = y − ŷ.
• The residuals are assumed to be independent and identically distributed, normal with mean 0 and standard deviation se.
• So for a particular X, the distribution of Y is normal with mean equal to the predicted value of Y at that X, and standard deviation equal to se.
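A sketch computing the residuals and the estimate se of the residual standard deviation (hypothetical data as before).

```python
# Residuals e = y - y_hat and the residual standard deviation estimate s_e.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)  # residuals: observed minus predicted

# n - 2 degrees of freedom, since two parameters (b0, b1) were estimated
s_e = np.sqrt(np.sum(e**2) / (len(x) - 2))
print(e, s_e)
```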

15. Inference about the Simple Linear Regression Model Parameters
• Is there a significant relationship between X and Y? H0: b1 = 0 versus H1: b1 ≠ 0
• Test Statistic: t = b1 / s_b1, where s_b1 = se / √Σ(x − x̄)², with n − 2 degrees of freedom
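A hedged sketch of the slope test from these formulas; scipy's linregress reports the same two-tailed test and serves as a cross-check.

```python
# t test for the slope: t = b1 / s_b1, with n - 2 degrees of freedom.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(e**2) / (n - 2))
s_b1 = s_e / np.sqrt(np.sum((x - x.mean())**2))

t = b1 / s_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, p = {p:.4f}")

# scipy.stats.linregress reports the same two-tailed slope test
res = stats.linregress(x, y)
print(f"linregress: slope = {res.slope:.3f}, p = {res.pvalue:.4f}")
```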

16. Inference about the Simple Linear Regression Model Parameters
• Rejection Region (3 cases of H1)
  • Two-tailed: For H1: b1 ≠ 0, reject H0 if |t| ≥ tα/2
  • Left-tailed: For H1: b1 < 0, reject H0 if t ≤ -tα
  • Right-tailed: For H1: b1 > 0, reject H0 if t ≥ tα

17. Inference about the Simple Linear Regression Model Parameters
• Is there a non-zero y-intercept in the linear relationship between X and Y? H0: b0 = 0 versus H1: b0 ≠ 0
• Test Statistic: t = b0 / s_b0, with n − 2 degrees of freedom
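In practice the intercept and slope tests are usually read off one fitted-model summary. A sketch using statsmodels (an assumed dependency, not named in the slides):

```python
# t tests for both b0 (intercept) and b1 (slope) from one OLS fit.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

X = sm.add_constant(x)          # column of ones for the intercept
fit = sm.OLS(y, X).fit()
print(fit.params)               # [b0, b1]
print(fit.tvalues, fit.pvalues) # t statistic and p-value for each
```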

18. Inference about a Regression Line
• E(Y) is the expected value of Y. For a given X, E(Y) is found by evaluating the simple linear regression equation at that X. A t-distribution allows construction of a confidence interval for the true mean value of Y at a given X.

19. Inference about Y for a Given X
• The expected observation of Y for a given X equals E(Y). A t-distribution centered on E(Y) allows construction of a prediction interval for a single observation of Y at a particular value of X. Because a single observation varies more than a mean, the prediction interval is wider than the confidence interval for E(Y); a sketch contrasting the two intervals follows.
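A sketch (statsmodels assumed, hypothetical data) contrasting the confidence interval for E(Y) at a given X with the wider prediction interval for a single new observation.

```python
# Confidence interval for the mean response vs prediction interval for one new Y.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

fit = sm.OLS(y, sm.add_constant(x)).fit()

x_new = np.array([[1.0, 3.5]])  # [intercept term, X value of interest]
frame = fit.get_prediction(x_new).summary_frame(alpha=0.05)

# mean_ci_* bounds the true mean E(Y); obs_ci_* bounds one new observation
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```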

20. Residual Analysis
• Residual analysis can be useful for checking the model assumptions, which for the linear regression model are:
  • Independent observations
  • Residuals have a N(0, σ²) distribution
• Plots can be useful for spotting model inadequacy, as in the sketch below.
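A sketch of a residuals-versus-fitted plot, the standard picture for spotting curvature or changing spread (hypothetical data as before).

```python
# Residuals-vs-fitted plot: points should scatter evenly around zero.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
e = y - y_hat

plt.scatter(y_hat, e)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted")
plt.show()
```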

21. Variable Selection in Multiple Regression
• Compare all possible regressions
• Backward elimination
• Forward selection
• Stepwise selection
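A hedged sketch of one of these strategies, backward elimination: repeatedly drop the least significant predictor until every remaining p-value falls below a chosen threshold. statsmodels and pandas are assumed dependencies, and the data (with only x1 truly related to y) are simulated for illustration.

```python
# Backward elimination sketch: drop the weakest predictor until all remain significant.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
y = 2.0 + 1.5 * X["x1"] + rng.normal(size=100)  # only x1 truly matters

cols = list(X.columns)
while cols:
    fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
    pvals = fit.pvalues.drop("const")       # ignore the intercept
    worst = pvals.idxmax()
    if pvals[worst] < 0.05:                 # all remaining terms significant
        break
    cols.remove(worst)                      # drop the weakest predictor

print("selected predictors:", cols)
```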
