Correlation & Regression Chapter 10
Introduction • Another area of inferential statistics involves determining whether a relationship exists between two or more quantitative variables • For example: • Business person deciding whether volume of sales for given month is related to amount of advertising the firm does that month • Educators interested in how number of hours a student studies is related to student’s score on an exam • Medical researchers interested in determining if caffeine is related to heart damage
Introduction cont. • Correlation • Statistical method used to determine whether a relationship between variables exists • Regression • Statistical method used to describe nature of relationship between variables, that is, positive or negative, linear or nonlinear • Questions to be answered • Are two or more variables related? • If so, what is strength of relationship? • What type of relationship exists? • What kind of predictions can be made from relationship?
Types of Relationships • Two types of relationships: simple and multiple • Simple relationship • One independent (explanatory) variable, and one dependent (response) variable • Simple relationship analysis is called simple regression • Positive relationship – exists when both variables increase or decrease at the same time • Negative relationship – exists when one variable increases as the other decreases, and vice versa • Multiple relationship • Two or more independent variables are used to predict one dependent variable
10.1 – Scatter Plots & Regression • In simple correlation and regression studies, researcher collects data on two quantitative variables to see whether a relationship exists between them • Independent variable can be controlled or manipulated (designated as x-axis variable) • Dependent variable cannot be controlled or manipulated (designated as y-axis variable)
Scatter Plots • Scatter plot • Graph of ordered pairs (x, y) of numbers consisting of independent variable x and the dependent variable y • Visual way to describe nature of relationship between independent and dependent variables • After plot is drawn, it should be analyzed to determine which type of relationship, if any, exists • Example 10 – 1 • P. 536 • Example 10 – 2 • P. 537 • Example 10 – 3 • P. 538
Correlation • Statisticians use correlation coefficient to determine strength of linear relationship between two variables • Pearson product moment correlation coefficient (PPMC) • Named after statistician Karl Pearson, who pioneered research in this area • Correlation coefficient • Computed from sample data; measures strength and direction of linear relationship between two variables • Symbol for sample correlation coefficient is r • Symbol for population correlation coefficient is ρ (Greek letter rho)
Formula for Correlation Coefficient • Range of the correlation coefficient is from -1 to +1 • Value of r close to +1 suggests strong positive linear relationship • Value of r close to -1 suggests strong negative linear relationship • Value of r close to 0 suggests weak or no linear relationship • Formula for Correlation Coefficient r • $r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$ • Where n is the number of data pairs
Example 10 – 4 • Compute the correlation coefficient for data in example 10-1
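The data for Example 10-1 are not reproduced in these slides, so the sketch below applies the standard computational form of the Pearson coefficient, r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²]), to the small hypothetical data set from the Regression Model slide in section 10.3. The function name `pearson_r` is an illustrative choice, not something taken from the textbook.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r from the computational (sum) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical data set from the Regression Model slide (section 10.3)
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]

r = pearson_r(xs, ys)
print(round(r, 3))  # 0.919, matching the value quoted on that slide
```

A value this close to +1 suggests a strong positive linear relationship, although with only five pairs it still has to be tested for significance.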
Significance of Correlation Coefficient • Question arises, when is value of r due to chance, and when does it suggest a significant linear relationship between the variables? • Since value of r is computed from samples, two possibilities exist when r is not equal to zero • Either value of r is high enough to conclude there is significant linear relationship OR • Value of r is due to chance • To make a decision, use a hypothesis-testing procedure similar to the traditional method
Population Correlation Coefficient • Sample correlation coefficient can be used as an estimator of ρ (rho) if following assumptions are valid • Variables x and y are linearly related • Variables are random variables • Two variables have a bivariate normal distribution • Population correlation coefficient • Correlation computed by using all possible pairs of data values (x, y) taken from a population
Hypothesis Testing • In hypothesis testing, one of these is true • $H_0\colon \rho = 0$ (no linear correlation between x and y in the population) • OR • $H_1\colon \rho \neq 0$ (significant linear correlation between the variables in the population) • When null hypothesis is rejected at a specific level, it means there is a significant difference between the value of r and 0 • When null hypothesis is not rejected, it means value of r is not significantly different from 0 and is probably due to chance • Do not have to identify claim, since question will always be whether there is significant linear relationship between variables
Formula for t Test • Formula for t Test for Correlation Coefficient • $t = r\sqrt{\frac{n - 2}{1 - r^2}}$ with degrees of freedom equal to n – 2 • Example 10 – 7 • Test the significance of the correlation coefficient found in example 10 – 4. Use α = 0.05 and r = 0.982
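A minimal sketch of the test in Example 10-7, assuming the usual statistic t = r·√((n − 2)/(1 − r²)). The slides give r = 0.982 and α = 0.05 but not the sample size, so n = 6 below is an assumption made only to have something to compute with. The critical value 2.776 is the standard two-tailed t table entry for d.f. = 4 at α = 0.05.

```python
import math

def t_statistic(r, n):
    """t test value for a sample correlation coefficient r with n data pairs."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = 0.982          # from Example 10-7
n = 6              # ASSUMPTION: sample size is not stated in the slides
critical = 2.776   # two-tailed t table value for alpha = 0.05, d.f. = n - 2 = 4

t = t_statistic(r, n)
reject_null = abs(t) > critical

print(f"t = {t:.3f}, d.f. = {n - 2}")
print("Reject H0: rho = 0" if reject_null else "Do not reject H0: rho = 0")
```

Under that assumed n the test value falls far beyond the critical value, so the null hypothesis ρ = 0 would be rejected. A different sample size changes both the test value and the critical value.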
Correlation and Causation • When a hypothesis test indicates that a significant linear relationship exists between variables, researchers must consider possibilities outlined next. • Possible Relationships Between Variables • When null hypothesis has been rejected for a specific α value, any of the following five possibilities can exist: • There is a direct cause-and-effect relationship between variables • There is a reverse cause-and-effect relationship between variables • Relationship between variables may be caused by a third variable • There may be a complexity of interrelationships among many variables • Relationship may be coincidental • Remember, correlation does not necessarily imply causation
10.2 – Regression • If value of correlation coefficient is significant, next step is to determine equation of regression line • Regression line • Data’s line of best fit • Allows researcher to see trend and make predictions on basis of the data
Line of Best Fit • Given a scatter plot, you must be able to draw the line of best fit • Line of best fit • Line drawn so that sum of squares of vertical distances from each point in scatter plot to line is at a minimum
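Determination of Regression Line Equation • Linear equation in algebra is written as $y = mx + b$, where m is the slope and b is the y intercept • In statistics, regression line is written as $y' = a + bx$, where a is the y’ intercept and b is the slope • Formula for Regression Line y’ = a + bx • $a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2}$ and $b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$ • Rounding rule: round values of a and b to three decimal places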
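The intercept and slope can be computed directly from the column sums. The sketch below assumes the standard least-squares formulas a = [(Σy)(Σx²) − (Σx)(Σxy)] / [nΣx² − (Σx)²] and b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²], and uses the hypothetical data set from the Regression Model slide in section 10.3, since the data for the textbook examples are not included here. The name `regression_line` is illustrative.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2  # shared by both formulas
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b

# Hypothetical data set from the Regression Model slide (section 10.3)
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]

a, b = regression_line(xs, ys)
print(f"y' = {a:.3f} + {b:.3f}x")      # rounded to three decimal places
print(f"prediction at x = 1: {a + b * 1:.1f}")
```

For these data the line works out to y' = 4.8 + 2.8x, which reproduces the predicted value y' = 7.6 at x = 1 quoted on the Regression Model slide.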
Examples • 10 – 9 • Find the equation of the regression line for data in example 10 – 4 and graph the line on the scatter plot of the data • 10 – 11 • Use the equation of the regression line to predict the income of a car rental agency that has 200,000 automobiles
Assumptions • Marginal change • Magnitude of change in one variable when the other variable changes exactly 1 unit • When r is not significantly different from 0, best predictor of y is mean of data values of y • For valid predictions, value of correlation coefficient must be significant, and two other assumptions must also be met: • For any specific value of the independent variable x, the value of the dependent variable y must be normally distributed about the regression line • The standard deviation of the dependent variable must be the same for each value of the independent variable
Checking for Outliers • All scatter plots should be checked for outliers • Influential points/influential observations • Points that can affect equation of regression line • When point on scatter plot seems to be an outlier, it should be checked to see if it is an influential point, because an influential point tends to “pull” regression line towards itself • Researchers should use their judgment whether to include influential observations in final analysis of data • If researcher feels observation is not necessary, then it should be excluded so it does not influence results of study • If researcher feels that it is necessary, he or she may want to obtain additional data values whose x values are near x value of influential point
10.3 – Coefficient of Determination & Standard Error of the Estimate • If correlation coefficient is significant, then equation of regression line can be determined • Other measures are associated with correlation and regression techniques: • Coefficient of determination • Standard error of the estimate • Prediction interval
Regression Model • Consider this hypothetical regression model • X values: {1, 2, 3, 4, 5} • Y values: {10, 8, 12, 16, 20} • Regression line equation is: y’ = 4.8 + 2.8x and r = 0.919 • For each value of x there is an observed y value and a predicted y’ value • When x = 1, y = 10, and y’ = 7.6 • Recall that the closer the y’ values are to the actual y values, the better the fit and the closer r is to +1 or -1
Total Variation • Total variation • Sum of squares of vertical distances each point is from mean • Explained variation • Variation obtained from the relationship (y’ predicted values) • Unexplained variation • Variation due to chance • *Total variation = Explained variation + unexplained variation*
Residuals & Least-Squares • Residual • Difference between actual value of y and predicted y’ value for a given x value • Least-squares line • Another name for a regression line, because it is the line for which the sum of squares of the residuals is the smallest possible value
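To make the least-squares property concrete, the sketch below computes the residuals for the hypothetical Regression Model data, taking the line as y' = 4.8 + 2.8x (the line consistent with the slide's y' = 7.6 at x = 1). It then compares its sum of squared residuals with a few nearby lines. The perturbation size of 0.5 is an arbitrary illustrative choice.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line y' = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data set from the Regression Model slide
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]
a, b = 4.8, 2.8  # least-squares intercept and slope for these data

# Residual = observed y minus predicted y'
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
best = sum_squared_residuals(a, b, xs, ys)

# Shift the intercept and/or slope slightly; every shifted line should fit worse
nearby = [
    sum_squared_residuals(a + da, b + db, xs, ys)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.5, 0.0, 0.5)
    if (da, db) != (0.0, 0.0)
]

print([round(e, 1) for e in residuals])
print(round(best, 1), all(s > best for s in nearby))
```

The residuals of a least-squares line also sum to zero, which is a quick sanity check on any hand computation.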
Coefficient of Determination • Coefficient of determination • Measure of the variation of the dependent variable that is explained by the regression line and the independent variable • Ratio of explained variation to total variation • Can also be found by squaring the r value • Coefficient of nondetermination • Found by subtracting coefficient of determination from 1
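The decomposition of total variation and the coefficient of determination can be checked numerically. This sketch again uses the hypothetical Regression Model data with the line y' = 4.8 + 2.8x; the variable names are illustrative.

```python
# Hypothetical data set from the Regression Model slide
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]
a, b = 4.8, 2.8  # least-squares line y' = a + b*x for these data

mean_y = sum(ys) / len(ys)
predicted = [a + b * x for x in xs]

# Total variation: squared distances of observed y values from their mean
total = sum((y - mean_y) ** 2 for y in ys)
# Explained variation: squared distances of predicted y' values from the mean
explained = sum((p - mean_y) ** 2 for p in predicted)
# Unexplained variation: squared residuals
unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))

r_squared = explained / total          # coefficient of determination
nondetermination = 1 - r_squared       # coefficient of nondetermination

print(f"total = {total:.1f}, explained = {explained:.1f}, unexplained = {unexplained:.1f}")
print(f"r^2 = {r_squared:.3f}, 1 - r^2 = {nondetermination:.3f}")
```

Here roughly 84.5% of the variation in y is explained by the regression line, and the square root of that ratio recovers the r = 0.919 quoted for this data set.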
Standard Error of the Estimate • When a y’ value is predicted for a specific x value, prediction is a point estimate • Standard error of the estimate • Denoted by $s_{est}$, is the standard deviation of the observed y values about the predicted y’ values • Prediction interval uses this statistic • Formula for standard error of estimate is $s_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}}$
Examples • 10 – 12 • A researcher collects the following data (page 569) and determines that there is a significant relationship between age of a copy machine and its monthly maintenance cost. Given the regression line for these data, find the standard error of the estimate
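A short sketch of the standard error of the estimate, assuming the usual definition s_est = √(Σ(y − y')² / (n − 2)). The copy machine data from the textbook are not included in these slides, so the hypothetical Regression Model data and the line y' = 4.8 + 2.8x are used instead.

```python
import math

def standard_error_of_estimate(xs, ys, a, b):
    """Standard deviation of observed y values about the line y' = a + b*x."""
    n = len(xs)
    squared_residuals = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared_residuals / (n - 2))

# Hypothetical data set from the Regression Model slide
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]
a, b = 4.8, 2.8  # least-squares line for these data

s_est = standard_error_of_estimate(xs, ys, a, b)
print(round(s_est, 3))
```

The divisor is n − 2 rather than n − 1 because two quantities, a and b, have already been estimated from the same data.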
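Prediction Interval • Prediction interval • Similar to a confidence interval where the standard error of the estimate is used to create an interval about a y’ value • By selecting an α value, you can achieve (1 − α) · 100% confidence that the interval contains the actual value of y that corresponds to the given x value • Formula for the Prediction Interval about a Value y’ • $y' - t_{\alpha/2}\, s_{est}\sqrt{1 + \frac{1}{n} + \frac{n(x - \bar{x})^2}{n\sum x^2 - (\sum x)^2}} < y < y' + t_{\alpha/2}\, s_{est}\sqrt{1 + \frac{1}{n} + \frac{n(x - \bar{x})^2}{n\sum x^2 - (\sum x)^2}}$ • With d.f. = n – 2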
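The sketch below builds a 95% prediction interval, assuming the standard form y' ± t(α/2) · s_est · √(1 + 1/n + n(x − x̄)² / [nΣx² − (Σx)²]) with d.f. = n − 2. It uses the hypothetical Regression Model data rather than the copy machine data, which are not reproduced in these slides. The critical value 3.182 is the two-tailed t table entry for d.f. = 3 at α = 0.05; the function name and the choice x = 3 are illustrative.

```python
import math

def prediction_interval(xs, ys, a, b, x0, t_crit):
    """Return (lower, upper) prediction interval for y at x = x0."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    mean_x = sum_x / n

    # Standard error of the estimate about the line y' = a + b*x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_est = math.sqrt(sse / (n - 2))

    y_pred = a + b * x0
    spread = math.sqrt(
        1 + 1 / n + n * (x0 - mean_x) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    margin = t_crit * s_est * spread
    return y_pred - margin, y_pred + margin

# Hypothetical data set from the Regression Model slide
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]
a, b = 4.8, 2.8    # least-squares line for these data
t_crit = 3.182     # t table value: alpha = 0.05, two-tailed, d.f. = 3

lower, upper = prediction_interval(xs, ys, a, b, x0=3, t_crit=t_crit)
print(f"{lower:.2f} < y < {upper:.2f}")
```

The interval is narrowest at x = x̄ and widens as x moves away from the mean of the x values, which is one reason predictions far outside the observed range are unreliable.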
Example 10 – 14 • For the data in Example 10 – 12, find the 95% prediction interval for the monthly maintenance cost of a machine that is 3 years old