Correlation and Linear Regression
By Arman Banimahd
From The Basic Practice of Statistics by David S. Moore

Response and Explanatory Variables
A response variable measures an outcome of a study.
• Also called a dependent variable.
An explanatory variable may explain or influence changes in a response variable.
• Also called an independent variable.

Demonstration
Suppose we have data on a large group of college students. For each pair, find the response variable and the explanatory variable.
• Amount of time spent studying for a statistics exam and grade on the exam.
• Weight in kilograms and height in centimeters.
• Hours per week of extracurricular activities and grade point average.
• Score on the SAT Mathematics exam and score on the SAT Critical Reading exam.

Scatterplot
A scatterplot shows the relationship between two quantitative variables measured on the same individuals.
Note: If there is an explanatory variable, always plot it on the horizontal axis of the scatterplot.

Examining a Scatterplot
In a graph, look for:
• Overall pattern
• Deviations
• Direction
• Form
• Strength
An outlier is an individual value that falls outside the overall pattern of the relationship.

Positive and Negative Associations
Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.
Ex. The relationship between your age and your father’s age.
Two variables have a negative association when the values of one variable tend to decrease as the values of the other variable increase.
Ex. The relationship between the amount of gas left in your car’s tank and the number of miles you travel.
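
Correlation
The correlation measures the direction and strength of the linear relationship between two quantitative variables. It is usually written as $r$.
Formula, for $n$ individuals with means $\bar{x}$, $\bar{y}$ and standard deviations $s_x$, $s_y$:
$$r = \frac{1}{n-1}\left[\left(\frac{x_1-\bar{x}}{s_x}\right)\left(\frac{y_1-\bar{y}}{s_y}\right) + \left(\frac{x_2-\bar{x}}{s_x}\right)\left(\frac{y_2-\bar{y}}{s_y}\right) + \cdots + \left(\frac{x_n-\bar{x}}{s_x}\right)\left(\frac{y_n-\bar{y}}{s_y}\right)\right]$$
or more compactly,
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$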
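
The correlation can be computed directly as the average product of standardized values. The following is a minimal sketch, assuming the usual sample definition (standard deviations and the final average both divide by n − 1). The function name `correlation` and the five data points are made up for illustration and do not come from the slides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = correlation(hours, scores)
r_flipped = correlation(hours, [-s for s in scores])  # reversing direction flips the sign
print(round(r, 3), round(r_flipped, 3))
```

Because both variables are standardized before multiplying, changing the units of either variable (for example, minutes instead of hours) leaves the value of r unchanged.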

Correlation
Note: Correlation is always between -1 and 1.
[Figures: three scatterplots illustrating positive correlation, negative correlation, and no correlation.]

Regression Lines
A regression line is a straight line that describes how a response variable $y$ changes as an explanatory variable $x$ changes.
Note: We often use a regression line to predict the value of $y$ for a given value of $x$.

Review of Straight Lines
Suppose that $y$ is a response variable (plotted on the vertical axis) and $x$ is an explanatory variable (plotted on the horizontal axis). A straight line relating $y$ to $x$ has an equation of the form
$$y = a + bx$$
where $b$ is the slope and $a$ is the $y$-intercept.
Note:
• The slope $b$ is the amount by which $y$ changes when $x$ increases by one unit.
• The $y$-intercept $a$ is the value of $y$ when $x = 0$.

Review of Straight Lines
[Figure: a line plotted on axes marked from 1 to 4, with Point 1 and Point 2 labeled. The point coordinates, the slope, and the equation of the line are not preserved.]
Y-intercept: 2
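
Finding the slope and intercept of a line from two points can be sketched as follows. The two points used here are hypothetical (the coordinates from the slide's figure are not available), and `line_through` is a name chosen for this example.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points with different x-values."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope-intercept form")
    slope = (y2 - y1) / (x2 - x1)   # change in y per one-unit increase in x
    intercept = y1 - slope * x1     # value of y when x = 0
    return slope, intercept

# Hypothetical points, not the ones from the slide.
slope, intercept = line_through((1, 2.5), (3, 3.5))
print(f"y = {intercept} + {slope}x")  # prints: y = 2.0 + 0.5x
```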

Exercise 1
We expect a car’s highway gas mileage to be related to its city gas mileage. Data for all 1198 vehicles in the government’s 2008 Fuel Economy Guide give the regression line for predicting highway mileage from city mileage:
highway mpg $= 4.62 + 1.109 \times$ city mpg
• What is the slope of this line? Say in words what the numerical value of the slope tells you.
  Slope = 1.109. On average, highway mileage goes up by 1.109 mpg for each additional 1 mpg of city mileage.
• What is the intercept? Explain why the value of the intercept is not statistically meaningful.
  Intercept = 4.62. This is the predicted highway mileage for a vehicle that gets 0 mpg in the city. No vehicle has a city mileage of 0, so this value lies far outside the range of the data and has no practical interpretation.
• Find the predicted highway mileage for a car that gets 16 miles per gallon in the city.
  Predicted highway mileage $= 4.62 + 1.109(16) \approx 22.4$ mpg.
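
The prediction in Exercise 1 is a direct substitution into the regression line. A small sketch, using the slope (1.109) and intercept (4.62) stated in the exercise; the function name `predict_highway_mpg` is only illustrative.

```python
SLOPE = 1.109      # slope stated in the exercise
INTERCEPT = 4.62   # intercept stated in the exercise

def predict_highway_mpg(city_mpg):
    """Predicted highway mileage from city mileage, using the line from Exercise 1."""
    return INTERCEPT + SLOPE * city_mpg

predicted = predict_highway_mpg(16)
print(round(predicted, 2))  # 22.36

# The slope is the change in the prediction for one extra city mpg.
step = predict_highway_mpg(17) - predict_highway_mpg(16)
print(round(step, 3))  # 1.109
```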

Exercise 2
You use the same bar of soap to shower each morning. The bar weighs 80 grams when it is new. Its weight goes down by 6 grams per day on average. What is the equation of the regression line for predicting weight from days of use?
$$\hat{y} = 80 - 6x$$
where $\hat{y}$ represents the predicted weight in grams and $x$ represents the number of days used.
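
The line in Exercise 2 follows from the starting weight (80 grams, the intercept) and the daily loss (6 grams, a negative slope). A quick sketch; the function name `soap_weight` is an assumption made for this example.

```python
def soap_weight(days_used):
    """Predicted weight in grams after a given number of days of use."""
    return 80 - 6 * days_used

for day in (0, 1, 7, 14):
    print(day, soap_weight(day))
```

The value for day 14 is negative (-4 grams), which is impossible for a real bar of soap. The line is only sensible for roughly the first 13 days, a small instance of the danger of predicting outside the range where the relationship holds.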

Least-Squares Regression Line
The least-squares regression line of $y$ on $x$ is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
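
The least-squares criterion can be checked numerically: the fitted line should have a smaller sum of squared vertical distances than any nearby line. The sketch below uses made-up data and computes the slope from the sums of cross-products and squares, which is algebraically equivalent to the usual textbook formula. The names `least_squares` and `sum_squared_errors` are chosen for this example.

```python
from statistics import mean

def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Intercept a and slope b of the line that minimizes the sum of squared errors."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

a, b = least_squares(xs, ys)
best = sum_squared_errors(xs, ys, a, b)

# Nudge the intercept and slope: every other line tried here does worse.
nudged = [sum_squared_errors(xs, ys, a + da, b + db)
          for da in (-0.5, 0, 0.5)
          for db in (-0.2, 0, 0.2)
          if (da, db) != (0, 0)]
print(round(best, 3), round(min(nudged), 3))
```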

Equation of Least-Squares Regression Line
Suppose that $y$ is a response variable and $x$ is the explanatory variable. Then the least-squares regression line is the line
$$\hat{y} = a + bx$$
where the slope is
$$b = r\,\frac{s_y}{s_x}$$
and the $y$-intercept is
$$a = \bar{y} - b\bar{x}$$
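
Some Review Formulas
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}$$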
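
Putting the pieces together, the least-squares line can be computed from the two means, the two standard deviations, and the correlation. This is a sketch assuming the standard formulas (slope equal to r times s_y / s_x, intercept equal to the mean of y minus slope times the mean of x). The data are hypothetical and `least_squares_line` is an illustrative name.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b, r): intercept, slope, and correlation for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b, r

# Hypothetical data for illustration only.
xs = [2, 4, 6, 8]
ys = [3, 7, 8, 12]

a, b, r = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}")
```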

Demonstration
An outbreak of the deadly Ebola virus in 2002 and 2003 killed 91 of the 95 gorillas in 7 home ranges in the Congo. To study the spread of the virus, measure “distance” by the number of home ranges separating a group of gorillas from the first group infected. Here are data on distance and number of days until deaths began in each later group.

Demonstration
Solution in Excel

Facts About Least-Squares Regression
• The distinction between explanatory and response variables is essential in regression.
• The slope of the least-squares line and the correlation always have the same sign. (A change of one standard deviation in $x$ corresponds to a change of $r$ standard deviations in $y$.)
• The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$ on the graph of $y$ against $x$.
• The square of the correlation, $r^2$, is the fraction of the variation in the values of $y$ that is explained by the least-squares regression of $y$ on $x$.
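
Three of the facts above can be verified numerically on any data set. The sketch below uses made-up data with a negative association. The helper `fit` and the variable names are assumptions for this example, and "fraction of variation explained" is computed as 1 minus the ratio of squared residuals to total squared deviations from the mean.

```python
from statistics import mean, stdev

def fit(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / ((n - 1) * s_x * s_y))
    b = r * s_y / s_x
    return y_bar - b * x_bar, b, r

# Hypothetical data with a negative association.
xs = [1, 2, 3, 4, 5, 6]
ys = [10, 9, 7, 8, 4, 3]

a, b, r = fit(xs, ys)
x_bar, y_bar = mean(xs), mean(ys)

# Fact: slope and correlation share a sign.
same_sign = (b < 0) == (r < 0)

# Fact: the line passes through the point of means.
passes_through_means = abs((a + b * x_bar) - y_bar) < 1e-9

# Fact: r squared is the fraction of variation in y explained by the line.
total_variation = sum((y - y_bar) ** 2 for y in ys)
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
explained_fraction = 1 - unexplained / total_variation

print(same_sign, passes_through_means)
print(round(r ** 2, 4), round(explained_fraction, 4))
```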

Residual
A residual is the difference between an observed value of the response variable and the value predicted by the regression line:
residual $= y - \hat{y}$
Note:
• The mean of the least-squares residuals is always zero.
• A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
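
Residuals are straightforward to compute once a line has been fitted. The following sketch uses hypothetical data, fits the least-squares line, lists each residual (observed minus predicted), and confirms that the residuals average to zero up to rounding error. The function name `least_squares` is illustrative.

```python
from statistics import mean

def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # observed minus predicted

# A text stand-in for a residual plot: residual against the explanatory variable.
for x, e in zip(xs, residuals):
    print(f"x = {x}   residual = {e:+.2f}")

mean_is_zero = abs(mean(residuals)) < 1e-9
print(mean_is_zero)
```

In a real analysis the pairs (x, residual) would be drawn as a scatterplot. A curved or fan-shaped pattern there suggests that a straight line is not a good description of the data.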

Exercise 3
Do heavier people burn more energy? We will use these data to illustrate influence.
• Make a scatterplot of the data that is suitable for predicting metabolic rate from body mass, with two new points added.
  Point A: mass 42 kilograms, metabolic rate 1500 calories.
  Point B: mass 70 kilograms, metabolic rate 1400 calories.

Exercise 3 (continued)
• Add three least-squares regression lines to your plot: for the original 12 women, for the original women plus Point A, and for the original women plus Point B. Which new point is more influential for the regression line? Explain in simple language why each new point moves the line in the way your graph shows.
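
The comparison asked for in Exercise 3 can be sketched in code. The original data for the 12 women are not reproduced in this transcript, so the `masses` and `rates` lists below are invented stand-ins chosen only to show a roughly linear pattern. Points A and B are the ones given in the exercise. The function name `least_squares` is illustrative.

```python
from statistics import mean

def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# INVENTED stand-in data (not the textbook's): body mass in kg, metabolic rate in calories.
masses = [33, 36, 40, 42, 43, 45, 48, 50, 51, 52, 54, 55]
rates = [950, 1000, 1090, 1150, 1170, 1240, 1290, 1350, 1380, 1400, 1440, 1480]

point_a = (42, 1500)  # from the exercise: inside the range of masses, unusually high rate
point_b = (70, 1400)  # from the exercise: far outside the range of masses

_, slope_original = least_squares(masses, rates)
_, slope_with_a = least_squares(masses + [point_a[0]], rates + [point_a[1]])
_, slope_with_b = least_squares(masses + [point_b[0]], rates + [point_b[1]])

print(round(slope_original, 2), round(slope_with_a, 2), round(slope_with_b, 2))
```

With these stand-in data, Point B changes the slope far more than Point A. Point A sits near the middle of the masses, so it mostly lifts the line slightly. Point B is an outlier in the x direction, so it acts like a lever and pulls the right end of the line toward itself.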

Influential Observation
An observation is influential for a statistical calculation if removing it would noticeably change the result of the calculation.
Ex. Points that are outliers in either the $x$ or $y$ direction of a scatterplot are often influential for the correlation.
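
The effect of a single outlier on the correlation can be seen by computing r with and without it. The data below are made up for illustration, and `correlation` is a helper written for this sketch.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r between two lists of equal length."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / ((n - 1) * s_x * s_y))

# Hypothetical data: six points close to a straight line.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0]

# One extra point that is an outlier in the x direction and off the pattern.
outlier = (12, 1.0)

r_without = correlation(xs, ys)
r_with = correlation(xs + [outlier[0]], ys + [outlier[1]])
print(round(r_without, 3), round(r_with, 3))
```

Here one added observation turns a correlation near 1 into a weak negative correlation, which is what "influential" means in practice.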

Cautions About Correlation & Regression
• Correlation and regression lines describe only linear relationships. You can do the calculations for any relationship between two quantitative variables, but the results are useful only if the scatterplot shows a linear pattern.
• Correlation and least-squares regression lines are not resistant. Always plot your data and look for observations that may be influential.

Extrapolation
Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable that you used to obtain the line. Such predictions are often not accurate.
Ex. Predicting the height of a 25-year-old person based on a regression line fitted to data on a child’s growth between 3 and 8 years of age. (The predicted height is 8 feet.)
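
The height example can be imitated with invented numbers. The ages and heights below are not real measurements; they were chosen to resemble steady childhood growth of a bit under 3 inches per year. The sketch fits a line to ages 3 through 8 and then applies it at age 25.

```python
from statistics import mean

def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# INVENTED data: a child's age in years and height in inches.
ages = [3, 4, 5, 6, 7, 8]
heights = [37, 40, 43, 45.5, 48, 50.5]

a, b = least_squares(ages, heights)

inside_range = a + b * 6     # age 6 is within the data, so this is reasonable
extrapolated = a + b * 25    # age 25 is far outside the data

print(round(inside_range, 1), "inches at age 6")
print(round(extrapolated, 1), "inches at age 25, about",
      round(extrapolated / 12, 1), "feet")
```

The line fits the childhood data well, yet at age 25 it predicts a height of roughly 8 feet, because growth does not continue at the childhood rate.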

Remarks
Beware of Lurking Variables:
A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
Association Does Not Imply Causation:
An association between an explanatory variable $x$ and a response variable $y$, even if it is very strong, is not by itself good evidence that changes in $x$ actually cause changes in $y$.
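
A small simulation can show how a lurking variable produces a strong association without any causal link. In this sketch, which is entirely hypothetical, a hidden variable drives both `x` and `y`, and neither of the two affects the other. The seed, sample size, and noise level are arbitrary choices.

```python
import random
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r between two lists of equal length."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return (sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
            / ((n - 1) * s_x * s_y))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# The lurking variable influences both x and y; x never enters the formula for y.
lurking = [rng.gauss(0, 1) for _ in range(n)]
x = [z + rng.gauss(0, 0.5) for z in lurking]
y = [z + rng.gauss(0, 0.5) for z in lurking]

r_xy = correlation(x, y)

# Subtracting the lurking variable's contribution removes the association.
r_adjusted = correlation([xi - z for xi, z in zip(x, lurking)],
                         [yi - z for yi, z in zip(y, lurking)])

print(round(r_xy, 2), round(r_adjusted, 2))
```

The first correlation is strong even though changing `x` would do nothing to `y`. Once the hidden variable is accounted for, the association essentially disappears.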

Exercise 4
How strongly do physical characteristics of sisters and brothers correlate? Here are data on the heights (in inches) of 11 adult pairs:
• Find the correlation and the equation of the least-squares line for predicting sister’s height from brother’s height. Make a scatterplot of the data and add the regression line to your plot.

Exercise 4 (continued)
• Adam is 70 inches tall. Predict the height of his sister Kim, in inches, by substituting $x = 70$ into the least-squares line.

Questions?

Summary
• Correlation
• Regression Line
• Least-Squares Regression Line
• High Correlation Does NOT Imply Causation
• Beware of Lurking Variables and Avoid Extrapolation

Exercise 5 (Attendance Quiz)
Because elderly people may have difficulty standing to have their heights measured, a study looked at predicting overall height from height to the knee. Here are the data (in centimeters) for five elderly men. What is the equation of the least-squares regression line for predicting height from knee height?