STAT 110 - Section 5 Lecture 19. Professor Hao Wang University of South Carolina Spring 2012. Last time: normal density curve. Chapter 14 – Describing Relationships.
Chapter 14 – Describing Relationships • Most statistical studies examine data on more than one variable. The steps for describing two variables at once are the same as those we used earlier in the semester with just one variable: • Plot the data. • Look for overall patterns and deviations from those patterns. • Use numerical summaries.
Scatterplots scatterplot – shows the relationship between two quantitative variables measured on the same individuals • Values of one variable appear on the x-axis. This is typically the one doing the explaining – the explanatory, predictor, or independent variable. • Values of the other variable appear on the y-axis. This is typically the one being explained – called the response or dependent variable.
Scatterplot Example: When water flows across farmland, some of the soil is washed away, resulting in erosion. An experiment was conducted to investigate the effect of the rate of water flow on the amount of soil washed away. Flow is measured in liters/second and the eroded soil is measured in kilograms.
flow rate (liters/second): .31 .85 1.26 2.47 3.75
eroded soil (kilograms): .82 1.95 2.18 3.01 6.07
Scatterplot • Is there an explanatory variable? • What’s the response variable? • Which variable should be on the x-axis?
Measuring Strength Through Correlation A Linear Relationship Correlation, represented by the letter r: an indicator of how closely the values fall to a straight line. It measures linear relationships only; that is, it measures how close the individual points in a scatterplot are to a straight line.
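The slides do not give a formula for r, so the following is only a minimal sketch of the usual Pearson correlation, applied to the soil-erosion data from the scatterplot example. The function name `correlation` and the variable names are illustrative choices, not part of the lecture.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data from the soil-erosion example.
flow_rate = [0.31, 0.85, 1.26, 2.47, 3.75]    # liters/second
eroded_soil = [0.82, 1.95, 2.18, 3.01, 6.07]  # kilograms

r = correlation(flow_rate, eroded_soil)
print(round(r, 3))
```

A value of r close to +1 indicates that the points fall close to an upward-sloping straight line.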
Example : Verbal SAT and GPA Scatterplot of GPA and verbal SAT score. The correlation is .485, indicating a moderate positive relationship. Higher verbal SAT scores tend to indicate higher GPAs as well, but the relationship is nowhere close to being exact.
Example: Husbands’ and Wives’ Ages and Heights Scatterplot of British husbands’ and wives’ heights (in millimeters); r = .36 Scatterplot of British husbands’ and wives’ ages; r = .94 Husbands’ and wives’ ages are likely to be closely related, whereas their heights are less likely to be so. Source: Marsh (1988, p. 315) and Hand et al. (1994, pp. 179-183)
Occupational Prestige and Suicide Rates Plot of suicide rate versus occupational prestige for 36 occupations. Correlation of .109 – there is not much of a relationship. If the outlier is removed, r drops to .018. Source: Labovitz (1970, Table 1) and Hand et al. (1994, pp. 395-396)
Example : Professional Golfers’ Putting Success Scatterplot of distance of putt and putting success rates. Correlation r = −.94. Negative sign indicates that as distance goes up, success rate goes down. Source: Iman (1994, p. 507)
Which one has r = -0.86 ?
Which one has r = 0.52 ? (A was -0.86)
Cricket chirps and temperature. Each day, the temperature was recorded, as well as the number of times a cricket chirped in 15 seconds. According to this scatterplot, what can be said about the relationship between cricket chirps and temperature? • There appears to be a strong positive linear relationship. • There appears to be a weak linear relationship because the slope of the line is fairly flat. • There can’t be a relationship; cricket chirps can’t be related to temperature. • None of the above.
Summary: Features of Correlations • r has no units and won’t change if we change the units of measurement • r ignores the distinction between explanatory and response variables • r is strongly affected by outliers
http://bcs.whfreeman.com/ips4e/cat_010/applets/CorrelationRegression.html
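The three features in this summary can be checked numerically. The sketch below reuses the soil-erosion data; the unit conversions and the extra outlying point (4.0, 0.5) are hypothetical additions made only for illustration, and the helper `correlation` is an assumed implementation of the usual Pearson formula.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

flow_l_per_s = [0.31, 0.85, 1.26, 2.47, 3.75]  # liters/second
soil_kg = [0.82, 1.95, 2.18, 3.01, 6.07]       # kilograms

r = correlation(flow_l_per_s, soil_kg)

# 1. Changing units (liters/second -> liters/minute, kilograms -> grams)
#    rescales both variables but leaves r unchanged.
flow_l_per_min = [60 * f for f in flow_l_per_s]
soil_g = [1000 * kg for kg in soil_kg]
r_units = correlation(flow_l_per_min, soil_g)

# 2. Swapping the explanatory and response variables gives the same r.
r_swapped = correlation(soil_kg, flow_l_per_s)

# 3. One hypothetical outlier (high flow, very little erosion) pulls r down sharply.
r_outlier = correlation(flow_l_per_s + [4.0], soil_kg + [0.5])

print(r, r_units, r_swapped, r_outlier)
```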
Chapter 15 – Describing Relationships regression line – a straight line that describes how a response variable y changes as an explanatory variable x changes • regression line summarizes a linear relationship between two variables • one variable helps explain or predict the other
Example The data to the right concern the relationship between the prevalence of a supposed fertility enhancer (X) and the population of Oldenburg, Germany, in thousands of people between 1930 and 1936. The original data can be found in: Ornithologische Monatsberichte, 44, No.2, Jahrgang, 1936, Berlin, and 48, No.1, Jahrgang, 1940, Berlin, and Statistisches Jahrbuch Deutscher Gemeinden, 27-33, Jahrgang, 1932-1938, Gustav Fischer, Jena.
X People
140 55.5
148 55.5
175 64.9
195 67.5
245 69.0
250 72.0
250 75.5
Example (cont’d) r = 0.941
Equation of a Line • The equation of a line is y = mx + b • m is the slope of the line • slope = the amount by which y changes when x increases one unit - a slope of zero means that there is no linear relationship between x and y • b is the intercept of the line • intercept = the value of y when x=0
Least Squares Regression Line least-squares regression line – the line that makes the sum of the squared vertical distances to the line as small as possible
Example – Fitting the Least Squares Line People = 35.49 + 0.1507 x Interpretation of the Slope: For every increase of 1 in X, we expect the population of Oldenburg, Germany, to increase by about 0.1507 thousand, or roughly 150 people.
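The slides report the fitted line without showing how it is computed. As a rough sketch, the standard closed-form least-squares formulas (slope = sum of products of deviations divided by sum of squared x deviations; intercept = mean of y minus slope times mean of x) can be applied to the Oldenburg data. The function name `least_squares_line` is an illustrative choice.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Oldenburg data from the example slide.
x_level = [140, 148, 175, 195, 245, 250, 250]
people = [55.5, 55.5, 64.9, 67.5, 69.0, 72.0, 75.5]  # thousands of people

intercept, slope = least_squares_line(x_level, people)
print(f"People = {intercept:.2f} + {slope:.4f} x")
```

The result should agree with the equation on the slide up to rounding.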
Prediction • Three Things to Understand about Prediction: • Prediction is based on fitting some “model” to a set of data. • Prediction works best when the model fits the data closely. • Prediction outside the range of the available data is risky. This is called an extrapolation.
Prediction Example People = 35.49 + 0.1507 x Using this equation to estimate the mean population of Oldenburg, Germany, for an X level of 200, we have 35.49 + 0.1507(200) = 65.63 So, we estimate the mean population of Oldenburg, Germany (1930-1936), to be 65.63 thousand people for an X level of 200.
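The prediction arithmetic, together with the earlier warning about extrapolation, can be sketched as follows. The coefficients are the rounded values from the slide; the range 140 to 250 is the range of X in the example data; the function name `predict_people` and the test value X = 400 are hypothetical.

```python
INTERCEPT = 35.49          # rounded intercept from the slide
SLOPE = 0.1507             # rounded slope from the slide
X_MIN, X_MAX = 140, 250    # smallest and largest X in the example data

def predict_people(x):
    """Return (predicted population in thousands, True if x is an extrapolation)."""
    prediction = INTERCEPT + SLOPE * x
    is_extrapolation = not (X_MIN <= x <= X_MAX)
    return prediction, is_extrapolation

for x in (200, 400):
    prediction, risky = predict_people(x)
    note = "extrapolation, treat with caution" if risky else "within the data range"
    print(f"X = {x}: {prediction:.2f} thousand people ({note})")
```

The line will produce a number for any X, so the range check is the only thing that signals when a prediction falls outside the data the model was fitted to.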
Correlation and Regression r² – the fraction of the variation in the values of y that is explained by the least-squares regression of y on x In the example, r² = 0.8857 (the square of r ≈ 0.941, before rounding), so about 89% of the variation in the population is explained by the regression using X.
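One way to see why the squared correlation is read as a "fraction of variation explained" is to compare it with 1 minus the ratio of leftover variation (squared vertical distances to the line) to total variation in y. The sketch below does this for the Oldenburg data using the standard least-squares formulas; the variable names are illustrative.

```python
x_level = [140, 148, 175, 195, 245, 250, 250]
people = [55.5, 55.5, 64.9, 67.5, 69.0, 72.0, 75.5]  # thousands of people

n = len(x_level)
mean_x = sum(x_level) / n
mean_y = sum(people) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_level, people))
sxx = sum((x - mean_x) ** 2 for x in x_level)
syy = sum((y - mean_y) ** 2 for y in people)  # total variation in y

# Least-squares line of People on X.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variation left unexplained: squared vertical distances from the points to the line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(x_level, people))

r_squared = sxy ** 2 / (sxx * syy)   # square of the correlation
fraction_explained = 1 - sse / syy   # share of variation accounted for by the line

print(round(r_squared, 4), round(fraction_explained, 4))
```

The two quantities agree, which is the sense in which r² measures how much of the variation in y the regression accounts for.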
Causation • The moral of the story is: • Only experimentation can show causation! • When dealing with regression and/or correlation, NEVER say that one variable causes another. • Snake bites and ice cream sales are highly correlated. Does that mean that one causes the other?
I would not trust a prediction from this regression for a car with City MPG of 10 because: A - The linear model doesn’t seem to fit the data B – It would be extrapolating C – It would have a large error because the points are very spread out around the line.
If a car gets 20 MPG in the city, how many MPG do you predict it will get on the highway? A – 0.896 B – 17.91 C – 26.97 D – 29.06
Each time the City MPG increases by one, what do we predict happens to the highway MPG ? Goes… A – Down 9.06 B – Down 0.896 C – Up 0.896 D – Up 9.06
What % of the variation in Highway MPG is explained by the regression using City MPG? A – 1.77% B – 89.1% C – 93% D – 94.39%
Suppose an algebra professor found that the correlation between study time (in hours) and exam score (out of 100) is +.80, and the regression line was found to be y = 20 + 4x. He arrived at this equation through years of collecting data on his students, most of whom reported studying anywhere from 0 to 20 hours for his exams. For which values of study time does the professor’s regression equation make sense in terms of predicting exam scores? a. Between 0 and 20 hours. b. Between 0 and 100 hours. c. Anything greater than or equal to 0 hours. d. It is not possible to predict exam score with study time.
Suppose the professor later found out that his correlation was not +.80, but rather it was +.08. How does this change the predictions he can make about exam scores based on study time? a. You have to take the results and divide them by 10, because .80/10 = .08. b. It won’t change the predictions because the regression line stays the same. c. The predictions should no longer be used because they won’t be very accurate. d. Not enough information to tell.