Explore the connection between age and a blood test measure through a study, interpreting scatterplots, and calculating correlation coefficients.
Chapter 4 Describing Bivariate Numerical Data Created by Kathy Fritz
Forensic scientists must often estimate the age of an unidentified crime victim. Prior to 2010, this was usually done by analyzing teeth and bones, and the resulting estimates were not very reliable. A study described in the paper “Estimating Human Age from T-Cell DNA Rearrangements” (Current Biology [2010]) examined the relationship between age and a measure based on a blood test. Age and the blood test measure were recorded for 195 people ranging in age from a few weeks to 80 years. A scatterplot of the data appears to the right. Do you think there is a relationship? If so, what kind? If not, why not? The line drawn through the scatterplot can be used to estimate the age of a crime victim from a blood test.
Correlation: Pearson’s Sample Correlation Coefficient; Properties of r
Does it look like there is a relationship between the two variables? Yes. If so, is the relationship linear? Yes.
Does it look like there is a relationship between the two variables? Yes. If so, is the relationship linear? Yes.
Does it look like there is a relationship between the two variables? Yes. If so, is the relationship linear? No, it looks curved.
Does it look like there is a relationship between the two variables? Yes. If so, is the relationship linear? No, it looks parabolic.
Does it look like there is a relationship between the two variables? No.
Linear relationships can be either positive or negative in direction. Are these linear relationships positive or negative? Answers: negative; positive.
When the points in a scatterplot tend to cluster tightly around a line, the relationship is described as strong. Try to order the scatterplots (A, B, C, D) from the strongest relationship to the weakest. These four scatterplots were constructed using data from graphs in Archives of General Psychiatry (June 2010). Answer: A, C, B, D
Pearson’s Sample Correlation Coefficient • Usually referred to as just the correlation coefficient • Denoted by r • Measures the strength and direction of a linear relationship between two numerical variables. The strongest values of the correlation coefficient are r = +1 and r = -1. The weakest value of the correlation coefficient is r = 0. An important definition!
Properties of r 1. The sign of r matches the direction of the linear relationship: r is positive for a positive relationship and r is negative for a negative relationship.
Properties of r 2. The value of r is always greater than or equal to -1 and less than or equal to +1. [Figure: scale of r values marked as strong, moderate, and weak correlation]
Properties of r 3. r = 1 only when all the points in the scatterplot fall on a straight line that slopes upward. Similarly, r = -1 when all the points fall on a downward sloping line.
Properties of r 4. r is a measure of the extent to which x and y are linearly related. Find the correlation coefficient for these points, then sketch the scatterplot. r = 0. Does this mean that there is NO relationship between these points? r = 0, but the data set has a definite relationship!
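A small numerical check can make property 4 concrete. The sketch below uses a made-up data set (not the points from the slide, which are not reproduced in this text) in which y is exactly the square of x, and a hypothetical helper `correlation` written from the usual definition of r.

```python
import math

def correlation(xs, ys):
    """Pearson's sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is completely determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(f"r = {r:.3f}")
```

Here r is zero even though every y-value is determined exactly by its x-value, so r = 0 rules out a linear relationship only, not a relationship of every kind.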
Properties of r 5. The value of r does not depend on the unit of measurement for either variable. Calculate r for the data set of mares’ weights and the weights of their foals: r = -0.00359. Change the mare weights to pounds by multiplying kg by 2.2 and calculate r again: r = -0.00359.
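The unit-invariance property can be checked numerically. This is a minimal sketch with invented weights (the mare and foal data from the slide are not reproduced here); the helper `correlation` is a hypothetical function written from the standard definition of r.

```python
import math

def correlation(xs, ys):
    """Pearson's sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented weights in kilograms, for illustration only.
mare_kg = [500, 520, 540, 560, 580, 600]
foal_kg = [110, 95, 120, 100, 115, 105]

# Convert only the mare weights to pounds, as the slide does.
mare_lb = [2.2 * w for w in mare_kg]

r_kg = correlation(mare_kg, foal_kg)
r_lb = correlation(mare_lb, foal_kg)
print(f"r with kg: {r_kg:.5f}")
print(f"r with lb: {r_lb:.5f}")
```

The two values agree apart from floating-point rounding, because multiplying every x-value by a positive constant rescales the numerator and the denominator of r by the same factor.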
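Calculating Correlation Coefficient The correlation coefficient is calculated using the following formula: $r = \dfrac{\sum z_x z_y}{n - 1}$, where $z_x = \dfrac{x - \bar{x}}{s_x}$ and $z_y = \dfrac{y - \bar{y}}{s_y}$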
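The calculation can be sketched in a few lines of code, assuming the usual z-score form of the formula (sum of the products of z-scores divided by n - 1). The data values and the function name `correlation_from_z_scores` are made up for illustration.

```python
import statistics

def correlation_from_z_scores(xs, ys):
    """r = (sum of z_x * z_y) / (n - 1), using sample standard deviations."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # divisor n - 1
    z_x = [(x - mean_x) / s_x for x in xs]
    z_y = [(y - mean_y) / s_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation_from_z_scores(xs, ys)
print(f"r = {r:.4f}")
```

Standardizing both variables first is what makes r unit-free: each z-score says how many standard deviations an observation lies from its mean.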
The web site www.collegeresults.org (The Education Trust) publishes data on U.S. colleges and universities. The following six-year graduation rates and student-related expenditures per full-time student for 2007 were reported for the seven primarily undergraduate public universities in California with enrollments between 10,000 and 20,000. Here is the scatterplot: Does the relationship appear linear? Explain.
College Expenditures Continued: To compute the correlation coefficient, first find the z-scores. To interpret the correlation coefficient, use the definition – There is a positive, moderate linear relationship between six-year graduation rates and student-related expenditures.
How the Correlation Coefficient Measures the Strength of a Linear Relationship [Scatterplot divided into regions by the mean of x and the mean of y] z_x is negative, z_y is positive, z_x z_y is negative; z_x is positive, z_y is positive, z_x z_y is positive; z_x is negative, z_y is negative, z_x z_y is positive. Will the sum of z_x z_y be positive or negative?
How the Correlation Coefficient Measures the Strength of a Linear Relationship [Scatterplot divided into regions by the mean of x and the mean of y] z_x is negative, z_y is positive, z_x z_y is negative; z_x is positive, z_y is positive, z_x z_y is positive; z_x is negative, z_y is negative, z_x z_y is positive; z_x is negative, z_y is positive, z_x z_y is negative. Will the sum of z_x z_y be positive or negative?
How the Correlation Coefficient Measures the Strength of a Linear Relationship Will the sum of z_x z_y be positive, negative, or zero?
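To see how the signs of the individual products drive the sign of r, the sketch below splits the z-score products into their positive and negative parts. The data are hypothetical and chosen to show a mostly positive trend; `z_scores` is an illustrative helper, not something from the slides.

```python
import statistics

def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical data with a mostly positive trend.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]

products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]

# Points where x and y are on the same side of their means contribute
# positive products; points on opposite sides contribute negative products.
positive_part = sum(p for p in products if p > 0)
negative_part = sum(p for p in products if p < 0)

r = sum(products) / (len(xs) - 1)
print(f"positive part: {positive_part:.3f}")
print(f"negative part: {negative_part:.3f}")
print(f"r = {r:.3f}")
```

When most points fall in the lower-left and upper-right regions, the positive products dominate and r is positive; when the two parts roughly cancel, r is near zero.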
Association does NOT imply causation. Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable? Consider the following examples: • The relationship between the number of cavities in a child’s teeth and the size of his or her vocabulary is strong and positive. So does this mean I should feed children more candy to increase their vocabulary? No: these variables are both strongly related to the age of the child. • Consumption of hot chocolate is negatively correlated with crime rate. Should we all drink more hot chocolate to lower the crime rate? No: both are responses to cold weather. Causality can only be shown by carefully controlling values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.
Linear Regression: Least Squares Regression Line
Suppose there is a relationship between two numerical variables. Let x be the amount spent on advertising and y be the amount of sales for the product during a given period. You might want to predict product sales (y) for a month when the amount spent on advertising is $10,000 (x). The letter y is used to denote the variable you want to predict, called the response variable (or dependent variable). The other variable, denoted by x, is the predictor variable (sometimes called the independent or explanatory variable).
The equation of a line is: y = a + bx Where: b is the slope of the line; it is the amount by which y increases when x increases by 1 unit. a is the intercept (also called the y-intercept or vertical intercept); it is the height of the line above x = 0. In some contexts, it is not reasonable to interpret the intercept.
The Deterministic Model We often say x determines y. Notice that the y-value is determined by substituting the x-value into the equation of the line. Also notice that the points fall on the line. But, when we fit a line to data, do all the points fall on the line?
How do you find an appropriate line for describing a bivariate data set? To assess the fit of a line, we look at how the points deviate vertically from the line. For the line y = 10 + 2x, the point (15,44) has a deviation of +4. What is the meaning of this deviation? What is the meaning of a negative deviation? To assess the fit of a line, we need a way to combine the n deviations into a single measure of fit.
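As a worked check of the deviation quoted on this slide, using the line y = 10 + 2x and the point (15, 44):

```latex
\[
\text{deviation} = y - (10 + 2x) = 44 - (10 + 2 \cdot 15) = 44 - 40 = +4
\]
```

A positive deviation means the observed point lies above the line; a negative deviation means it lies below the line.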
Least squares regression line The most widely used measure of the fit of a line y = a + bx to bivariate data is the sum of the squared deviations about the line. The least squares regression line is the line that minimizes the sum of squared deviations.
Let’s investigate the meaning of the least squares regression line. Suppose we have a data set that consists of the observations (0,0), (3,10) and (6,2). Use a calculator to find the least squares regression line. Find the vertical deviations from the line: -3 at (0,0), 6 at (3,10), and -3 at (6,2). What is the sum of the deviations from the line? Will the sum always be zero? Find the sum of the squares of the deviations from the line: sum of the squares = 54. Hmmmmm . . . Why does this seem so familiar? The line that minimizes the sum of squared deviations is the least squares regression line.
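The calculator steps for this three-point data set can be sketched in code. The helper `least_squares_line` is a hypothetical function using the standard slope and intercept formulas; the comparison line at the end is an arbitrary choice made only to show that another line does worse.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# The three observations from the slide.
xs = [0, 3, 6]
ys = [0, 10, 2]

a, b = least_squares_line(xs, ys)
deviations = [y - (a + b * x) for x, y in zip(xs, ys)]
sum_dev = sum(deviations)
sum_sq = sum(d ** 2 for d in deviations)

# An arbitrary alternative line (same intercept, slope nudged by 0.1)
# for comparison; its sum of squared deviations should be larger.
other_sum_sq = sum((y - (a + (b + 0.1) * x)) ** 2 for x, y in zip(xs, ys))

print(f"line: y = {a:.3f} + {b:.3f}x")
print("deviations:", [round(d, 3) for d in deviations])
print(f"sum of deviations: {sum_dev:.3f}")
print(f"sum of squares: {sum_sq:.3f} (alternative line: {other_sum_sq:.3f})")
```

The fitted line has intercept 3 and slope 1/3, the deviations are -3, 6, and -3, and their squares add to 54, matching the values on the slide.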
Pomegranate, a fruit native to Persia, has been used in the folk medicines of many cultures to treat various ailments. Researchers are now investigating whether pomegranate's antioxidant properties are useful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded for several points in time. For the mice assigned to plain water, x = number of days after injection of cancer cells and y = average tumor volume (in mm³).
x: 11 15 19 23 27
y: 150 270 450 580 740
Sketch a scatterplot for this data set (horizontal axis: number of days after injection; vertical axis: average tumor volume).
Computer software and graphing calculators can calculate the least squares regression line. Interpretation of slope: The average volume of the tumor increases by approximately 37.25 mm³ for each one-day increase in the number of days after injection. Does the intercept have meaning in this context? Why or why not?
Pomegranate study continued Predict the average volume of the tumor for 20 days after injection. Predict the average volume of the tumor for 5 days after injection. Can volume be negative? This is the danger of extrapolation. The least squares line should not be used to make predictions for y using x-values outside the range in the data set. Why? It is unknown whether the pattern observed in the scatterplot continues outside the range of x-values.
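The fit and the two predictions can be reproduced from the five data points given for the plain-water group. This is a sketch, not the software output from the slides; `least_squares_line` and `predict` are hypothetical helper names.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Data from the slide: days after injection and average tumor volume (mm^3).
days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]

a, b = least_squares_line(days, volume)

def predict(x):
    """Predicted average tumor volume, flagged when x is outside the data range."""
    inside = min(days) <= x <= max(days)
    return a + b * x, inside

pred_20, inside_20 = predict(20)
pred_5, inside_5 = predict(5)

print(f"line: y = {a:.2f} + {b:.2f}x")
print(f"day 20: {pred_20:.2f} mm^3 (inside data range: {inside_20})")
print(f"day 5: {pred_5:.2f} mm^3 (inside data range: {inside_5})")
```

The slope comes out as 37.25, matching the interpretation given earlier, with intercept -269.75. The prediction for day 20 is 475.25 mm³, while day 5 lies outside the observed range of 11 to 27 days and gives -83.5 mm³, an impossible negative volume.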
Why is the line used to summarize a linear relationship called the least squares regression line? This terminology comes from the relationship between the least squares line and the correlation coefficient. If r = 1, what do you know about the location of the points?
Why is the line used to summarize a linear relationship called the least squares regression line? What would happen if r = 0.4? . . . 0.3? . . . 0.2?
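The relationship referred to here is not written out in this text. A standard way to express it, stated as a known identity rather than as the slide's own wording, is that the least squares slope equals r times the ratio of the standard deviations:

```latex
\[
b = r\,\frac{s_y}{s_x}, \qquad a = \bar{y} - b\,\bar{x}
\quad\Longrightarrow\quad
\hat{y} = \bar{y} + r\,\frac{s_y}{s_x}\,(x - \bar{x})
\]
\[
\text{If } x = \bar{x} + s_x, \text{ then } \hat{y} = \bar{y} + r\,s_y .
\]
```

So when x is one standard deviation above its mean, the predicted y is only r standard deviations above its mean. With r = 1 the prediction is a full standard deviation above, and with r = 0.4, 0.3, or 0.2 it is pulled progressively closer to the mean of y.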
If you want to predict x from y, can you use the least squares line of y on x? The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction.
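The difference between the two lines can be seen by fitting both to the same data. The sketch below reuses the three observations (0,0), (3,10), and (6,2) from the earlier example; `least_squares_line` is a hypothetical helper, and the y-value of 10 is an arbitrary choice for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

xs = [0, 3, 6]
ys = [0, 10, 2]

# Line of y on x: minimizes squared deviations in the y direction.
a_yx, b_yx = least_squares_line(xs, ys)
# Line of x on y: minimizes squared deviations in the x direction.
a_xy, b_xy = least_squares_line(ys, xs)

y_value = 10
x_from_inverting = (y_value - a_yx) / b_yx  # solving y = a + bx for x
x_from_x_on_y = a_xy + b_xy * y_value       # using the x-on-y line

print(f"inverting the y-on-x line: x = {x_from_inverting:.3f}")
print(f"using the x-on-y line:     x = {x_from_x_on_y:.3f}")
```

For these points, solving the y-on-x line for x gives 21, far outside the observed x-values, while the x-on-y line gives about 3.64. The two answers differ because each line minimizes squared deviations in a different direction.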
Assessing the Fit of a Line: Residuals; Residual Plots; Outliers and Influential Points; Coefficient of Determination; Standard Deviation about the Line
Assessing the fit of a line Once the least squares regression line is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y. Important questions are: • Is the line an appropriate way to summarize the relationship between x and y ? • Are there any unusual aspects of the data set that you need to consider before proceeding to use the least squares regression line to make predictions? • If you decide that it is reasonable to use the line as a basis for prediction, how accurate can you expect predictions to be? This section will look at graphical and numerical methods to answer these questions.
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters. [Scatterplot: distance traveled versus distance to debris] Calculate the predicted y and the residuals. If the point is above the line the residual will be positive. If the point is below the line the residual will be negative.
Residual plots A careful look at the residuals can reveal many potential problems. A residual plot is a graph of the residuals. • A residual plot is a scatterplot of the (x, residual) pairs. • Residuals can also be graphed against the predicted y-values • Isolated points or a pattern of points in the residual plot indicate potential problems.
Deer mice continued Plot the residuals against the distance from debris (x)
Deer mice continued Are there any isolated points? Is there a pattern in the points? The points in the residual plot appear scattered at random. This indicates that a line is a reasonable way to describe the relationship between the distance from debris and the distance traveled.
Deer mice continued Residual plots can be plotted against either the x-values or the predicted y-values.
Residual plots continued Let’s examine the accompanying data on x = height (in inches) and y = average weight (in pounds) for American females, ages 30-39 (from The World Almanac and Book of Facts). The scatterplot appears rather straight. The residual plot, however, displays a definite curved pattern. Even though r = 0.99, it is not accurate to say that weight increases linearly with height.
Let’s examine the data set for 12 black bears from the Boreal Forest. x = age (in years) and y = weight (in kg) Sketch a scatterplot with the fitted regression line. Do you notice anything unusual about this data set? One observation has an x-value that differs greatly from the others in the data set. What would happen to the regression line if this point is removed? If the point affects the placement of the least-squares regression line, then the point is considered an influential point.
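The effect of an influential point can be illustrated by fitting a line with and without it. The numbers below are invented for illustration and are not the black bear data, which are not reproduced in this text; `least_squares_line` is a hypothetical helper.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Invented data: the last point has an x-value far from the rest.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 6, 8, 10, 5]

a_with, slope_with = least_squares_line(xs, ys)
a_without, slope_without = least_squares_line(xs[:-1], ys[:-1])

print(f"slope with the extreme point:    {slope_with:.3f}")
print(f"slope without the extreme point: {slope_without:.3f}")
```

Removing the single point with the extreme x-value changes the slope from nearly flat to 2, which is the behavior that marks a point as influential.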
Black bears continued Notice that this observation falls far away from the regression line in the y direction. An observation is an outlier if it has a large residual.
Coefficient of Determination Suppose that you would like to predict the price of houses in a particular city from the size of the house (in square feet). There will be variability in house price, and it is this variability that makes accurate price prediction a challenge. If you know that differences in house size account for a large proportion of the variability in house price, then knowing the size of a house will help you predict its price. • The coefficient of determination is the proportion of variation in y that can be attributed to an approximate linear relationship between x and y • Denoted by r² • The value of r² is often converted to a percentage.
Let’s explore the meaning of r² by revisiting the deer mouse data set. x = the distance from the food to the nearest pile of fine woody debris y = distance a deer mouse will travel for food Suppose you didn’t know any x-values. What distance would you expect deer mice to travel? To find the total amount of variation in the distance traveled (y) you need to find the sum of the squares of these deviations from the mean. Why do we square the deviations? Total amount of variation in the distance traveled (y) is SSTo = 773.95 m²
Deer mice continued x = the distance from the food to the nearest pile of fine woody debris y = distance a deer mouse will travel for food [Scatterplot: distance traveled versus distance to debris, with the least squares regression line] Now let’s find how much variation there is in the distance traveled (y) from the least squares regression line. To find this amount of variation, find the sum of the squared residuals. Why do we square the residuals? The amount of variation in the distance traveled (y) from the least squares regression line is SSResid = 526.27 m²
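The two sums of squares quoted for the deer mouse data can be combined into the coefficient of determination. The sketch assumes the standard relationship r² = 1 - SSResid/SSTo, which is not stated in the text above; the variable names are illustrative.

```python
# Sums of squares quoted on the slides (in square meters).
ss_total = 773.95   # SSTo: variation of y about its mean
ss_resid = 526.27   # SSResid: variation of y about the least squares line

# Assumed standard formula: proportion of variation in y attributed
# to the approximate linear relationship with x.
r_squared = 1 - ss_resid / ss_total

print(f"r^2 = {r_squared:.3f} ({r_squared:.1%})")
```

Under that assumption r² is about 0.32, so roughly 32% of the variation in distance traveled can be attributed to the approximate linear relationship with distance to debris.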