AP Statistics Topic 3 Summary of Bivariate Data
Overview of this Topic • In Topic 2 we studied the analysis of univariate data • Now we will study bivariate data • Observe 2 characteristics on each observational unit • Graphically display the data • Scatterplot • Form, Strength and Direction • Numerical summary of the data • Pearson’s Correlation Coefficient, • Coefficient of Determination, and • Least Squares Regression Line • Least squares regression lines from data and from Minitab output
Scatterplots • Visual display of bivariate data • Describe the scatterplot • Form: Linear or non-linear • Direction: Positive, negative or random • Strength: Strong, Moderate, Weak • Look for clustering of data • Look for data points that fall far from the body of data • Context • We can incorporate qualitative info into our scatterplots
Pearson’s Correlation Coefficient • We’ve collected bivariate data and created a scatterplot • We’ve described our scatterplot • Now we’re ready to perform some numerical summaries • The first is Pearson’s Correlation Coefficient -- r
Pearson’s Correlation Coefficient • Pearson’s Correlation Coefficient – more often referred to as the correlation coefficient – is a statistic, r. • The Population Correlation Coefficient, ρ (rho), is a parameter. • r is a numerical measure of the strength of the linear association between the variables we are studying.
Properties of r • The value of r does not depend on the unit of measurement for either variable. • The value of r does not depend on which of the two variables is considered the independent variable. • The value of r is between +1 and -1. • r=+1/-1 when data points lie exactly on a straight line that slopes upward/downward. • The value of r is a measure of the extent to which the variables are linearly related • Correlation does not imply causation.
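The properties above can be checked numerically. The following is a minimal sketch, not the TI-83/84 routine: the data are made up for illustration, and `pearson_r` is a hypothetical helper name that computes r from the usual sums of squares and cross-products.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                           # swap which variable is "independent"
r_rescaled = pearson_r([100 * v for v in x], y)       # change the units of x (e.g. m to cm)
r_exact_line = pearson_r(x, [3 - 2 * v for v in x])   # points exactly on a downward-sloping line

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, rescaled = {r_rescaled:.4f}")
print(f"r for points on a downward line = {r_exact_line:.1f}")
```

Swapping the variables or changing units leaves r unchanged, and points that lie exactly on a downward-sloping line give r = -1.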
Calculating r • Output of running the LinReg function on the TI-83/84 • Corr program
Fitting a line • Up to now • We’ve collected data • Created a scatterplot • Described the scatterplot • Summarized linear association with r • Now we’d like to summarize our linear scatterplot with a straight line • Since we’ll be using this line for predictions, it now becomes important to identify an explanatory and response variable
How do we fit a ‘best line’ • Our line should be a good summary of our data and will be used as a prediction tool • We’d like a line that goes through all our data points • If that’s not possible, at least a line that is close to all the points
Sketch a Scatterplot • Sketch a scatterplot for the following data, fit a line that you think is a best fitting line • There are many lines that can be chosen that would be a reasonable fit • A standardized approach is essential so different analysts working with the same set of data will produce the same fits
Least Squares Regression Line • Least squares regression (LSR) is the most commonly used method for finding the best fit line • As the name implies, it is the line whose sum of squared vertical deviations is the smallest among all possible lines
Least Squares Regression Line • Terminology • LSRL: $\hat{y} = a + bx$ • Characteristics • $\sum (y - \hat{y})^2$ is minimized • Passes through $(\bar{x}, \bar{y})$ • Close connection between correlation and slope: $b = r \, s_y / s_x$
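As a rough illustration of these characteristics, the sketch below fits a line to a small made-up data set using the standard least squares formulas (slope = Sxy/Sxx, intercept = mean of y minus slope times mean of x). The data and the helper name `sse` are assumptions for illustration, not part of the original slides.

```python
import math

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # explanatory variable
y = [2.0, 4.0, 5.0, 4.0, 5.0]   # response variable

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx              # slope of the LSRL
a = mean_y - b * mean_x    # intercept of the LSRL

def sse(intercept, slope):
    """Sum of squared vertical deviations of the data from a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Correlation and sample standard deviations, to check b = r * s_y / s_x
r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))

print(f"LSRL: y-hat = {a:.2f} + {b:.2f} x")
print(f"Value of the line at x-bar: {a + b * mean_x:.2f} (y-bar = {mean_y:.2f})")
print(f"r * s_y / s_x = {r * s_y / s_x:.2f}")
print(f"SSE at LSRL = {sse(a, b):.3f}, with slope nudged = {sse(a, b + 0.1):.3f}")
```

Nudging either coefficient away from the least squares values increases the sum of squared deviations, which is what "least squares" means.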
So what’s important here? • Identify explanatory and response variables • The form of your model • Determine the coefficients • Interpret the coefficients
So how do we calculate the coefficients ? • By hand? What, are you nuts? • We’ll always use the graphing calculator (or use a MINITAB output) • So let’s do a regression on the graphing calculator • And you’ll do a LSRL activity for homework
LSR on TI-83 • Identify the explanatory and response variables and enter the data into lists • Create your scatterplot • Now calculate your LSRL • STAT – CALC – LINREG (a+bx) – ENTER • X-list – comma – Y-list – comma – VARS – YVARS – Function – Y1 – ENTER -- ENTER • Now print your scatterplot – and the LSRL will also be graphed
Interpretation of Coefficients • Form of your LSRL • define your variables • Predicted Resp Var = a + b (Explan Var) • Interpretation of b • b is the slope of your LSRL • Basic Algebra I interpretation • Interpretation of a • a is the y-intercept • Basic Algebra I interpretation • Put your interpretation into the context of the problem • Sometimes the a value ‘makes sense’, other times it doesn’t – no practical interpretation
Assessing the fit of a line • So far we’ve collected bivariate data • We’ve displayed the data in a scatterplot • We’ve calculated the correlation coefficient • We found a best fit line using the method of least squares regression
Assessing the fit of a line • Now we want to assess the fit of our best fit line • Are there any unusual aspects of the data that we want to address before we use our line to predict? • How accurate can we expect our predictions based on the regression to be?
Residuals • We’re going to use the residuals created by the LSR to assess the fit, and • To give us a sense for how accurate our predictions will be
What are the residuals? • What are the residuals? • Where are they stored in our calculator?
Assessing the fit of a line • We create and inspect the residual plot for our regression • The residual plot is a scatterplot of the (x, residual) pairs • What are we looking for? • Isolated points or patterns in the residual plot indicate potential problems
Why the residual plot? • Why do we inspect the residual plot and not just the scatterplot of our original data? • Sometimes it’s easier to see curvature or problem points in a residual plot. • Heights and Weights of American Women example
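To see why residuals can reveal curvature, here is a small sketch using made-up data that follow a curve (it is not the heights and weights data). The helper `fit_line` is a hypothetical name for the usual least squares formulas.

```python
def fit_line(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical curved data (y = x squared), for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]

a, b = fit_line(x, y)

# residual = observed y minus predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# The residual plot is a scatterplot of these (x, residual) pairs
residual_plot_points = list(zip(x, residuals))
for xi, res in residual_plot_points:
    print(f"x = {xi:.0f}, residual = {res:+.1f}")
```

The residuals run positive, negative, then positive again. That U-shaped pattern signals curvature that a straight line does not capture. Residuals from a least squares line always sum to zero, so it is the pattern, not the total, that matters.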
To summarize … • We determine the appropriateness of our least squares regression line by inspection of the residual plot • We look for patterns and unusual values • RANDOMNESS = GOOD • PATTERNS/UNUSUAL VALUES = BAD
Assess accuracy of line • Once we determine that our LSR line is an appropriate summary of our data, we want to assess the accuracy of predictions • Are predictions made with the additional variable better than predictions without knowledge of the second variable?
Two numerical measures • The two numerical measures we use in this part of our assessment are • Coefficient of Determination • Standard deviation about the regression line
Coefficient of Determination • The Coefficient of Determination provides the proportion of variability that is explained by the LSRL – that is the proportion of variation attributed to the linear association between the two variables. • It is one of the outputs when we run our regression. • Let’s digress …
Let’s say … • Let’s use our post knee surgery range of motion data • Let’s say we wanted to predict range of motion – but only had the range of motion data – not the ages of the patients • How?
Now let’s say … • You also have additional information – the ages of the patients • We’ve already seen that there is a linear association between age and range of motion • When we fit a line to this bivariate data, how do the residuals compare? • SSR vs SST
Coefficient of Determination • If we look at the ratio of SSR (the sum of squared residuals about the LSRL) to SST (the total sum of squared deviations about the mean of y), can you interpret this? • Now subtract this value from 1 – now give me an interpretation of this value • It’s equal to R-sq – Coefficient of Determination
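The calculation can be sketched in a few lines. The range of motion data are not reproduced here, so the numbers below are made up. `ss_resid` and `ss_total` are hypothetical names for the sum of squared residuals about the line and the sum of squared deviations about the mean of y.

```python
# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x

# Prediction error without x: use the mean of y as the prediction for everyone
ss_total = sum((yi - mean_y) ** 2 for yi in y)

# Prediction error with x: use the least squares line
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

unexplained = ss_resid / ss_total   # proportion of variation NOT explained by the line
r_sq = 1 - unexplained              # coefficient of determination

print(f"SS residual = {ss_resid:.2f}, SS total = {ss_total:.2f}")
print(f"R-sq = {r_sq:.2f}")
```

With these numbers, the line leaves 40% of the variation in y unexplained, so R-sq is 0.60.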
Standard Deviation about the LSRL • The CoD measures the extent of variation about the best fit line relative to the overall variation in y. • A high R-sq value does not promise that the deviations from the line are small in an absolute sense. • The Standard Deviation about the LSRL – s_e – is the typical amount an observation deviates from the LSRL
Standard Deviation about the LSRL • This value is analogous to the sample standard deviation, with deviations measured from the LSRL rather than from the mean
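A minimal sketch of this calculation follows, assuming the common textbook definition in which the sum of squared residuals is divided by n - 2 before taking the square root. The data are made up, and the variable names are illustrative.

```python
import math

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x

ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Assumed definition: divide by n - 2 because two coefficients (a and b) were estimated
s_e = math.sqrt(ss_resid / (n - 2))

# For comparison: the ordinary sample standard deviation of y divides by n - 1
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

print(f"s_e = {s_e:.3f} (typical deviation from the line)")
print(f"s_y = {s_y:.3f} (typical deviation from the mean of y)")
```

Unlike R-sq, s_e is in the units of the response variable, so it says how far off a prediction is likely to be in absolute terms.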
Minitab Output • Minitab is a desktop statistics software package • You are responsible for reading and using Minitab output • Things you can be expected to perform: • Determine LSR equation • Determine se • Determine R-sq • Determine R-sq from ANOVA table
Non-linear Relationships and Transformations • If our data has a linear pattern in a scatterplot, our approach is to summarize the data as a line using the method of least squares • What if our data does not have a linear pattern – as depicted in the scatterplot or the residual plot • Look at the ‘Rice Paddy’ data (L1, L2)
Rice Paddy • Is the time between flowering and harvesting related to the yield of a paddy?
Nonlinear Relationships • We have 2 choices • We can fit a nonlinear regression • Transform the data • Ex: Nonlinear regression for the ‘Rice Paddy’ data, inspect the residuals and R-sq
Transforming the Data • An alternative is to transform the data so the transformed data scatterplot is linear, • And then perform a linear regression • We then ‘back-transform’ our data when we use the LSRL for predictions.
Let’s look at an example • Ex: River Velocity and Distance from Shore • Is it reasonable to fit a line to this data? • If we wanted to transform our data • Which variable to transform ? • What type of transformation ?
River Water Velocity • As fans of white-water rafting know, a river flows more slowly close to its banks. Let’s look at the relationship between water velocity (cm/sec) and distance from shore (m).
Transformations • Which variable to transform ? • Explanatory and/or Response variable • Transform and then redraw scatterplot • Which transformation to use ? • Use the transformation ladder and the shape of your scatterplot • Transform the data, inspect the scatterplot and follow the steps for LSR • Use your LSR to make a prediction • What is the velocity of the river 9 m from the bank?
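The workflow above can be sketched as follows. The river measurements are not reproduced here, so the numbers are made up, and the square root transformation of the explanatory variable is an assumption chosen only to illustrate the steps. The `fit_line` helper is a hypothetical name.

```python
import math

def fit_line(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, NOT the river data from the slides
distance = [1.0, 4.0, 16.0, 25.0, 36.0]    # explanatory variable (m)
velocity = [15.0, 20.0, 30.0, 35.0, 40.0]  # response variable (cm/sec)

# Step 1: transform the explanatory variable only (assumed: square root)
sqrt_distance = [math.sqrt(d) for d in distance]

# Step 2: run an ordinary least squares regression on the transformed data
a, b = fit_line(sqrt_distance, velocity)

# Step 3: predict at 9 m; apply the same transformation to the new x value
predicted_velocity = a + b * math.sqrt(9.0)

print(f"velocity-hat = {a:.1f} + {b:.1f} * sqrt(distance)")
print(f"Predicted velocity 9 m from the bank: {predicted_velocity:.1f} cm/sec")
```

Because only the explanatory variable was transformed, the prediction comes out directly in the original units of the response.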
Let’s do another example • In the previous problem, we only transformed the explanatory variable. Consequently the ‘back transformation’ was easy. • Let’s do another example where we transform both variables, run a regression and then make a prediction. • Tortilla Chips: Relationship between Frying Time and Moisture Content
Tortilla Chips • No tortilla chip lover likes soggy chips, so it’s important to find characteristics of the production process that produce chips with an appealing texture. The data on frying time (sec) and moisture content (%) are given below
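When both variables are transformed, the prediction must be back-transformed into the original units of the response. The sketch below shows the steps with made-up numbers that are not the tortilla chip measurements. Taking the natural log of both variables is an assumption chosen for illustration, and `fit_line` is a hypothetical helper.

```python
import math

def fit_line(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, NOT the tortilla chip data from the slides
x = [1.0, 2.0, 4.0, 8.0]    # explanatory variable
y = [16.0, 8.0, 4.0, 2.0]   # response variable

# Step 1: transform both variables (assumed: natural log of each)
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]

# Step 2: fit the LSRL to the transformed data: ln(y)-hat = a + b * ln(x)
a, b = fit_line(log_x, log_y)

# Step 3: predict on the transformed scale for a new x value
new_x = 5.0
predicted_log_y = a + b * math.log(new_x)

# Step 4: back-transform to the original units of y (exp undoes ln)
predicted_y = math.exp(predicted_log_y)

print(f"ln(y)-hat = {a:.3f} + ({b:.3f}) * ln(x)")
print(f"Predicted y at x = {new_x}: {predicted_y:.2f}")
```

Skipping step 4 is the usual mistake: the regression predicts ln(y), not y.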