Notes: Bivariate Data, Chapters 7 - 9
Bivariate Data Explores relationships between two quantitative variables.
The explanatory variable attempts to explain the observed outcomes. (In algebra this is your independent variable – “x”)
The response variable measures an outcome of a study. (In algebra this is your dependent variable – “y”)
When we gather data, we usually have in mind which variables are which. • Beware! – this explanatory/response relationship suggests a cause and effect relationship that may not exist in all data sets. Use common sense!!
A Lurking Variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables being studied. • Lurking variables can suggest a relationship when there isn’t one or can hide a relationship that exists.
Displaying the Variables • We always graph our data, right? • Use a scatterplot to graph the relationship between two quantitative variables. Each point represents an individual.
Remember that not all bivariate relationships are linear!!! We will talk about non-linear in the next unit.
Interpret a Scatterplot • Here is what we look for: • 1) direction (positive, negative) (D) • 2) form (linear or not linear) (S) • 3) strength (correlation, r) (S) • 4) deviations from the pattern (outliers) (U) • Rearrange the letters to remember them: SUDS!!
Remember, an outlier is an individual observation that falls outside the overall pattern of the graph. • There is no outlier test for bivariate data. It’s a judgment call.
Categorical variables can be added to scatterplots by changing the symbols in the plot. (See P. 199 for examples) • Visual inspection is often not a good judge of how strong a linear relationship is. Changing the plotting scales or the amount of white space around a cloud of points can be deceptive. So….
Facts about Correlation: • 1) positive r – positive association (positive slope) negative r – negative association (negative slope) • 2) r must fall between –1 and 1 inclusive. • 3) r values close to –1 or 1 indicate that the points lie close to a straight line. • 4) r values close to 0 indicate a weak linear relationship. • 5) r values of –1 or 1 indicate a perfect linear relationship. • 6) correlation only measures the strength in linear relationships (not curves). • 7) correlation can be strongly affected by extreme values (outliers).
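The facts above can be checked with a short Python sketch. The `correlation` helper and all of the data values below are made up for illustration; they are not from the notes.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
line_up = [2 * v + 1 for v in x]     # perfect line, positive slope
line_down = [10 - 3 * v for v in x]  # perfect line, negative slope
curve = [(v - 3) ** 2 for v in x]    # U-shaped curve, clearly related but not linear

r_up = correlation(x, line_up)       # fact 5: perfect positive line
r_down = correlation(x, line_down)   # fact 5: perfect negative line
r_curve = correlation(x, curve)      # fact 6: r misses the curved pattern

# Fact 7: one extreme point can change r a lot.
scattered_y = [3, 1, 4, 2, 3]
r_weak = correlation(x, scattered_y)
r_with_outlier = correlation(x + [20], scattered_y + [20])

print(r_up, r_down, r_curve)
print(round(r_weak, 2), round(r_with_outlier, 2))
```

Here the curved data has a clear pattern yet r is 0, and adding a single extreme point turns a weak correlation into a strong one.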
Least-Squares Regression Line • The least-squares regression line (LSRL) is a mathematical model for the data. • This line is also known as the line of best fit or the regression line.
Formal definition… • The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
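A minimal sketch of that definition, assuming the standard formulas slope = Sxy / Sxx and intercept = (mean of y) - slope × (mean of x). The function names and the data are hypothetical, chosen only to illustrate the idea.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_vertical_distances(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
best = sum_squared_vertical_distances(x, y, a, b)

# Any other line should give a larger sum; try two nudged lines.
steeper = sum_squared_vertical_distances(x, y, a, b + 0.1)
higher = sum_squared_vertical_distances(x, y, a + 0.5, b)

print(f"LSRL: predicted y = {a:.1f} + {b:.1f}(x)")
print(f"sum of squares: LSRL {best:.2f}, steeper {steeper:.2f}, higher {higher:.2f}")
```

Nudging the slope or the intercept away from the least-squares values increases the sum of squares, which is what "as small as possible" means.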
Why do we do regression? • The purpose of regression is to determine a model that we can use for making predictions.
Communication is always the goal!!! • When we write the equation for a LSRL we do not use x & y, we use the variable names themselves… • For example: • Predicted score = 52 + 1.5(hours studied)
Another measure of strength… • The coefficient of determination, r², is the fraction of the variation in the values of y that is explained by the linear model. • When we explain r², we say… ___% of the variability in ___(y) can be explained by this linear model.
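One way to see where that fraction comes from, sketched in Python. This assumes the usual computation r² = 1 - (variation left over around the line) / (total variation in y); the data and names are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x, y)
mean_y = sum(y) / len(y)

# Variation in y left unexplained by the line (squared residuals).
unexplained = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
# Total variation in y around its mean.
total = sum((yi - mean_y) ** 2 for yi in y)

r_squared = 1 - unexplained / total
print(f"{r_squared:.0%} of the variability in y can be explained by this linear model.")
```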
Deviations for single points • A residual is the vertical difference between an actual point and the LSRL at one specific value of x. That is, Residual = observed y – predicted y, or Residual = y – ŷ • The mean of the residuals from the LSRL is always zero.
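A short sketch of the residual calculation, using made-up data and a hypothetical `fit_line` helper. It also checks the claim that the residuals from a least-squares line average to zero.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x, y)

# Residual = observed y - predicted y, one per data point.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]
mean_residual = sum(residuals) / len(residuals)

for xi, yi, yhat, res in zip(x, y, predicted, residuals):
    print(f"x={xi}  observed={yi}  predicted={yhat:.1f}  residual={res:+.1f}")
print(f"mean residual = {mean_residual:.10f}")
```

Points above the line have positive residuals and points below it have negative residuals.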
A new plot… • A residual plot plots the residuals on the vertical axis against the explanatory variable on the horizontal axis. • Such a plot magnifies the residuals and makes patterns easier to see.
Why do I need a residual plot? • Remember that not all data are linear in shape!!! The residual plot clearly shows whether a linear model is appropriate. • A residual plot shows a good linear fit when the points are randomly scattered about y = 0 with no obvious patterns.
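As a rough illustration without drawing the plot, the sketch below fits a line to made-up curved data and prints the sign of each residual in order of x. The helper names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals_from_line(xs, ys):
    """Residuals (observed y - predicted y) from the least-squares line."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Made-up curved data: y = x squared.
x = [1, 2, 3, 4, 5]
curved_y = [v ** 2 for v in x]

curved_residuals = residuals_from_line(x, curved_y)
# Sign of each residual, left to right along the x axis.
signs = "".join("+" if r > 0 else "-" for r in curved_residuals)

print(curved_residuals)
print(signs)
```

The residuals are positive at both ends and negative in the middle. That U-shaped pattern around 0 is what a residual plot would show when a straight line is fitted to curved data.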
To create a residual plot on the calculator: • 1) You must have done a linear regression with the data you wish to use. • 2) From the Stat-Plot, Plot # menu, choose scatterplot and leave the x-list with the x values. • 3) Change the y-list to “RESID”, chosen from the list menu. • 4) Zoom – 9
In scatterplots we can have points that are outliers or influential points or both. • An observation can be an outlier in the x direction, the y direction, or in both directions. • An observation is influential if removing it (or adding it) would markedly change the position of the regression line.
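A small sketch of what "influential" looks like in numbers: fit the line with and without one point that is an outlier in the x direction. The data and the extra point are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Hypothetical extra observation, far from the others in the x direction.
extra_x, extra_y = 15, 2

intercept_without, slope_without = fit_line(x, y)
intercept_with, slope_with = fit_line(x + [extra_x], y + [extra_y])

print(f"without the point: y-hat = {intercept_without:.2f} + {slope_without:.2f}(x)")
print(f"with the point:    y-hat = {intercept_with:.2f} + {slope_with:.2f}(x)")
```

Adding this one point flips the slope from positive to negative, so the point is influential.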
Extrapolation is the use of a regression model for prediction outside the domain of values of the explanatory variable x. • Such predictions cannot be trusted.
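To illustrate, here is a sketch using the example equation Predicted score = 52 + 1.5(hours studied). The observed range of 0 to 10 hours is an assumption made up for this example; the notes do not give the data behind the equation.

```python
# Assumed (hypothetical) range of hours studied in the data used to fit the line.
OBSERVED_MIN_HOURS = 0.0
OBSERVED_MAX_HOURS = 10.0

def predicted_score(hours_studied):
    """Example regression equation: predicted score = 52 + 1.5(hours studied)."""
    return 52 + 1.5 * hours_studied

def is_extrapolation(hours_studied):
    """True when the x value lies outside the range of the observed data."""
    return not (OBSERVED_MIN_HOURS <= hours_studied <= OBSERVED_MAX_HOURS)

for hours in (4, 40):
    note = "EXTRAPOLATION, do not trust" if is_extrapolation(hours) else "within the data"
    print(f"{hours} hours -> predicted score {predicted_score(hours)} ({note})")
```

If scores were out of 100 (again an assumption), the prediction for 40 hours would be impossible, which shows why predictions outside the observed x values cannot be trusted.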
Association vs. Causation • A strong association between two variables is NOT enough to draw conclusions about cause & effect.
Association vs Causation • Strong association between two variables x and y can reflect: • A) Causation – a change in x causes a change in y • B) Common response – both x and y are responding to some other unobserved factor • C) Confounding – the effect on y of the explanatory variable x is hopelessly mixed up with the effects on y of other variables.
Association vs Causation • Cause and Effect can only be determined from a well designed experiment.
Data with no apparent linear relationship can also be examined in two ways to see if a relationship still exists: • 1) Check to see if breaking the data down into subsets or groups makes a difference. • 2) If the data is curved in some way and not linear, a relationship still exists. We will explore that in the next chapter.
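The first check can be sketched with made-up data in which two groups have opposite trends, so the combined data shows no linear relationship at all. The group labels, values, and `correlation` helper are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: two groups measured on the same x values.
group_a_x, group_a_y = [1, 2, 3], [1, 2, 3]  # y rises with x
group_b_x, group_b_y = [1, 2, 3], [3, 2, 1]  # y falls with x

r_all = correlation(group_a_x + group_b_x, group_a_y + group_b_y)
r_a = correlation(group_a_x, group_a_y)
r_b = correlation(group_b_x, group_b_y)

print(f"all data: r = {r_all:.2f}")
print(f"group A:  r = {r_a:.2f}")
print(f"group B:  r = {r_b:.2f}")
```

Taken together the data has r = 0, but each group on its own is perfectly linear, so breaking the data into subsets reveals relationships the combined scatterplot hides.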