Scatterplots

Learn how to interpret scatterplots, identify outliers and influential points, and avoid misleading impressions of data. Explore the form, direction, and strength of associations between variables.

scholz

Presentation Transcript


  1. Scatterplots Image from Minitab Website

  2. Learning Objectives By the end of this lecture, you should be able to: • Describe what a scatterplot is • Be comfortable with the terms explanatory variable and response variable • Describe a scatterplot in terms of form, direction, and strength • Define what is meant by an “outlier” and “influential point” (in terms of a scatterplot), and how you might identify them • Recognize why poorly chosen scales on a scatterplot can give misleading impressions of the data

  3. Examining Relationships Up to this point, we have focused on single-variable (“univariate”) data, e.g. women’s heights, percentage of Hispanics in each state, SAT scores, etc. Much of statistical analysis involves looking at the relationship between two or more variables. For example, we may be interested in the relationship between the number of beers people consumed at a party and their resulting blood alcohol level (BAC). With the proper statistical tools we can try to determine things like: • IS there a relationship? That is, does the number of beers truly affect blood alcohol level? • If there is a relationship, can we predict how the quantity of beer consumed affects the BAC? A human flaw: It is tempting to just intuitively assume that there is a relationship between two variables. However, this can lead to some highly erroneous conclusions. As humans, we LOVE to assume stuff, find patterns that don’t truly exist, and then jump to conclusions. This is a very well-known evolutionary flaw in the human brain and we should be aware of it. We will discuss this topic in more detail as we progress through the course.

  4. Here, we have two quantitative variables for each of 16 students (n=16). 1) How many beers each student drank 2) The blood alcohol level (BAC) of each student after consuming those beers We are interested in the relationship between the two variables: How is one variable affected by changes in the other variable?

  5. Looking for relationships between variables • Always start with a graph (if possible) • Hopefully this detail is becoming increasingly obvious to you! • Look for • An overall pattern • Deviations from the pattern (deviations such as outliers are sometimes the most interesting part!) • If appropriate, try to provide both verbal and numerical descriptions of the data and the pattern.

  6. Scatterplots In a scatterplot, each axis represents one of the two variables, and each observation is plotted as a single point on the graph.
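
  As a rough illustration of "one point per observation", the sketch below draws a scatterplot as a grid of text characters. The data values and the helper name text_scatterplot are made up for this example (they are not the lecture's 16-student dataset), and a real analysis would use a plotting package instead.

```python
# Made-up illustrative data (NOT the lecture's dataset): one (x, y) pair per student.
beers = [1, 2, 3, 4, 5, 6, 7, 8]                          # x: explanatory variable
bac = [0.01, 0.03, 0.04, 0.06, 0.08, 0.09, 0.11, 0.13]    # y: response variable


def text_scatterplot(xs, ys, width=30, height=10):
    """Return a list of strings in which each (x, y) pair is drawn as '*'.

    Hypothetical helper for illustration only. x grows to the right and
    y grows upward, as on a conventional scatterplot.
    """
    x_min, y_min = min(xs), min(ys)
    # Guard against a zero span (all values identical) to avoid dividing by zero.
    x_span = (max(xs) - x_min) or 1
    y_span = (max(ys) - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"   # row 0 of the grid is the TOP of the plot
    return ["".join(line) for line in grid]


rows = text_scatterplot(beers, bac)
print("\n".join(rows))
```

  Each student contributes exactly one star: its horizontal position comes from the number of beers and its vertical position from the BAC.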

  7. Explanatory and response variables A response variable measures or records an outcome of a study. An explanatory variable explains (“causes”) the changes in the response variable. Which variable should go on which axis? Typically, the explanatory variable is plotted on the x axis, and the response variable is plotted on the y axis. In the example plot: y axis: Blood Alcohol Content (response variable); x axis: Number of Beers (explanatory variable).

  8. Terminology: Dependent / Independent • Instead of explanatory / response, you may often encounter the terms independent and dependent. • Independent for Explanatory • Dependent for Response • They are pretty much interchangeable, although there is a subtle difference. It is more accurate to use the terms explanatory and response, so I would like you to focus on those terms. • You will occasionally see SPSS use dependent/independent.

  9. Which variable should be the explanatory, and which the response? • The variable from which you are trying to predict the change in the other variable should be the explanatory variable. • (This is why it is frequently called the ‘independent’ variable. But as was just mentioned, there is a subtle distinction between the terms which we may discuss at a later point.) • The variable that gets changed in response to changes in the explanatory variable (i.e. “responds” to the explanatory variable) is the response variable. • Example: • Exercise vs. calories burned? • Answer: If in your analysis you are trying to predict or analyze the number of calories burned as a result of exercise, then exercise would be the explanatory variable, and calories burned would be the response variable. • Exam score vs. hours studying? • Answer: If we are trying to predict or analyze the scores on an exam as a result of studying, then “hours studying” would be the explanatory variable, and exam score would be the response variable.

  10. Describing Scatterplots • Much in the same way we describe a single variable’s distribution in terms of its shape, center, spread, etc., we should also be able to describe a scatterplot. • When describing a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern … • Form: linear (a straight line), curved, clusters, no pattern • Direction: positive, negative, no direction • Strength: how closely the points fit the “form”

  11. Form of an association: Linear / Nonlinear / No Relationship [Figure: three example scatterplots, labeled Linear, No relationship, and Nonlinear]

  12. Direction of a linear association: Positive or Negative If a relationship is linear, it is given a directional description of Positive or Negative. Positive association: High values of one variable tend to occur together with high values of the other variable. Negative association: High values of one variable tend to occur together with low values of the other variable. Again, note that we only describe the direction of the relationship when the relationship is linear.
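
  The wording "high values of one variable tend to occur together with high (or low) values of the other" can be turned into a crude check. The sketch below is only an illustration, not a method from this lecture: it splits the observations at the median x and compares the average y on each side. The function name direction and the data are made up, and the check is only meaningful once a plot has already suggested that the form is linear.

```python
from statistics import mean, median


def direction(xs, ys):
    """Crude direction check (hypothetical helper, for illustration only).

    Compares the average y among observations with above-median x to the
    average y among observations with below-median x.
    """
    x_mid = median(xs)
    high_x = [y for x, y in zip(xs, ys) if x > x_mid]
    low_x = [y for x, y in zip(xs, ys) if x < x_mid]
    if not high_x or not low_x:
        return "no direction"          # not enough spread in x to compare
    diff = mean(high_x) - mean(low_x)
    if diff > 0:
        return "positive"              # high x goes with high y
    if diff < 0:
        return "negative"              # high x goes with low y
    return "no direction"


# Made-up data for illustration.
hours = [1, 2, 3, 4, 5, 6]
rising = [2, 4, 5, 7, 9, 10]
falling = [10, 9, 7, 5, 4, 2]
flat = [5, 5, 5, 5, 5, 5]

print(direction(hours, rising))    # positive
print(direction(hours, falling))   # negative
print(direction(hours, flat))      # no direction
```

  Note that noisy data with no real relationship will almost never give a difference of exactly zero, so this check cannot replace looking at the plot.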

  13. Scatterplot Direction: No Relationship Sometimes there isn’t any relationship: X and Y may vary, but are independent of each other. Knowing a value for X tells you nothing about the value for Y. We describe this as “No relationship”.

  14. Scatterplot: Strength of the association The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form. With a strong relationship, you can get a pretty good estimate of y if you know x. With a weak relationship, for any x you might get a wide range of y values. (You could probably make a reasonable argument that the relationship of this plot isn’t even linear.)
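
  One informal way to see "for any x you might get a wide range of y values" is to look at how far apart the y values are among observations that share the same x. The sketch below does exactly that. It is an illustration under assumptions, not the quantitative technique promised for the later lecture; the helper name y_range_by_x and both datasets are made up.

```python
from collections import defaultdict


def y_range_by_x(xs, ys):
    """For each distinct x, return (largest y - smallest y) among its points.

    Hypothetical helper for illustration only: small ranges mean that knowing
    x pins y down well; large ranges mean it does not.
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: max(vals) - min(vals) for x, vals in sorted(groups.items())}


# Made-up data: two observations at each x value.
xs = [1, 1, 2, 2, 3, 3]
strong_ys = [10, 11, 20, 21, 30, 31]   # little scatter around the pattern
weak_ys = [5, 25, 10, 35, 12, 40]      # a lot of scatter around the pattern

print(y_range_by_x(xs, strong_ys))   # {1: 1, 2: 1, 3: 1}
print(y_range_by_x(xs, weak_ys))     # {1: 20, 2: 25, 3: 28}
```

  In the first dataset, knowing x narrows y to a window of 1 unit; in the second, the window is 20 units or more, so an estimate of y from x would be poor.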

  15. Strong or Weak Relationship? This looks like a reasonably strong relationship. The daily amount of gas consumed can be predicted pretty accurately for a given temperature value. This is a relatively weak relationship. For a particular state’s median household income, you can’t predict the state’s per capita income very well.

  16. Describing the strength • For now we are using the admittedly vague terms ‘strong, moderate, weak’. • In a subsequent lecture on scatterplots, we will learn a technique for quantifying the strength of a linear relationship between two variables.

  17. Describing/Interpreting scatterplots • As mentioned earlier, when you are asked to interpret a scatterplot, you should be familiar with these 3 terms in particular. • Form: linear, curved, clusters, no pattern • Direction: positive, negative, no direction • Strength: strong, moderate, weak • Note: Recall that if the relationship is not linear, we will not bother to describe direction or strength.

  18. Examples – Describe each plot • Form: Linear, Direction: positive, Strength: strong • Form: Linear, Direction: negative, Strength: moderate • Form: No relationship. Examining any particular x tells us nothing about y. As a result, the terms ‘positive/negative’ don’t apply. Neither does the strength.

  19. Examples • Form: Non-linear. Therefore, we don’t bother trying to describe direction or strength. • Form: Linear, Direction: positive, Strength: moderate • In our next lecture on scatterplots, we will discuss a tool for quantifying the strength of the relationship.

  20. Lying with Statistics: How (not) to scale a scatterplot Same data in all four plots! • Using an inappropriate scale for a scatterplot can give an incorrect impression. • Ideally, both variables should be given a similar amount of space: • Plot roughly square • Points should occupy most of the plot space In other words, if faced with this group of plots, you should be suspicious of most of them!
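
  The guideline "points should occupy most of the plot space" can be checked numerically. The sketch below is one possible approach under assumptions: it picks axis limits just outside the data and measures what fraction of an axis the data actually spans. The function names, the 5% padding, and the temperature values are all made up for illustration.

```python
def padded_limits(values, pad_fraction=0.05):
    """Return (axis_min, axis_max) slightly wider than the data range.

    Hypothetical helper: pads each side by pad_fraction of the data span
    (an arbitrary choice for this sketch).
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    pad = span * pad_fraction if span > 0 else 1.0   # fallback for constant data
    return lo - pad, hi + pad


def fraction_of_axis_used(values, axis_min, axis_max):
    """Fraction of the axis length that the data actually spans."""
    return (max(values) - min(values)) / (axis_max - axis_min)


# Made-up data for illustration.
temps = [20, 30, 40, 50, 60]

good_min, good_max = padded_limits(temps)
print(good_min, good_max)
print(fraction_of_axis_used(temps, good_min, good_max))   # close to 1: data fills the axis
print(fraction_of_axis_used(temps, 0, 400))               # 0.1: data squeezed into a corner
```

  Applying the same check to both axes, and drawing the plot roughly square, gives each variable a similar amount of space.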

  21. Outliers on Scatterplots An outlier is a data value that has a very low probability of occurrence (i.e., that particular observation is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the pattern of the relationship. This scatterplot appears to show a linear relationship between the two variables. The observation at the upper right is consistent with the linear relationship and therefore would not be considered an outlier. This plot also appears to show a linear relationship. However, the observation at approximately (7,2) is not consistent with the linear relationship, so it appears to be an outlier.

  22. Outliers? The upper right-hand point here is not an outlier of the relationship; it is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol. This point is not consistent with the relationship, so we would label it as an outlier.

  23. IQ score and Grade point average • Describe in words the purpose of this plot: • It is there to help us determine if there is a relationship between IQ score and GPA. • Describe the form, direction, and strength: • Form: linear • Direction: positive • Strength: appears somewhat weak • Outliers present? • There appear to be outliers, but it is hard to say definitively.

  24. IQ score and Grade point average Are there outliers present? The circled datapoints (and perhaps some of the others too) appear to be outliers. Still, it is hard to say. How do we decide? Recall that on a scatterplot that is showing a linear relationship, we consider a datapoint to be an outlier if it is way off the regression line (the line through the data points). If the regression line looks like the one here, then we would probably label these two observations as outliers.

  25. IQ score and Grade point average Are there outliers present? If the regression line looks like the one drawn here, then certainly the lower circled datapoint (and probably some of the others nearby as well) would be considered outliers.

  26. IQ score and Grade point average Suppose that we have a different regression line as shown here. Are there outliers present? If the regression line looks like the one drawn here, then the upper circled datapoint (and probably some of the others nearby as well) could be considered outliers. But the lower one would not be.
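
  The last few slides make the point that which observations count as outliers depends on which line is drawn through the data. The sketch below illustrates this with made-up points and two hand-picked candidate lines. The function name, the lines, and the cutoff of 2 units are assumptions for this example only; the lecture does not define a numerical cutoff.

```python
def outliers_for_line(points, slope, intercept, threshold):
    """Return the points whose vertical distance from the line exceeds threshold.

    Hypothetical helper for illustration: the line is y = slope * x + intercept,
    and threshold is an arbitrary cutoff chosen for this sketch.
    """
    flagged = []
    for x, y in points:
        predicted = slope * x + intercept
        if abs(y - predicted) > threshold:
            flagged.append((x, y))
    return flagged


# Made-up data for illustration.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 9), (6, 6)]

# Candidate line A: y = x
print(outliers_for_line(points, slope=1, intercept=0, threshold=2))
# [(5, 9)]

# Candidate line B: y = 2x - 1
print(outliers_for_line(points, slope=2, intercept=-1, threshold=2))
# [(4, 4), (6, 6)]
```

  The point (5, 9) is far from line A but sits exactly on line B, so the two "eyeballed" lines disagree about which observations are unusual.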

  27. WHICH line, then, is the “correct” regression line? Answer: As with other models we have discussed (e.g. density curves to summarize a histogram) we use a mathematical formula to draw regression lines. (We don’t just “eyeball it”!) We will discuss this topic in our next lecture on scatterplots.
