1. Measures of Center Lecture 3 William F. Hunt, Jr.
Stat 361
2. Numeric Measures Why?
Kim is in an introductory history class. On the midterm exam Kim scored 64 out of 100. Did she do well?
The class average was a 42.
By knowing the average for the class we can make a comparison. Kim's 64 is well above the class average of 42, so we can tell from the average that the score was not so bad.
3. Numeric Measures Allow us to make comparisons
Of individuals to the group
Of group to other groups
Measures of center
Give an idea about the main chunk of the data
Knowing numeric measures of our data set allows us to make comparisons. These comparisons allow us to compare individuals to the group, or to compare groups to each other. For instance, we might compare the means of the exam scores for two classes. If class 1 had an average of 42 and class 2 had an average of 60, we know that class 2 scored higher on average. We can do this using measures of center. These give us an idea of the main chunk of the data.
4. Measures of Central Tendency Mean: the average
Notation:
Population mean: μ “mu”
Sample mean: ȳ “y-bar”
We will begin by talking about the average, or what we call in this class the “mean”. You have been calculating the average since high school. We will not spend a large amount of time reviewing these calculations since they are easily done using a calculator. We will spend some time talking about notation. Recall that we have both a population and a sample that need to be summarized. We will need notation to keep these separate. For a population mean we will use the lower case Greek letter mu. For the sample mean we will use a lower case y with a bar over it and call it y-bar.
5. Measures of Central Tendency Summation Notation
We will also talk about summation notation. You may recall from your algebra class that a capital Greek letter sigma (Σ) indicates “summation.”
This is a shortcut way to say “add up all the values”. Later on this will become a useful shortcut when we write out some more complex things.
6. Measures of Central Tendency Summation Notation
This is a short cut way to say “add up all the values”. Later on this will become a useful shortcut when we write out some more complex things.
7. Measures of Central Tendency Summation Notation
The y with the subscript “i” is a shortcut way of listing the individuals. So y1 just means the first person's value, y2 the second person's value, and so forth. “n” is the sample size, so yn indicates the last person in the group.
8. Measures of Central Tendency
This notation gives us a shortcut way of presenting a formula for the mean of y. To calculate the mean of y we take the sum of the y’s and divide by the sample size.
9. Measures of Central Tendency
The average is the sum of the values divided by the sample size. This gives us a shortcut notation for the calculation of the average.
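Written out in the notation above, the formula described here is presumably the standard one for the sample mean (the indexing from i = 1 to n is an assumption based on the description of y1 through yn):

```latex
\bar{y} \;=\; \frac{1}{n}\sum_{i=1}^{n} y_i \;=\; \frac{y_1 + y_2 + \cdots + y_n}{n}
```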
10. Measures of Central Tendency Median: the middle value in a data set when the values are put in increasing order
50% of values above and 50% below
If there is an even number of observations, average the middle two.
Although the average is the most widely used measure of center, another measure of center that we often consider is the median. The median is the value that has half the data above it and half the data below it. We find it by sorting the values from smallest to largest and then choosing the middle value. As you may remember from a high school class, if there is an even number of observations we average the middle two values.
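The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `median` and the small made-up lists are assumptions, not part of the lecture.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # put the values in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: take the single middle value.
        return ordered[mid]
    # Even number of observations: average the middle two values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median([3, 1, 2])       # sorted: 1, 2, 3
even_example = median([4, 1, 3, 2])   # sorted: 1, 2, 3, 4
print(odd_example, even_example)
```

For the odd-length list the middle value is returned directly; for the even-length list the two middle values, 2 and 3, are averaged.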
11. Simple Example: A health researcher examined the amount of soda that a group of teenagers consumed during a day. The resulting amounts in ounces were: 9, 9, 6, 15, 12, 14, and 40.
Let's take the simple example that was given earlier, the amount of soda consumed by a group of teenagers. Let's calculate the mean of these values.
12. Simple Example: A health researcher examined the amount of soda that a group of teenagers consumed during a day. The resulting amounts in ounces were: 9, 9, 6, 15, 12, 14, and 40.
Mean: 15
The mean can be calculated by using the formula we examined earlier: sum the values and then divide by the number of observations. The summation of y means that we add each individual value of y. The sum of these values is 105. We have seven observations in our sample, so we divide by n, in other words by 7. Dividing 105 by 7 gives us the average of 15.
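As a quick check of the arithmetic, the same steps can be written in Python. The variable names are illustrative only.

```python
soda_oz = [9, 9, 6, 15, 12, 14, 40]   # soda amounts in ounces, from the slide

total = sum(soda_oz)   # add up all the values
n = len(soda_oz)       # sample size
mean = total / n       # sum divided by the number of observations

print(total, n, mean)
```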
13. Simple Example: Soda consumed
Median:
In increasing order: 6 9 9 12 14 15 40
The mean is often not near the middle of the data. In such cases we often calculate the median. The median for this data is found by arranging the data in increasing order from smallest to largest.
14. Simple Example: Soda consumed
Median: 12
In increasing order: 6 9 9 12 14 15 40
We then pick the middle value from this list. In this case the median is 12. Notice that the median has the same number of values above as below. If we want a representative value for this data, it might be more appropriate to think of the median rather than the mean.
15. Mean vs Median A health researcher examined the amount of soda that a group of teenagers consumed during a day. The resulting amounts in ounces were: 9, 9, 6, 15, 12, 14, and 40.
If we look back at the dot plot we created, we see how the mean and median compare in this data. The mean is at 15.
16. Mean vs Median A health researcher examined the amount of soda that a group of teenagers consumed during a day. The resulting amounts in ounces were: 9, 9, 6, 15, 12, 14, and 40.
The mean is at 15. Although this is near the main cluster of values, the large unusual value (40) pulls the mean up from the main cluster. A more representative value might be the median.
17. Mean vs Median A health researcher examined the amount of soda that a group of teenagers consumed during a day. The resulting amounts in ounces were: 9, 9, 6, 15, 12, 14, and 40.
The median is more representative than the mean. As we can see, the mean has been pulled in the direction of the outlier.
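One way to see the pull of the unusual value is to recompute both measures with the 40 left out. The sketch below uses Python's standard `statistics` module; dropping the 40 is done purely for illustration and is not a step from the lecture.

```python
import statistics

soda_oz = [9, 9, 6, 15, 12, 14, 40]
without_40 = [y for y in soda_oz if y != 40]    # illustration only

mean_all = statistics.mean(soda_oz)             # 15
median_all = statistics.median(soda_oz)         # 12
mean_rest = statistics.mean(without_40)         # 65 / 6, about 10.83
median_rest = statistics.median(without_40)     # (9 + 12) / 2 = 10.5

print(mean_all, median_all)
print(round(mean_rest, 2), median_rest)
```

Removing the single large value moves the mean by more than 4 ounces but moves the median by only 1.5 ounces.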
18. Problem with the mean: Sensitive to unusual values and skewed data
pulled away from the median
Skewed Right
Mean greater than median
Skewed left
Mean less than median.
Symmetric
Mean and median are about the same.
This data set illustrates the major difficulty with the mean. If a data set is skewed or has outliers, the mean will be pulled in the direction of the long tail or the outlier relative to the median. If the data are symmetric, the mean and median will be approximately the same.
19. Trimmed Mean A compromise between the average and the median.
Less sensitive to outliers.
Observations are ordered from smallest to largest.
A trimming percentage 100r% is chosen where r is a number between 0 and 0.5.
Suppose r = 0.1, so that the trimming percentage is 10%. Then if n = 20, 10% of 20 is 2: the trimmed mean results from deleting (trimming) the largest 2 observations and the 2 smallest, then averaging the remaining 16.
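A minimal sketch of the procedure in Python follows. The function name `trimmed_mean` and the made-up data are assumptions. The slide does not say what to do when r times n is not a whole number; rounding down is assumed here.

```python
def trimmed_mean(values, r):
    """Mean after trimming a fraction r from each end of the sorted data."""
    if not 0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    ordered = sorted(values)              # order from smallest to largest
    n = len(ordered)
    # Number of observations to drop from EACH end.
    # Assumption: round down; the tiny offset guards against float error.
    k = int(r * n + 1e-9)
    kept = ordered[k:n - k]
    if not kept:
        raise ValueError("nothing left after trimming")
    return sum(kept) / len(kept)

# Made-up data: the values 1..19 plus one unusually large value, so n = 20.
data = list(range(1, 20)) + [100]
plain = sum(data) / len(data)             # ordinary mean
trimmed = trimmed_mean(data, 0.1)         # drops 1, 2 and 19, 100
print(plain, trimmed)
```

With r = 0.1 and n = 20, two observations are dropped from each end, so the large value no longer affects the result.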
20. CoalEmissions Uncertainty Project (2009-10), Alissa Anderson, Colin Geisenhoffer, Brody Heffner, Michael Shaw & Emily Wisner
[Figure: two panels, “Before 2% Trim” and “After 2% Trim”]
21. Class Problem The following 10 observations on October snow cover for Eurasia during the years 1970-1979 (in million km2):
What would you report as a representative or typical value of October snow cover from this period, and what prompted your choice?
22. Why? Why do we want to know these values?
We can use these values to compare groups.
Example: The students in our simple example were from one particular school. Another sample was taken of students in another district. Their median was 18. We know that the first group of students typically drank less.
We have seen how means and medians can be found and how they are related in distributions. But many people spend much of their time on the calculations rather than remembering the point of finding these values. We need these measures to make comparisons. If we consider the data from our simple example, we can compare the soda consumption of our group with subjects in another school district. In the other district the median was 18 ounces. Comparing the median of our group, which was 12, with the 18, we know that our group typically consumed less soda.
23. Measures of Variability
24. Measures of Variability Why?
Tell us about consistency and predictability
Allow comparison of groups
Gives scale of reference to compare individuals
In this section we will consider measures of variability. Measures of variability will be used much like the measures of center in the previous section, in that they give us a way to compare groups, here in terms of consistency and predictability. As we begin our discussion, it is important to remember that the focus should not be the calculation of the values but why we need the values.
25. Measures of Variability Range: the difference between the maximum and the minimum
How spread out are the values
Soda Amounts: Range = 40 - 6 = 34
The most basic measure of variability is the range. The range is just the difference between the maximum and the minimum, summarized as a single number. For our simple example from earlier we can find the range for the amount of soda consumed by a group of teenagers. The range for this data is 34. Some people make the mistake of saying it is 6 to 40. The range as we use it is a single number.
26. Measures of Variability Problem: Range only looks at two values.
Does not quantify spread of the others.
Solution: Look at all values => How far are they from mean
Variance: summarizes the distance between all individuals and the mean
As we can see from the three examples you just looked at, the range has a downside. The range only considers two values. Data sets can look very different and have the same range. We need a measure that quantifies the spread of the data and looks at all the values. The variance is such a measure.
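To illustrate the point that data sets can look very different and still have the same range, here is a small Python sketch. The two data sets are made up for this illustration (they are not the examples from the lecture), and `statistics.variance` computes the sample variance.

```python
import statistics

# Two hypothetical data sets with the same minimum (0) and maximum (10).
clustered = [0, 5, 5, 5, 10]      # most values sit at the center
spread_out = [0, 0, 5, 10, 10]    # most values sit at the extremes

range_clustered = max(clustered) - min(clustered)
range_spread = max(spread_out) - min(spread_out)

var_clustered = statistics.variance(clustered)   # sample variance
var_spread = statistics.variance(spread_out)

print(range_clustered, range_spread)
print(var_clustered, var_spread)
```

Both ranges are 10, but the variance of the second data set is twice that of the first because the variance takes every value into account.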
27. Measures of Variability Important notation:
Population variance: σ² “sigma squared”
Sample variance: s²
As with the mean, we have different notation for the population variance and the sample variance. For the population we use the lower case Greek letter sigma with a square on it. This is called “sigma squared”. For a sample variance we typically use the lower case letter s, again with a square.
28. Measures of Variability Important Formula:
We can use the notation we examined earlier to look at a formula for the variance and see how it is calculated. The variance looks at each value and measures its distance from the mean. Calculation of the population variance starts with the difference between each value and the mean. When we do this, some differences are positive and some are negative. If you remember from high school algebra, one way to get rid of negative values is to square them.
29. Measures of Variability Important Formula:
We can then average these squared distances. We average by summing the values and dividing by how many there are. Notice we use a capital N in this calculation; it is typically the symbol used for the population size.
30. Measures of Variability Sample Variance
For the sample we follow basically the same procedure. Notice that our formula has the same calculations, but now everything is in terms of sample values. Rather than “mu” we have “y-bar”. Again, the idea is that we calculate the average squared distance from the mean for the values in our data set.
31. Measures of Variability Sample Variance
Notice that in this case we divide by n-1 rather than N.
32. Measures of Variability Sample Variance
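In the notation of the preceding slides, the two formulas described are presumably the standard ones (the index limits are an assumption; the symbols follow the notes above):

```latex
% Population variance: average squared distance from the population mean
\sigma^2 \;=\; \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \mu\right)^2

% Sample variance: same calculation with sample values, dividing by n - 1
s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2
```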
33. Simple Example: A health researcher examined the amount of soda that a group of teenagers consumed during a day. The resulting amounts in ounces were: 9, 9, 6, 15, 12, 14, and 40.
Let's consider again our simple example from earlier. We can calculate the variance of this sample using a few basic steps. I typically set up a basic table with three columns.
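The table itself is not reproduced here. One plausible layout, assumed for this sketch, has a column for each value, a column for its difference from the mean, and a column for the squared difference. In Python:

```python
soda_oz = [9, 9, 6, 15, 12, 14, 40]
n = len(soda_oz)
y_bar = sum(soda_oz) / n                     # sample mean, 15.0

# Assumed three columns: y, y - ybar, (y - ybar)^2
rows = [(y, y - y_bar, (y - y_bar) ** 2) for y in soda_oz]

print(f"{'y':>4} {'y - ybar':>9} {'(y - ybar)^2':>13}")
for y, dev, sq in rows:
    print(f"{y:>4} {dev:>9.1f} {sq:>13.1f}")

sum_sq = sum(sq for _, _, sq in rows)        # total of the third column
sample_variance = sum_sq / (n - 1)           # divide by n - 1, not n
print(sum_sq, round(sample_variance, 2))
```

The third column adds up to 788, and dividing by n - 1 = 6 gives the sample variance of about 131.33 used in the slides that follow.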
35. Question for you
36. What does it tell us? By itself… not much.
Some people try lots of “tricks” to try to recreate the data set from this number.
The purpose of the number is to make a comparison with other data sets.
Example: Another group of teens had soda consumption that had a variance of 473.2.
Other group was more spread out than our group.
What does a variance of 131.33 tell us? Not much. A lot of people have mathematical tricks to try to figure out what data you had from the variance. Although this might be of interest to a mathematician, the number by itself does not tell us much. Remember that the numbers are used for comparisons of groups. The point of the variance, the range, and the other values we will discuss is to make comparisons. Our data set had a variance of 131.33. Another group of teens had a variance of 473.2. The other group was more spread out than our group.
37. The Standard Deviation Variance is not on the same scale as the original data.
Standard Deviation – square root of the variance.
Has the same units as original data
Allows more direct comparisons
Many people do not like the variance because it is not on the same scale as the original data. For instance, our variance of 131.33 has units of ounces squared. Square ounces are not really something that we understand. From high school algebra we know that to get rid of a square we take the square root. The square root of the variance is referred to as the standard deviation. The standard deviation, like the variance, measures the spread of the data, but it is on the same scale as the original data and allows more direct comparisons.
38. The Standard Deviation For amount of soda
For our soda example we can take the square root of the variance. The square root of 131.33 is 11.46.
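The same numbers can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) formulas. This is a sketch for checking the arithmetic, not a required tool for the course.

```python
import math
import statistics

soda_oz = [9, 9, 6, 15, 12, 14, 40]

s_squared = statistics.variance(soda_oz)   # sample variance, in ounces squared
s = statistics.stdev(soda_oz)              # sample standard deviation, in ounces

print(round(s_squared, 2), round(s, 2))
# The standard deviation is the square root of the variance.
print(math.isclose(s, math.sqrt(s_squared)))
```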
39. What does it tell us? Understand variability in the data.
Which is more consistent?
Recall that we can use the measures of variation to help us understand the variability in the data. Again, the point is not the calculations; it is using the values to interpret the data and discuss the story behind the data. To think about that, let's consider an example of making this comparison.
40. City Temperature
Let's consider the daily temperatures for various cities around the country. Let's start with Raleigh, NC. What is the temperature like in Raleigh?
41. City Temperature
We have data about the daily temperature for many cities around the country. For Raleigh, if we average the temperature over the entire year, the average is 59, the median is 61, and the standard deviation is 15. Again, the point of these numbers is to make comparisons. If we look at a particular day and see that the temperature is 90, we know that it is above average. The standard deviation, however, only becomes useful once we have another standard deviation to compare it with.
42. City Temperature
Let's take another city: Fargo, ND. What is Fargo like?
43. City Temperature
The average temperature for the year is 42, the median is 43, and the standard deviation is 24. Comparing the average or median with Raleigh, we see that Fargo is generally colder. From the standard deviation we see that Fargo is more varied than Raleigh.
44. City Temperature
Let's take another city: Fairbanks, Alaska. We might guess that Fairbanks is colder than Fargo or Raleigh.
45. City Temperature
From the summary statistics we see that is true. We also see that the standard deviation is larger than for either of the other two cities. From this we can conclude that Fairbanks is more varied in temperature than either Raleigh or Fargo. Fairbanks has very cold winters and warm periods.
46. City Temperature
As a final city, let's consider Honolulu, Hawaii. What is Honolulu like? It is nice. It is warm there, and it is warm year round. We might guess that the average temperature will be higher. And we might also expect less variability in temperature.
47. City Temperature
We see that Honolulu has an average temperature of 77, a median of 77, and a much smaller standard deviation of 3. This reflects the consistency in temperature that we would expect for Honolulu. The point of this example is that by knowing these numbers we can see part of the story of the data and start to interpret what is going on.
48. Coefficient of Variation Coefficient of Variation (CV) – ratio of standard deviation to mean
Used to compare variability when scales are very different.
Some fields of study also make use of the coefficient of variation, or CV. The CV is the ratio of the standard deviation to the mean. It is a value that can be used to compare the variability of groups when they are measured on different scales.
49. Example: Students in a midwestern state take an end-of-grade exam that has a maximum of 100 points. A class testing a new teaching method had a standard deviation of 10.
Students in an east coast state take an end-of-grade exam that has a maximum of 500 points. A class testing the new teaching method had a standard deviation of 30. Which is more varied?
Let's take an example to see why the CV is useful. An educational researcher is examining a new teaching method. He recruits groups of students at very different locations. He would like to compare the groups using end-of-grade (EOG) exams. One group is from the midwest, where they take an EOG exam that has a maximum of 100. The standard deviation is 10. The other group, from the east coast, takes an EOG exam that has a maximum of 500. The standard deviation is 30. If we just look at the standard deviations, we might think that the east coast group was much more varied. But this might be because the scores from the east coast school are on a scale that gives very big numbers.
50. Example The mean for the midwestern state was 70.
The mean for the east coast state was 350.
For a more honest comparison we might look at the CV for both groups. The average for the midwestern state was 70. The CV would be 10 divided by 70, or 14.3%. That is, the standard deviation is 14.3% of the mean. The east coast state had an average of 350, so its CV was 30 divided by 350, or 8.6%. So we see that, relative to the mean, there is larger variation in the midwestern state.
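The two CV calculations can be sketched in Python. The function name `coefficient_of_variation` is illustrative, and expressing the result as a percentage follows the numbers quoted above.

```python
def coefficient_of_variation(sd, mean):
    """Return the CV (standard deviation / mean) as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean

cv_midwest = coefficient_of_variation(10, 70)       # 100-point exam
cv_east_coast = coefficient_of_variation(30, 350)   # 500-point exam

print(f"{cv_midwest:.1f}% vs {cv_east_coast:.1f}%")
```

Although the east coast standard deviation is three times larger, its CV is smaller because its mean is five times larger.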
51. Question
52. Transformations Often we change units of the data. What happens?
Change feet to centimeters?
Pounds to kilograms?
Add 20 points to each student's score?
53. Transformations Multiplying each value in a data set by a positive constant (as in a change of units) multiplies the mean and standard deviation by that constant. The variance is multiplied by the square of the constant.
Variance is on the squared scale
Mean and SD are on scale of the data.
54. Transformations Adding the same value to each item in a data set changes the mean by that amount but does not change the standard deviation or variance.
Everything shifts together.
Spread of the items does not change.
55. Example Soda consumption: The soda amounts were given in ounces. One fluid ounce is equal to 29.57 milliliters. If we had measured the amounts in milliliters what would the mean and standard deviation have been?
Mean: 15 × 29.57 = 443.55 ml
Standard Deviation: 11.46 × 29.57 = 338.87 ml
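Both transformation rules can be checked numerically on the soda data. In the sketch below, the conversion factor comes from the slide; adding 20 to the soda amounts is only an illustration of the shift rule, borrowed from the "add 20 points" question earlier.

```python
import statistics

soda_oz = [9, 9, 6, 15, 12, 14, 40]
ML_PER_OZ = 29.57                              # conversion factor from the slide

soda_ml = [y * ML_PER_OZ for y in soda_oz]     # multiply by a constant
shifted = [y + 20 for y in soda_oz]            # add a constant (illustration)

mean_oz = statistics.mean(soda_oz)
sd_oz = statistics.stdev(soda_oz)
var_oz = statistics.variance(soda_oz)

mean_ml = statistics.mean(soda_ml)             # mean_oz * 29.57
sd_ml = statistics.stdev(soda_ml)              # sd_oz * 29.57
var_ml = statistics.variance(soda_ml)          # var_oz * 29.57 ** 2

mean_shifted = statistics.mean(shifted)        # mean_oz + 20
sd_shifted = statistics.stdev(shifted)         # unchanged

print(round(mean_ml, 2), round(sd_ml, 2))
print(mean_shifted, round(sd_shifted, 2))
```

Rescaling changes the mean and standard deviation by the factor and the variance by its square, while shifting moves only the mean.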
56. Outliers Outliers: unusual values that do not fit with the overall group
Cause the mean to be unrepresentative
Recall that outliers are unusual values that do not fit with the overall group. Outliers can be problematic because they make the mean unrepresentative and can pull it away from the median.
57. Causes of Outliers Data entry errors.
Person records values incorrectly
Should be corrected
Value from another population.
Should be teenager but is really a toddler
Should be corrected
Outliers have several possible causes. The most typical cause is a data entry error. Often when individuals are entering values into a computer they make minor errors, such as entering a 0 instead of a decimal point. If we notice an outlier we should try to determine whether it is a data entry error and, if it is, correct it.
Another common problem is a value from another population. This occurs when a value from a different group is inadvertently included with our group. For instance, if a toddler is recorded among the teenagers, that value will show up as an outlier. As with data entry errors, we should correct this value.
58. Causes of Outliers Actual unusual values
Sometimes you have student who is 7’ tall
Should be explored and verified.
Ask why it occurred – may give more information about the phenomenon
The final cause of outliers is an actual unusual value. Sometimes there are individuals who really are different from the others. Occasionally there is an individual who is 7 feet tall. When there is an actual unusual value we should use it as an opportunity to explore the data. We should try to understand the reasons behind the unusual value and take the insight it gives about the phenomenon we are studying.
59. Impact of Outliers Can substantially change the mean and standard deviation
Relative to data without outlier
Does not change median
Resistant to outliers
Why are outliers of concern? An outlier can substantially change the mean and standard deviation of a data set. However, outliers generally change the median very little, if at all. We say the median is resistant to outliers. Because it is resistant, we might prefer the median, or other resistant measures, when the data may contain outliers.
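A small numeric comparison shows how much more the mean and standard deviation react to one unusual value than the median does. The heights below are hypothetical and chosen only for illustration.

```python
import statistics

heights = [64, 66, 67, 68, 70]    # hypothetical heights in inches
with_outlier = heights + [84]     # add one 7-foot individual

mean_before = statistics.mean(heights)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(heights)
median_after = statistics.median(with_outlier)
sd_before = statistics.stdev(heights)
sd_after = statistics.stdev(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
print(f"SD:     {sd_before:.2f} -> {sd_after:.2f}")
```

In this sketch the median moves by only half an inch, while the mean shifts by almost three inches and the standard deviation more than triples.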
60. Five Number Summary Quartiles
1st Quartile- value that has 25% of the data below. (median of the lower half) Q1
3rd Quartile- value that has 25% of the data above. (median of the upper half) Q3
The median marks off the point with 50% of the data above it and 50% below. Two other values that mark off a percentage of the data are the quartiles. The first quartile is the value that has 25% of the data below it. The third quartile is the value that has 25% of the data above it. We use the notation Q1 and Q3. To find a quartile we look at one half of the data and find the median of that half: the median of the lower half is the first quartile, and the median of the upper half is the third quartile.
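The "median of each half" recipe can be sketched as follows. One detail is not specified in the notes: when the number of values is odd, this sketch assumes the middle value is left out of both halves, which is one common convention (other textbooks and software use slightly different rules). The function name and data are hypothetical.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: for an odd number of values, the middle value is
    excluded from both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]           # lower half
    upper = data[(n + 1) // 2 :]     # upper half
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

heights = [58, 62, 64, 66, 67, 68, 70, 72, 78]  # hypothetical heights in inches
q1, med, q3 = quartiles(heights)
print("Q1 =", q1, "median =", med, "Q3 =", q3)
```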
61. Five Number Summary Five numbers that tell us about data set
Includes minimum, maximum, Q1, Q3, median
Taking the median and the quartiles together with two other values, the minimum and the maximum, gives us five values that tell us about the data set. These are referred to as the five number summary. With these five numbers we break the data into four quarters. Many people use the five number summary as a first look at a data set. It gives us an idea of the location and spread of the data, and three of the values (Q1, the median, and Q3) are resistant to outliers. The quartiles are also used in calculating a measure of variability.
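As a rough illustration, the five number summary can be assembled from a sorted list. The data are hypothetical, and the quartiles use the median-of-halves method described in the notes (with the middle value excluded from both halves when the count is odd, which is an assumption).

```python
import statistics

def five_number_summary(values):
    """Return the minimum, Q1, median, Q3, and maximum of the data."""
    data = sorted(values)
    n = len(data)
    lower, upper = data[: n // 2], data[(n + 1) // 2 :]
    return {
        "min": data[0],
        "Q1": statistics.median(lower),
        "median": statistics.median(data),
        "Q3": statistics.median(upper),
        "max": data[-1],
    }

heights = [59, 61, 63, 64, 66, 68, 69, 71, 74, 77]  # hypothetical heights in inches
summary = five_number_summary(heights)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Each adjacent pair of values in the summary brackets one quarter of the observations.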
62. Inter-quartile Range Measure of variability
Width of the middle 50%.
IQR = Q3-Q1
Resistant to outliers- not as heavily influenced by outliers
Does not summarize all values.
The interquartile range is a measure of variability based on the quartiles. We use the notation IQR to indicate the interquartile range. It is calculated by taking the difference between the third and first quartiles. The middle 50% of the data lies between Q1 and Q3, and the IQR gives us the width of that middle 50%. The main advantage of the IQR is that it is resistant to outliers. The standard deviation is heavily influenced by outliers, but the IQR is not. Often when there are outliers we might use the IQR to measure the variability.
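The resistance of the IQR can be seen by changing one extreme value and recomputing both measures of spread. The data are hypothetical, and the quartiles are found with the median-of-halves method from the notes.

```python
import statistics

def iqr(values):
    """Interquartile range: Q3 - Q1, with quartiles as medians of each half."""
    data = sorted(values)
    n = len(data)
    q1 = statistics.median(data[: n // 2])
    q3 = statistics.median(data[(n + 1) // 2 :])
    return q3 - q1

clean = [60, 62, 64, 65, 66, 67, 68, 70, 72, 74]   # hypothetical values
with_outlier = clean[:-1] + [110]                   # largest value replaced by an outlier

print("IQR:", iqr(clean), "->", iqr(with_outlier))
print(f"SD:  {statistics.stdev(clean):.2f} -> {statistics.stdev(with_outlier):.2f}")
```

The IQR is the same for both lists because the outlier does not move Q1 or Q3, while the standard deviation more than triples.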
63. Boxplots also called box and whisker plots
Present the 5 number summary in graphical form
Helps understand data
The 5 number summary is often used as the basis for visualizing a data set. The boxplot, sometimes called the box and whisker plot, is a visualization of the 5 number summary. Many people use boxplots as a first look at a data set.
64. Boxplot To see what is involved in a boxplot, let's examine one. This one is for the heights of a group of undergraduate students. You will notice a box in the center of the graphic and two lines that extend horizontally on each side of the box. The box and the lines indicate where the values of the 5 number summary are located.
65. Boxplot The vertical line in the middle of the box indicates where the median of the data set is located.
66. Boxplot The vertical lines at the ends of the box indicate where the quartiles are located. The box is useful because it shows where the middle 50% of the data are located. From this we see that 50% of these students are between about 64 and 70 inches in height.
67. Boxplot The maximum and minimum are located at the ends of the lines, so the maximum is at about 78 inches and the minimum at about 58 inches. We can use the boxplot to quickly understand the data.
68. Shapes from boxplots Shapes of distributions from boxplots
Whiskers indicate the long tail of a distribution
Skewed or symmetric.
Looking at a boxplot we can also determine the shape of a distribution. Each whisker in a boxplot covers one quarter of the data. If one whisker is more stretched out than the other, the distribution is skewed. Let's look at some examples.
69. Long whisker to the right This boxplot has a long tail to the right. This long tail indicates a distribution that is skewed to the right.
70. If we look back at the boxplot of heights we examined earlier, both whiskers are about the same length. This indicates a distribution that is approximately symmetric.
71. Long whisker to the left This boxplot has a long tail to the left. This long tail indicates a distribution that is skewed to the left.
72. Shapes from boxplots Note: We cannot readily determine if a distribution is multimodal from a boxplot.
73. Outliers computer programs identify automatically
Whiskers extend to largest/smallest non-outliers
Uses asterisk or dots to mark outliers
Boxplots are often used as a first look at a data set. Computer programs take advantage of this by automatically marking suspected outliers in boxplots, typically with an asterisk, circle, or dot. Since the outliers could be the maximum or minimum, the whiskers no longer go all the way to the max or min. Instead, in these modified boxplots the whiskers extend to the largest or smallest non-outlier in the data set.
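The notes do not say how a program decides which values are suspected outliers. A widely used convention, assumed here, is the 1.5 × IQR rule: any value more than 1.5 IQRs below Q1 or above Q3 is flagged. The sketch below applies that assumed rule to hypothetical male heights; the function name and the data are illustrative only.

```python
import statistics

def modified_boxplot_stats(values, k=1.5):
    """Whisker ends and flagged outliers for a modified boxplot.

    Assumption: outliers are values beyond Q1 - k*IQR or Q3 + k*IQR
    (k = 1.5 is a common default, not something stated in the lecture).
    """
    data = sorted(values)
    n = len(data)
    q1 = statistics.median(data[: n // 2])
    q3 = statistics.median(data[(n + 1) // 2 :])
    spread = q3 - q1
    low_fence, high_fence = q1 - k * spread, q3 + k * spread
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    # Whiskers stop at the smallest and largest values that are not outliers.
    return {"whisker_low": inside[0], "whisker_high": inside[-1], "outliers": outliers}

male_heights = [63, 69, 70, 70, 71, 72, 72, 73, 74, 75]  # hypothetical, in inches
result = modified_boxplot_stats(male_heights)
print(result)
```

With these values the 63 is flagged and the lower whisker stops at 69, the smallest non-outlier.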
74. Comparative Boxplots Here is an example of another way in which boxplots are often used. This data splits the heights we examined earlier into males and females, which allows us to easily compare the two groups. We see that both distributions are about symmetric. We can quickly compare the middle 50% of the groups: for the males the middle 50% is in the 70s, while the middle 50% of the females is in the mid 60s. The males are shifted over from the females by about five to seven inches. We also notice an outlier in the males.
75. Comparative Boxplots The outlier is indicated by the dot. The whisker extends to what the computer believes is the smallest non-outlier.
76. Comparative Boxplots Males located about 7 inches higher
Outlier among males
Variability about the same
Both distributions are roughly symmetric
So by comparing the boxplots we see that the males are about 7 inches taller than the females, that both groups have similar variability, and that both distributions are roughly symmetric.
77. Comparative Boxplots Let's consider again the question of the outlier. What might be the cause of this outlier? Recall that the possible causes of outliers were data entry error, wrong population, or actual unusual value.
78. Outlier in the males What is the cause of this outlier?
Data entry error? Should be 73?
Wrong population? A female incorrectly recorded as a male?
Actual unusual value? A male that is really 63 inches tall?
We should explore each of these three causes. Could it be a data entry error? Perhaps a 73 was incorrectly entered as a 63. Maybe a male recording his height was 6’3” but incorrectly thought that was 63 inches. Or perhaps it is from the wrong population? Perhaps this is a female who was incorrectly recorded as a male. If so, the value should be part of the boxplot for females. Or perhaps there really is a male who is 63 inches tall. Males who are 5’3” tall are rare but not unheard of, so this could be an actual value. We should explore each of these possibilities and try to determine which is the real cause.
79. CoalEmissions Uncertainty Project (2009-10), Alissa Anderson, Colin Geisenhoffer, Brody Heffner, Michael Shaw & Emily Wisner After 2% Trim Before 2% Trim
80. Class Problem Here is a summary of strength data
N = 153, mean = 135.39, median = 135.40, trimmed mean = 135.1, standard deviation = 4.59, lower quartile = 132.95, upper quartile = 138.25, minimum = 122.2, maximum = 138.25.
Construct a boxplot.
Comment on any interesting features.
Chapter 2.3 page 86 problem 33.
The mean and median are nearly equal, and the upper and lower quartiles are nearly equidistant from the median, so the data are approximately symmetrically distributed around the median (and mean).
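The symmetry argument can be checked directly from the summary values given in the problem. The last step goes beyond the slide: it applies the common 1.5 × IQR rule, which is an assumption here, to see whether the reported minimum would be flagged as a low outlier.

```python
# Summary values as given in the class problem.
mean, median = 135.39, 135.40
q1, q3 = 132.95, 138.25
minimum = 122.2

# Symmetry checks: mean vs. median, and quartile distances from the median.
mean_median_gap = abs(mean - median)
lower_gap = median - q1
upper_gap = q3 - median

# ASSUMPTION: 1.5 * IQR rule for flagging outliers (not stated in the lecture).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
min_is_flagged = minimum < lower_fence

print(f"|mean - median| = {mean_median_gap:.2f}")
print(f"median - Q1 = {lower_gap:.2f}, Q3 - median = {upper_gap:.2f}")
print(f"IQR = {iqr:.2f}, lower fence = {lower_fence:.2f}")
print("minimum flagged as low outlier:", min_is_flagged)
```

Under that assumed rule the minimum of 122.2 falls below the lower fence, so a modified boxplot would likely mark at least one low outlier, which is one more feature worth commenting on.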