Chapter 3: Numerically Summarizing Data

Chapter 3: Numerically Summarizing Data. 3.1 Measures of Central Tendency 3.2 Measures of Dispersion 3.3 Measures of Central Tendency and Dispersion from Grouped Data 3.4 Measures of Position 3.5 The Five-Number Summary and Boxplots. September 25, 2008. The Mean of a Set. Section 3.1.

honora

Presentation Transcript


  1. Chapter 3: Numerically Summarizing Data 3.1 Measures of Central Tendency 3.2 Measures of Dispersion 3.3 Measures of Central Tendency and Dispersion from Grouped Data 3.4 Measures of Position 3.5 The Five-Number Summary and Boxplots September 25, 2008

  2. The Mean of a Set Section 3.1

  3. Remark

  4. The Median of a Set In other words, the median is the midpoint of the observations when they are ordered from smallest to largest or vice-versa.

  5. Example 1 Find the mean and median of the set of observations: {20, -3, 4, 10, 6, -1}.

  6. Example 2 Find the mean and median of the set of observations: {-10, -6 ,0, 4, 9}.
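Examples 1 and 2 can be checked with a short Python sketch. It uses only the standard-library `statistics` module; the variable names are illustrative and not part of the slides.

```python
from statistics import mean, median

example_1 = [20, -3, 4, 10, 6, -1]
example_2 = [-10, -6, 0, 4, 9]

# Mean: the sum of the observations divided by how many there are.
mean_1 = mean(example_1)      # 36 / 6
mean_2 = mean(example_2)      # -3 / 5

# Median: the middle value of the sorted data, or the average of the
# two middle values when the number of observations is even.
median_1 = median(example_1)  # sorted: -3, -1, 4, 6, 10, 20
median_2 = median(example_2)  # sorted: -10, -6, 0, 4, 9

print("Example 1:", mean_1, median_1)
print("Example 2:", mean_2, median_2)
```

Example 1 has mean 6 and median 5; Example 2 has mean -0.6 and median 0.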

  7. Mean and Dot Plot Notice that the mean is a fulcrum for the distribution of point masses on the lever (x-axis).

  8. Add Points (“Weights”) The fulcrum has moved 1.2 units to the left.

  9. Shape, Mean and Median

  10. Outlier • An outlier is an observation (data point) that falls well above or below the overall set of data. • The mean can be highly influenced by an outlier. • The median is said to be resistant to outliers, i.e., its value is not changed significantly by the addition or removal of an outlier.
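The difference in sensitivity can be seen in a small sketch. It reuses the set from Example 2 and appends the value 100, which is a hypothetical outlier chosen only for illustration.

```python
from statistics import mean, median

data = [-10, -6, 0, 4, 9]     # the set from Example 2
with_outlier = data + [100]   # 100 is a made-up outlier, not from the slides

# How far each measure of center moves when the outlier is added.
mean_shift = mean(with_outlier) - mean(data)
median_shift = median(with_outlier) - median(data)

print("mean moved by", round(mean_shift, 2))
print("median moved by", median_shift)
```

The mean moves by more than 16 units, while the median moves by only 2.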

  11. Example

  12. Mode • The mode is the most frequent observation of the variable. • It is most often used with categorical data. • For numerical data, it can be used when the data is discrete. The mode of the categorical variable color is red, which occurs 35 times.

  13. Example Mia Hamm, who retired at the 2004 Olympics, is considered to be the most prolific player in international soccer. Here is a list of the number of goals she scored in each year of her 18-year career. MHG = {0,0,0,4,10,1,10,10,19,9,18,20,13,13,2,7,8,13}. Treating these yearly goal counts as the population, find the mean, median, and mode of this set.
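A possible way to work this example in Python is sketched below. It assumes Python 3.8 or later for `statistics.multimode`, which returns every value tied for the highest frequency; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# Goals per year, copied from the slide (18 seasons).
goals = [0, 0, 0, 4, 10, 1, 10, 10, 19, 9, 18, 20, 13, 13, 2, 7, 8, 13]

goal_mean = mean(goals)        # 157 goals over 18 seasons
goal_median = median(goals)    # average of the 9th and 10th sorted values
goal_modes = multimode(goals)  # all values that share the top frequency

print(round(goal_mean, 2), goal_median, goal_modes)
```

The data has three modes (0, 10, and 13 each occur three times), so a function that returns a single mode would hide part of the answer.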

  14. Mean, Median and Mode and Distribution Shape

  15. Measures of Dispersion Consider the following sets of observations: S1 = {0,0,0,0,0,0,0,0,0,0} S2 = {-5,-4,-3,-2,-1,1,2,3,4,5}. Both sets have the same mean and median (namely, 0). However, their histograms or dot plots are quite different. Notice that the difference between the smallest and largest number is very different in the two sets. Section 3.2

  16. Range of a Set of Observations Remark: The range is completely determined by only two points of the set of observations.
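As a minimal sketch, assuming the usual definition of the range as the largest observation minus the smallest, the two sets S1 and S2 from the previous slide can be compared as follows. The helper name `value_range` is hypothetical.

```python
def value_range(observations):
    """Range of a data set: largest observation minus smallest."""
    return max(observations) - min(observations)

s1 = [0] * 10
s2 = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]

range_s1 = value_range(s1)
range_s2 = value_range(s2)

print(range_s1, range_s2)
```

Both sets have mean and median 0, but their ranges are 0 and 10. Only the minimum and maximum enter the calculation, which is why the range ignores everything in between.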

  17. Example Lance Armstrong won the Tour de France seven consecutive times (1999-2005). Here is data about his victories. The ranges for each category of winning are: Winning Time: range = 92.552 - 82.087 = 10.465 Distance: range = 3687 - 3278 = 409 Winning Speed: range = 41.65 - 39.46 = 2.19 Winning Margin: range = 7.283 - 1.017 = 6.266

  18. The Spread of Quantitative Data Consider the frequency distributions of two different data sets. Notice how the tails of each distribution change from being close together to being far apart. Section 2.4

  19. The Deviation from the Mean

  20. Variance and Standard Deviation Definition: The “average” of the squares of all deviations in a sample is called the variance of the sample. The standard deviation of a sample is defined as the square root of the variance. Question: Why n - 1 instead of n in these formulas?

  21. Remark There is an unfortunate ambiguity in how the terms variance and standard deviation are used. These quantities are computed in different ways, depending on whether the set under consideration is a population or a sample of a population. It turns out that if we use the formulas for variance and standard deviation where we divide by n instead of n - 1, then the standard deviation of the sample will tend to underestimate the standard deviation of the population. This is called bias. Hence, we will use the following definitions and will distinguish between the sample standard deviation and the population standard deviation.
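The formulas themselves are not preserved in this transcript. As an assumption, the standard textbook definitions are given here, with $n$ the sample size, $\bar{x}$ the sample mean, $N$ the population size, and $\mu$ the population mean.

```latex
% Sample variance and sample standard deviation (divide by n - 1)
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}

% Population variance and population standard deviation (divide by N)
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```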

  22. Example • For the set of observations (sample), {0,-3,10,7,5,-3,0}, • Find the range of the sample. • Find the mean and median of the sample. • Find the variance of the sample. • Find the standard deviation of the sample.
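The four parts of this example can be sketched in Python as follows. The set is treated as a sample, so `statistics.variance` and `statistics.stdev`, which divide by n - 1, are used; the variable names are illustrative.

```python
from statistics import mean, median, variance, stdev

sample = [0, -3, 10, 7, 5, -3, 0]

sample_range = max(sample) - min(sample)  # 10 - (-3)
sample_mean = mean(sample)                # 16 / 7
sample_median = median(sample)            # middle of -3, -3, 0, 0, 5, 7, 10
sample_variance = variance(sample)        # sum of squared deviations / (n - 1)
sample_sd = stdev(sample)                 # square root of the variance

print(sample_range, round(sample_mean, 3), sample_median)
print(round(sample_variance, 3), round(sample_sd, 3))
```

The range is 13, the mean is about 2.286, the median is 0, the variance is about 25.905, and the standard deviation is about 5.090.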

  23. Example • For the two sets of observations, S = {-1,0,0,0,1} and T = {-1,-1,-1,-1,0,1,1,1,1}, • Find the mean and median for each set. • Find the standard deviation for each set. We see from the dot plot that the set T has more points that vary from the mean and hence has a larger standard deviation.
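A short sketch for comparing S and T follows. It assumes both sets are treated as samples, so the standard deviation divides by n - 1; the slide does not say which convention it uses.

```python
from statistics import mean, median, stdev

s = [-1, 0, 0, 0, 1]
t = [-1, -1, -1, -1, 0, 1, 1, 1, 1]

# Both sets are symmetric about 0, so mean and median agree.
center_s = (mean(s), median(s))
center_t = (mean(t), median(t))

# Sample standard deviations (n - 1 in the denominator).
sd_s = stdev(s)  # sqrt(2 / 4)
sd_t = stdev(t)  # sqrt(8 / 8)

print(center_s, center_t)
print(round(sd_s, 4), round(sd_t, 4))
```

Both sets have mean and median 0, but T has the larger standard deviation (1 against about 0.7071) because more of its points sit away from the mean.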

  24. Properties of the Standard Deviation • The larger the spread (variation) in the data, the larger the standard deviation. • The standard deviation is zero if and only if all elements of the set are equal, in which case the mean of the set is that common value. • The standard deviation is influenced by outliers. This is true because the deviation from the mean of the set to the outlier is a large number in absolute value. • The standard deviation yields more information than the range of the set. (Why?)

  25. Example The following data represents the walking time (in minutes) from the dorm or apartment to Professor Bisch’s course on operator algebras. We treat the nine students as the population of Prof. Bisch’s class. Find the population mean and standard deviation. Choose a sample of 4 and compute the mean and standard deviation of the sample.

  26. Bell-shaped (symmetric) Distributions Consider a set of observations that is bell-shaped. All three distributions have different standard deviations.

  27. Empirical Rule for almost Bell-shaped Distributions

  28. Caution The Empirical Rule for bell-shaped distributions is an empirical approximation, not an exact result. The closer the distribution is to being perfectly bell-shaped, the more accurate the rule. It is useful in telling us how the data is concentrated about the mean of the distribution.

  29. Example

  30. Detailed Empirical Rule

  31. Example The distribution of the length of bolts produced by the Acme Bolt Company is approximately bell-shaped with a mean of 4 inches and a standard deviation of 0.007 inches. What is the range of length for 68% of the bolts produced by this company? What percentage of bolts will be between 3.986 inches and 4.014 inches? If the company discards any bolts that are less than 3.986 inches or greater than 4.014 inches, what percentage of bolts will be discarded? What percentage of the bolts will be between 4.007 inches and 4.021 inches?
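The bolt questions can be worked with a small sketch. The rule itself is not preserved in this transcript, so the code assumes the usual Empirical Rule percentages (68%, 95%, and 99.7% within 1, 2, and 3 standard deviations of the mean) and a symmetric distribution. The helper names are hypothetical, and the helpers only handle cut-offs that lie a whole number of standard deviations (at most 3) from the mean.

```python
# Assumed Empirical Rule percentages within k standard deviations of the mean.
WITHIN = {0: 0.0, 1: 68.0, 2: 95.0, 3: 99.7}

MEAN, SD = 4.0, 0.007  # bolt length in inches, from the example

def percent_below(x):
    """Approximate percent of bolts shorter than x (x must be 0-3 SDs from the mean)."""
    k = round((x - MEAN) / SD)   # signed number of standard deviations
    half = WITHIN[abs(k)] / 2    # percent lying between the mean and x
    return 50 + half if k >= 0 else 50 - half

def percent_between(lo, hi):
    """Approximate percent of bolts with length between lo and hi."""
    return percent_below(hi) - percent_below(lo)

interval_68 = (MEAN - SD, MEAN + SD)         # one SD either side of the mean
kept = percent_between(3.986, 4.014)         # two SDs either side
discarded = 100 - kept
upper_band = percent_between(4.007, 4.021)   # from +1 SD to +3 SD

print(interval_68, kept, discarded, round(upper_band, 2))
```

Under these assumptions, 68% of bolts fall between 3.993 and 4.007 inches, 95% fall between 3.986 and 4.014 inches, 5% are discarded, and about 15.85% fall between 4.007 and 4.021 inches.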

  32. Chebyshev Inequality Example: Suppose that a population has a mean of 73.5 and a standard deviation of 5.5. Find an interval that contains at least 75% of the data points in the population.
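The statement of the inequality is not preserved in this transcript. The sketch below assumes the standard form: for any k > 1, at least 1 - 1/k² of the data lies within k standard deviations of the mean. Solving 1 - 1/k² = 0.75 gives k = 2. The function names are hypothetical.

```python
from math import sqrt

def chebyshev_k(fraction):
    """Smallest k for which Chebyshev guarantees at least `fraction` of the data."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    return 1 / sqrt(1 - fraction)

def chebyshev_interval(mean, sd, fraction):
    """Interval mean +/- k*sd that contains at least `fraction` of the data."""
    k = chebyshev_k(fraction)
    return mean - k * sd, mean + k * sd

low, high = chebyshev_interval(73.5, 5.5, 0.75)
print(low, high)
```

For this example the interval runs from 62.5 to 84.5.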

  33. Example In December 2004, the average price of regular unleaded gasoline, excluding taxes, in the United States was $1.37 per gallon. Researchers in the Department of Energy estimated that the standard deviation for this mean price was $0.05. Using Chebyshev’s Inequality, estimate the percentage of gasoline stations that had prices within 3 standard deviations of the mean. What percentage had prices within 2.5 standard deviations?
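A minimal sketch of this calculation, again assuming the standard bound of at least 100(1 - 1/k²)% of the data within k standard deviations of the mean. The function name is hypothetical.

```python
def chebyshev_min_percent(k):
    """Chebyshev lower bound on the percent of data within k SDs of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)

MEAN_PRICE, SD_PRICE = 1.37, 0.05  # dollars per gallon, from the example

percent_3 = chebyshev_min_percent(3)     # 100 * (1 - 1/9)
percent_2_5 = chebyshev_min_percent(2.5) # 100 * (1 - 1/6.25)

# The price interval that corresponds to 3 standard deviations.
interval_3 = (MEAN_PRICE - 3 * SD_PRICE, MEAN_PRICE + 3 * SD_PRICE)

print(round(percent_3, 1), round(percent_2_5, 1), interval_3)
```

At least about 88.9% of stations had prices within 3 standard deviations (between $1.22 and $1.52), and at least 84% had prices within 2.5 standard deviations.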

  34. Remark • Chebyshev’s Inequality does not place any preconditions on the shape of the data set. • It is true for populations and samples. • The theorem does not say that exactly 100(1 - 1/k²)% of the points lie within k standard deviations of the mean, but rather that at least this percentage do.

  35. Mean and Standard Deviation for Grouped Data Section 3.3

  36. Example

  37. Example

  38. Weighted Mean of a Set Given a set of numbers, suppose that we believe that some of the numbers are more important than other numbers in the set. To reflect this notion, we define the weighted mean of a set of numbers.

  39. Example Consider the set S = {-3, 1, 0, 3, -1, 1, 0} and the weights {1.5, 0, 1, -1, 1, 2, 1}. Find the weighted mean of this set with respect to the given weights.
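The slide's formula is not preserved in this transcript. The sketch below assumes the usual definition, the sum of each weight times its value divided by the sum of the weights. The function name is hypothetical. Note that the weights given in the example include a zero and a negative value, which the formula accepts as long as the weights do not sum to zero.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

s = [-3, 1, 0, 3, -1, 1, 0]
w = [1.5, 0, 1, -1, 1, 2, 1]

result = weighted_mean(s, w)  # -6.5 / 5.5
print(round(result, 4))
```

With these weights the weighted mean is -6.5 / 5.5, or about -1.1818.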

  40. Approximation for Standard Deviation and Variance for Grouped Data

  41. Example

  42. Approximating the Median of grouped Data

  43. Example

  44. Measures of Position in a Distribution • The mean and median give us information about the “center” of a set of observations (the distribution). • The range and standard deviation give us information about the “spread” of the distribution. • We now introduce a concept that describes “position” within a distribution. It uses the concept of percentiles. Percentiles show how the distribution can be divided into parts (sometimes equal), which in turn gives us the notion of position within the distribution. Section 3.4

  45. z-score

  46. Example Consider the sample: {-1,0,1,5,19}. Compute the z-score for each data point.
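The z-score definition is not preserved in this transcript. The sketch below assumes the usual one, z = (x - mean) / standard deviation, and uses the sample standard deviation (n - 1) because the set is described as a sample.

```python
from statistics import mean, stdev

sample = [-1, 0, 1, 5, 19]

sample_mean = mean(sample)  # 24 / 5
sample_sd = stdev(sample)   # sample standard deviation, divides by n - 1

# z-score: how many standard deviations each point lies from the mean.
z_scores = [(x - sample_mean) / sample_sd for x in sample]

for x, z in zip(sample, z_scores):
    print(x, round(z, 4))
```

The z-scores always sum to zero, and the value 19 stands out with a z-score of about 1.72 while every other point is within one standard deviation of the mean.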

  47. Application of z-score The average 20- to 29- year old man is 69.6 inches tall with a standard deviation of 2.7 inches. The average 20- to 29- year old woman is 64.1 inches with a standard deviation of 2.6 inches. With respect to their population, who is relatively taller: a 75-inch man or a 70-inch woman?
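This comparison can be sketched by computing one z-score per person, each relative to that person's own population. The function name `z_score` is hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mean) / sd

z_man = z_score(75, mean=69.6, sd=2.7)    # 5.4 / 2.7
z_woman = z_score(70, mean=64.1, sd=2.6)  # 5.9 / 2.6

relatively_taller = "woman" if z_woman > z_man else "man"
print(round(z_man, 2), round(z_woman, 2), relatively_taller)
```

The man is 2 standard deviations above his population mean and the woman is about 2.27 above hers, so the 70-inch woman is relatively taller.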

  48. Percentile Definition: The kth percentile of a distribution, Pk, is a number such that k% of the observations fall at or below it. In other words, it subdivides the total area enclosed by the distribution into two sub-areas, A1 and A2, which make up k% and (100 - k)% of the total area.
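As a rough illustration of the idea of position, the sketch below computes the percentage of observations that fall at or below a given value. This is only one simple reading of the definition and is not the algorithm from the next slide, which is not preserved in this transcript. The function name is hypothetical, and the data is the sample from the z-score example.

```python
def percent_at_or_below(observations, value):
    """Percent of observations that are less than or equal to `value`."""
    if not observations:
        raise ValueError("observations must not be empty")
    count = sum(1 for x in observations if x <= value)
    return 100 * count / len(observations)

sample = [-1, 0, 1, 5, 19]

position_of_1 = percent_at_or_below(sample, 1)    # 3 of 5 observations
position_of_19 = percent_at_or_below(sample, 19)  # all 5 observations

print(position_of_1, position_of_19)
```

Here 60% of the observations are at or below 1, and 100% are at or below 19.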

  49. Algorithm for Percentiles
