1 / 81

Numerical Descriptive Techniques

Learn about central tendency, variation, and shape measures that are used to describe data numerically. Explore the arithmetic mean, median, mode, variance, and standard deviation in this comprehensive guide.

watsonm
Download Presentation

Numerical Descriptive Techniques

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Numerical Descriptive Techniques

  2. Summary Measures Describing Data Numerically Central Tendency Variation Shape Arithmetic Mean Range Skewness Median Interquartile Range Mode Variance Geometric Mean Standard Deviation Coefficient of Variation Quartiles

  3. Measures of Central Location • Usually, we focus our attention on two types of measures when describing population characteristics: • Central location • Variability or spread The measure of central location reflects the locations of all the actual data points.

  4. With one data point clearly the central location is at the point itself. Measures of Central Location • The measure of central location reflects the locations of all the actual data points. • How? With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them). But if the third data point appears on the left hand-side of the midrange, it should “pull” the central location to the left.

  5. Sum of the observations Number of observations Mean = The Arithmetic Mean • This is the most popular and useful measure of central location

  6. The Arithmetic Mean Sample mean Population mean Sample size Population size

  7. Example 2 Suppose the telephone bills represent the populationof measurements. The population mean is The Arithmetic Mean • Example 1 The reported time on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet. 0 7 22 11.0 42.19 38.45 45.77 43.59

  8. The Arithmetic Mean • Drawback of the mean: It can be influenced by unusual observations, because it uses all the information in the data set.

  9. Example 3 Find the median of the time on the internetfor the 10 adults of example 1 Suppose only 9 adults were sampled (exclude, say, the longest time (33)) Comment Even number of observations 0, 0, 5, 7, 8,9, 12, 14, 22, 33 The Median • The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude. It divides the data in half. Odd number of observations 8 8.5, 0, 0, 5, 7, 89, 12, 14, 22 0, 0, 5, 7, 8,9, 12, 14, 22, 33

  10. The Median • Median of 8 2 9 11 1 6 3 n = 7 (odd sample size). First order the data. 1 2 3 6 8 9 11 Median • For odd sample size, median is the {(n+1)/2}th ordered observation.

  11. The Median • The engineering group receives e-mail requests for technical information from sales and services person. The daily numbers for 6 days were 11, 9, 17, 19, 4, and 15. What is the central location of the data? • For even sample sizes, the median is the average of {n/2}th and {n/2+1}th ordered observations.

  12. The Mode • The Mode of a set of observations is the value that occurs most frequently. • Set of data may have one mode (or modal class), or two or more modes. For large data sets the modal class is much more relevant than a single-value mode. The modal class

  13. The Mode • Find the mode for the data in Example 1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 Solution • All observation except “0” occur once. There are two “0”. Thus, the mode is zero. • Is this a good measure of central location? • The value “0” does not reside at the center of this set(compare with the mean = 11.0 and the median = 8.5).

  14. Relationship among Mean, Median, and Mode • If a distribution is symmetrical, the mean, median and mode coincide Mean = Median = Mode • If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ. A positively skewed distribution (“skewed to the right”) Mode < Median < Mean Mode Mean Median

  15. Relationship among Mean, Median, and Mode • If a distribution is symmetrical, the mean, median and mode coincide • If a distribution is non symmetrical, and skewed to the left or to the right, the three measures differ. A negatively skewed distribution (“skewed to the left”) A positively skewed distribution (“skewed to the right”) Mode Mean Mean Mode Median Mean < Median < Mode Median

  16. Geometric Mean • The arithmetic mean is the most popular measure of the central location of the distribution of a set of observations. • But the arithmetic mean is not a good measure of the average rate at which a quantity grows over time. That quantity, whose growth rate (or rate of change) we wish to measure, might be the total annual sales of a firm or the market value of an investment. • The geometric mean should be used to measure the average growth rate of the values of a variable over time.

  17. Example

  18. Measures of variability • Measures of central location fail to tell the whole story about the distribution. • A question of interest still remains unanswered: How much are the observations spread out around the mean value?

  19. Measures of variability Observe two hypothetical data sets: Small variability The average value provides a good representation of the observations in the data set. This data set is now changing to...

  20. Measures of variability Observe two hypothetical data sets: Small variability The average value provides a good representation of the observations in the data set. Larger variability The same average value does not provide as good representation of the observations in the data set as before.

  21. ? ? ? The range • The range of a set of observations is the difference between the largest and smallest observations. • Its major advantage is the ease with which it can be computed. • Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points. But, how do all the observations spread out? The range cannot assist in answering this question Range Smallest observation Largest observation

  22. This measure reflects the dispersion of all the observations • The variance of a population of size N, x1, x2,…,xN whose mean is m is defined as • The variance of a sample of n observationsx1, x2, …,xnwhose mean is is defined as The Variance

  23. Sum = 0 Sum = 0 Why not use the sum of deviations? Consider two small populations: 9-10= -1 A measure of dispersion Should agrees with this observation. 11-10= +1 Can the sum of deviations Be a good measure of dispersion? The sum of deviations is zero for both populations, therefore, is not a good measure of dispersion. 8-10= -2 A 12-10= +2 8 9 10 11 12 …but measurements in B are more dispersed than those in A. The mean of both populations is 10... 4-10 = - 6 16-10 = +6 B 7-10 = -3 13-10 = +3 4 7 10 13 16

  24. The Variance Let us calculate the variance of the two populations Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!!

  25. The Variance Let us calculate the sum of squared deviations for both data sets Which data set has a larger dispersion? Data set B is more dispersed around the mean A B 1 2 3 1 3 5

  26. SumA = (1-2)2 +…+(1-2)2 +(3-2)2 +… +(3-2)2= 10 SumB = (1-3)2 + (5-3)2 = 8 The Variance SumA > SumB. This is inconsistent with the observation that set B is more dispersed. A B 1 3 1 2 3 5

  27. The Variance However, when calculated on “per observation” basis (variance), the data set dispersions are properly ranked. sA2 = SumA/N = 10/10 = 1 sB2 = SumB/N = 8/2 = 4 A B 1 3 1 2 3 5

  28. The Variance • Example 4 • The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance • Solution

  29. The Variance – Shortcut method

  30. Standard Deviation • The standard deviation of a set of observations is the square root of the variance .

  31. Standard Deviation • Example 5 • To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club. • The distances were recorded. • Which 7-iron is more consistent?

  32. Standard Deviation • Example 5 – solution Excel printout, from the “Descriptive Statistics” sub-menu. The innovation club is more consistent, and because the means are close, is considered a better club

  33. Interpreting Standard Deviation • The standard deviation can be used to • compare the variability of several distributions • make a statement about the general shape of a distribution. • The empirical rule: If a sample of observations has a mound-shaped distribution, the interval

  34. Interpreting Standard Deviation • Example 6A statistics practitioner wants to describe the way returns on investment are distributed. • The mean return = 10% • The standard deviation of the return = 8% • The histogram is bell shaped.

  35. Interpreting Standard Deviation Example 6 – solution • The empirical rule can be applied (bell shaped histogram) • Describing the return distribution • Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)] • Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)] • Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]

  36. The Chebyshev’s Theorem • For any value of k  1, greater than 100(1-1/k2)% of the data lie within the interval from to . • This theorem is valid for any set of measurements (sample, population) of any shape!! k Interval Chebyshev Empirical Rule 1 at least 0% approximately 68% 2 at least 75% approximately 95% 3 at least 89% approximately 99.7% (1-1/12) (1-1/22) (1-1/32)

  37. The Chebyshev’s Theorem • Example 7 • The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000,respectively. What can you say about the salaries at this chain? SolutionAt least 75% of the salaries lie between $22,000 and $34,000 28000 – 2(3000) 28000 + 2(3000) At least 88.9% of the salaries lie between $$19,000 and $37,000 28000 – 3(3000) 28000 + 3(3000)

  38. The Coefficient of Variation • The coefficient of variation of a set of measurements is the standard deviation divided by the mean value. • This coefficient provides a proportionate measure of variation. A standard deviation of 10 may be perceived large when the mean value is 100, but only moderately large when the mean value is 500

  39. Your score Sample Percentiles and Box Plots • Percentile • The pth percentile of a set of measurements is the value for which • p percent of the observations are less than that value • 100(1-p) percent of all the observations are greater than that value. • Example • Suppose your score is the 60% percentile of a SAT test. Then 40% 60% of all the scores lie here

  40. Sample Percentiles • To determine the sample 100p percentile of a data set of size n, determine a) At least np of the values are less than or equal to it. b) At least n(1-p) of the values are greater than or equal to it. • Find the 10 percentile of 6 8 3 6 2 8 1 • Order the data: 1 2 3 6 6 8 • Find np and n(1-p): 7(0.10) = 0.70 and 7(1-0.10) = 6.3 A data value such that at least 0.7 of the values are less than or equal to it and at least 6.3 of the values greater than or equal to it. So, the first observation is the 10 percentile.

  41. Quartiles • Commonly used percentiles • First (lower)decile = 10th percentile • First (lower) quartile, Q1= 25th percentile • Second (middle)quartile,Q2 = 50th percentile • Third quartile, Q3 = 75th percentile • Ninth (upper)decile = 90th percentile

  42. Quartiles • Example 8 Find the quartiles of the following set of measurements 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8

  43. 15 observations Quartiles • Solution Sort the observations 2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27, 29, 30 The first quartile At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left hand side. At most (.75)(15)=11.25 observations should appear above the first quartile. Check 11 observations on the right hand side. Comment:If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

  44. Location of Percentiles • Find the location of any percentile using the formula • Example 9 Calculate the 25th, 50th, and 75th percentile of the data in Example 1

  45. 0 5 Values 3.75 0 2.75 2 3 Location 1 Location Location 3 The 2.75th location Translates to the value (.75)(5 – 0) = 3.75 Location of Percentiles • Example 9 – solution • After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.

  46. Location of Percentiles • Example 9 – solution continued The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5.

More Related