This section discusses the difference between analyzing populations and analyzing samples when calculating measures of central tendency and dispersion. It covers how parameters and statistics are used to represent descriptive measures, including the arithmetic mean, median, and mode. Examples are provided to illustrate the calculations for both odd and even numbers of observations.
Chapter 3 Numerically Summarizing Data
Chapter 3 – Numerically Summarizing Data • 3.1 Measures of Central Tendency • 3.2 Measures of Dispersion • 3.3 Measures of Central Tendency and Dispersion from Grouped Data • 3.4 Measures of Position • 3.5 The Five Number Summary and Boxplots
Chapter 3 Section 1 Measures of Central Tendency
Chapter 3 – Section 1 • Analyzing populations versus analyzing samples • For populations • We know all of the data • Descriptive measures of populations are called parameters • Parameters are often written using Greek letters ( μ ) • For samples • We know only part of the entire data • Descriptive measures of samples are called statistics • Statistics are often written using Roman letters ( $\bar{x}$ )
Chapter 3 – Section 1 • The arithmetic mean of a variable is often what people mean by the “average” … add up all the values and divide by how many there are • Compute the arithmetic mean of 6, 1, 5 • Add up the three numbers and divide by 3: (6 + 1 + 5) / 3 = 4.0 • The arithmetic mean is 4.0
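The "add up and divide" recipe can be sketched in a few lines of Python. The function name `arithmetic_mean` is an illustrative choice, not something taken from the slides.

```python
def arithmetic_mean(values):
    """Add up all the values and divide by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# The slide's example: (6 + 1 + 5) / 3
print(arithmetic_mean([6, 1, 5]))  # 4.0
```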
Chapter 3 – Section 1 • The arithmetic mean is usually called the mean • For a population … the population mean • Is computed using all the observations in a population • Is denoted μ • Is a parameter • For a sample … the sample mean • Is computed using only the observations in a sample • Is denoted $\bar{x}$ • Is a statistic
Chapter 3 – Section 1 • The median of a variable is the “center” • When the data are sorted in order, the median is the middle value • The calculation of the median differs slightly depending on whether the number of data points is odd or even
Chapter 3 – Section 1 • To calculate the median (M) of a data set • Arrange the data in order • Count the number of observations, n • If n is odd • There is a value that’s exactly in the middle • That value is the median M • If n is even • There are two values, one on either side of the exact middle • Take their mean to be the median M
Chapter 3 – Section 1 • An example with an odd number of observations (5 observations) • Compute the median of 6, 1, 11, 2, 11 • Sort them in order: 1, 2, 6, 11, 11 • The middle number is 6, so the median is 6
Chapter 3 – Section 1 • An example with an even number of observations (4 observations) • Compute the median of 6, 1, 11, 2 • Sort them in order: 1, 2, 6, 11 • Take the mean of the two middle values: (2 + 6) / 2 = 4 • The median is 4
Chapter 3 – Section 1 • One interpretation • The median splits the data into halves • Data: 62, 68, 71, 74, 77, 82, 84, 88, 90, 94 • M = (77 + 82) / 2 = 79.5 • 5 values on the left: 62, 68, 71, 74, 77 • 5 values on the right: 82, 84, 88, 90, 94
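The median procedure (sort, count, then handle odd and even n separately) could be written as the following sketch. The function name `median` is an illustrative choice; Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Return the middle value of the sorted data (odd n) or the
    mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # exactly one middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of the two middle values


print(median([6, 1, 11, 2, 11]))                         # 6
print(median([6, 1, 11, 2]))                             # 4.0
print(median([62, 68, 71, 74, 77, 82, 84, 88, 90, 94]))  # 79.5
```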
Chapter 3 – Section 1 • The mode of a variable is the most frequently occurring value • Find the mode of 6, 1, 2, 6, 11, 7, 3 • The distinct values are 1, 2, 3, 6, 7, 11 • The value 6 occurs twice, all the other values occur only once • The mode is 6
Chapter 3 – Section 1 • Qualitative data • Values are one of a set of categories • Cannot add or order them … the mean and median do not exist • The mode is the only one of these three measurements that exists • Find the mode of blue, blue, blue, red, green • The mode is “blue” because it is the value that occurs the most often
Chapter 3 – Section 1 • Quantitative data • The mode can be computed but sometimes it is not meaningful • Sometimes each value will only occur once (which can often happen with precise measurements) • Find the mode of 5.1, 6.6, 6.8, 9.3, 1.9 • Each value occurs only once • The mode is not a meaningful measurement
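Counting frequencies is enough to find the mode for both quantitative and qualitative data. The sketch below uses a hypothetical helper named `modes` that returns every value tied for the highest frequency, which makes the "each value occurs only once" case easy to spot.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes([6, 1, 2, 6, 11, 7, 3]))                     # [6]
print(modes(["blue", "blue", "blue", "red", "green"]))   # ['blue']

# Every value occurs once, so all five are returned and
# the mode is not a meaningful summary of this data set.
print(modes([5.1, 6.6, 6.8, 9.3, 1.9]))
```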
Chapter 3 – Section 1 • One interpretation • In primary elections, the candidate who receives the most votes is often called “the winner” • Votes (data values) are [table not recovered] • The mode is “Kayla” … Kayla is the winner
Chapter 3 – Section 1 • The mean and the median are often different • This difference gives us clues about the shape of the distribution • Is it symmetric? • Is it skewed left? • Is it skewed right? • Are there any extreme values?
Chapter 3 – Section 1 • Symmetric – the mean will usually be close to the median • Skewed left – the mean will usually be smaller than the median • Skewed right – the mean will usually be larger than the median
Chapter 3 – Section 1 • If a distribution is symmetric, the data values above and below the mean will balance • The mean will be in the “middle” • The median will be in the “middle” • Thus the mean will be close to the median, in general, for a distribution that is symmetric
Chapter 3 – Section 1 • If a distribution is skewed left, there will be some data values that are smaller than the others • The mean will decrease • The median will not decrease as much • Thus the mean will be smaller than the median, in general, for a distribution that is skewed left
Chapter 3 – Section 1 • If a distribution is skewed right, there will be some data values that are larger than the others • The mean will increase • The median will not increase as much • Thus the mean will be larger than the median, in general, for a distribution that is skewed right
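A quick way to see the mean and median pattern is to compare them on small data sets. The three lists below are invented purely for illustration and do not come from the slides.

```python
from statistics import mean, median

# Invented data sets, chosen only to illustrate the pattern.
symmetric = [1, 2, 3, 4, 5, 6, 7]
skewed_right = [1, 2, 2, 3, 3, 4, 15]      # a tail of unusually large values
skewed_left = [1, 12, 13, 13, 14, 14, 15]  # a tail of unusually small values

for name, data in [
    ("symmetric", symmetric),
    ("skewed right", skewed_right),
    ("skewed left", skewed_left),
]:
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data)}")
```

In these examples the mean is pulled toward the tail while the median stays near the bulk of the data, which is the tendency the slides describe with "in general".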
Chapter 3 – Section 1 • For a mostly symmetric distribution, the mean and the median will be roughly equal • Many variables, such as birth weights below, are approximately symmetric
Chapter 3 – Section 1 • What if one value is extremely different from the others (a so-called outlier)? • What if we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2? • The mean is now ( 6000 + 1 + 2 ) / 3 = 2001 • The median is still 2 • The median is “resistant to extreme values”
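The recording mistake can be replayed directly. This sketch computes the mean by hand and uses `statistics.median` from the Python standard library; the variable names are illustrative.

```python
from statistics import median

correct = [6, 1, 2]
mistyped = [6000, 1, 2]   # the 6 was recorded as 6000

for data in (correct, mistyped):
    data_mean = sum(data) / len(data)
    print(data, "mean =", data_mean, "median =", median(data))

# The mean jumps from 3.0 to 2001.0, while the median stays at 2.
```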
Summary: Chapter 3 – Section 1 • Mean • The center of gravity • Useful for roughly symmetric quantitative data • Median • Splits the data into halves • Useful for highly skewed quantitative data • Mode • The most frequent value • Useful for qualitative data
Chapter 3 Section 2 Measures of Dispersion
Chapter 3 – Section 2 • Learning objectives • 1. The range of a variable • 2. The variance of a variable • 3. The standard deviation of a variable • 4. Use the Empirical Rule • 5. Use Chebyshev’s inequality
Chapter 3 – Section 2 • Comparing two sets of data • The measures of central tendency (mean, median, mode) compare the “average” or “typical” values of two sets of data • The measures of dispersion in this section compare how far “spread out” the data values are
Chapter 3 – Section 2 • The range of a variable is the largest data value minus the smallest data value • Compute the range of 6, 1, 2, 6, 11, 7, 3, 3 • The largest value is 11 • The smallest value is 1 • Subtracting the two … 11 – 1 = 10 … the range is 10
Chapter 3 – Section 2 • The range only uses two values in the data set – the largest value and the smallest value • The range is not resistant • If we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2 • The range is now ( 6000 – 1 ) = 5999
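Because the range depends only on the two extreme values, a single bad entry changes it completely. A minimal sketch, with `data_range` as an illustrative name:

```python
def data_range(values):
    """Largest data value minus smallest data value."""
    return max(values) - min(values)


print(data_range([6, 1, 2, 6, 11, 7, 3, 3]))  # 10
print(data_range([6, 1, 2]))                  # 5
print(data_range([6000, 1, 2]))               # 5999, one mistyped value dominates
```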
Chapter 3 – Section 2 • The variance is based on the deviations from the mean • $(x_i - \mu)$ for populations • $(x_i - \bar{x})$ for samples • So that positive and negative deviations do not cancel each other, we square the deviations • $(x_i - \mu)^2$ for populations • $(x_i - \bar{x})^2$ for samples
Chapter 3 – Section 2 • The population variance of a variable is the sum of these squared deviations divided by the number in the population • $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ • The population variance is represented by $\sigma^2$ • Note: For accuracy, use as many decimal places as allowed by your calculator
Chapter 3 – Section 2 • Compute the population variance of 6, 1, 2, 11 • Compute the population mean first: μ = (6 + 1 + 2 + 11) / 4 = 5 • Now compute the squared deviations: $(1-5)^2 = 16$, $(2-5)^2 = 9$, $(6-5)^2 = 1$, $(11-5)^2 = 36$ • Average the squared deviations: (16 + 9 + 1 + 36) / 4 = 15.5 • The population variance $\sigma^2$ is 15.5
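The three steps of the population variance calculation (mean, squared deviations, average) map onto code one line each. The function name `population_variance` is an illustrative choice.

```python
def population_variance(values):
    """Average of the squared deviations from the population mean."""
    n = len(values)
    mu = sum(values) / n                                  # population mean
    squared_deviations = [(x - mu) ** 2 for x in values]  # (x_i - mu)^2
    return sum(squared_deviations) / n                    # divide by N


print(population_variance([6, 1, 2, 11]))  # 15.5
```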
Chapter 3 – Section 2 • The sample variance of a variable is the sum of these squared deviations divided by one less than the number in the sample • $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ • The sample variance is represented by $s^2$ • We say that this statistic has n – 1 degrees of freedom
Chapter 3 – Section 2 • Compute the sample variance of 6, 1, 2, 11 • Compute the sample mean first: $\bar{x}$ = (6 + 1 + 2 + 11) / 4 = 5 • Now compute the squared deviations: $(1-5)^2 = 16$, $(2-5)^2 = 9$, $(6-5)^2 = 1$, $(11-5)^2 = 36$ • Sum the squared deviations and divide by n – 1 = 3: (16 + 9 + 1 + 36) / 3 ≈ 20.7 • The sample variance $s^2$ is approximately 20.7
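Python's standard `statistics` module provides both versions of the variance: `pvariance` divides by N and `variance` divides by n – 1. The sketch below runs both on the same four numbers to show that only the divisor differs.

```python
from statistics import pvariance, variance

data = [6, 1, 2, 11]

print(pvariance(data))           # 15.5  (data treated as a whole population)
print(round(variance(data), 1))  # 20.7  (data treated as a sample)

# Same sum of squared deviations (62), different divisors (4 versus 3).
n = len(data)
print(variance(data) * (n - 1) / n)
```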
Chapter 3 – Section 2 • Why are the population variance (15.5) and the sample variance (20.7) different for the same set of numbers? • In the first case, { 6, 1, 2, 11 } was the entire population (divide by N) • In the second case, { 6, 1, 2, 11 } was just a sample from the population (divide by n – 1) • These are two different situations
Chapter 3 – Section 2 • Why do we use different formulas? • The sample mean is computed from the sample itself, so the data tend to lie closer to the sample mean than to the true population mean • If we used “n” in the denominator for the sample variance calculation, we would get a “biased” result • Bias here means that we would tend to underestimate the true variance
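The bias can be checked on a tiny case. The sketch below assumes, for illustration only, that { 6, 1, 2, 11 } is the whole population and that samples of size 2 are drawn with replacement; it lists all 16 possible samples and averages the two candidate formulas over them. The helper name `spread` is hypothetical.

```python
from itertools import product

population = [6, 1, 2, 11]
N = len(population)
mu = sum(population) / N
sigma_squared = sum((x - mu) ** 2 for x in population) / N  # true variance


def spread(sample, divisor):
    """Sum of squared deviations from the sample mean, over a chosen divisor."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample) / divisor


n = 2
samples = list(product(population, repeat=n))  # all 16 ordered samples

avg_divide_by_n = sum(spread(s, n) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(spread(s, n - 1) for s in samples) / len(samples)

print(sigma_squared)            # 15.5
print(avg_divide_by_n)          # 7.75  (too small on average)
print(avg_divide_by_n_minus_1)  # 15.5  (matches the true variance on average)
```

Under these assumptions, dividing by n underestimates the true variance on average, while dividing by n – 1 recovers it.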
Chapter 3 – Section 2 • The standard deviation is the square root of the variance • The population standard deviation • Is the square root of the population variance ($\sigma^2$) • Is represented by $\sigma$ • The sample standard deviation • Is the square root of the sample variance ($s^2$) • Is represented by $s$
Chapter 3 – Section 2 • If the population is { 6, 1, 2, 11 } • The population variance $\sigma^2 = 15.5$ • The population standard deviation $\sigma = \sqrt{15.5} \approx 3.94$ • If the sample is { 6, 1, 2, 11 } • The sample variance $s^2 \approx 20.7$ • The sample standard deviation $s = \sqrt{20.7} \approx 4.55$ • The population standard deviation and the sample standard deviation apply in different situations
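Taking the square root of each variance gives the corresponding standard deviation. This sketch uses `math.sqrt` together with the standard library's `pvariance` and `variance`, and compares the results with the ready-made `pstdev` and `stdev` functions.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [6, 1, 2, 11]

sigma = math.sqrt(pvariance(data))  # population standard deviation
s = math.sqrt(variance(data))       # sample standard deviation

print(round(sigma, 2))  # 3.94
print(round(s, 2))      # 4.55

# The library shortcuts agree with taking the square root by hand.
print(math.isclose(sigma, pstdev(data)), math.isclose(s, stdev(data)))
```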
Chapter 3 – Section 2 • The standard deviation is very useful for estimating probabilities
Chapter 3 – Section 2 • The empirical rule • If the distribution is roughly bell shaped, then • Approximately 68% of the data will lie within 1 standard deviation of the mean • Approximately 95% of the data will lie within 2 standard deviations of the mean • Approximately 99.7% of the data (i.e. almost all) will lie within 3 standard deviations of the mean
Chapter 3 – Section 2 • For a variable with mean 17 and standard deviation 3.4 • Approximately 68% of the values will lie between (17 – 3.4) and (17 + 3.4), i.e. 13.6 and 20.4 • Approximately 95% of the values will lie between (17 – 2 × 3.4) and (17 + 2 × 3.4), i.e. 10.2 and 23.8 • Approximately 99.7% of the values will lie between (17 – 3 × 3.4) and (17 + 3 × 3.4), i.e. 6.8 and 27.2 • A value of 2.1 and a value of 33.2 would both be very unusual
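The empirical-rule intervals are just the mean plus or minus 1, 2, and 3 standard deviations. The helper names `empirical_rule_intervals` and `is_unusual` below are illustrative, and the rule itself only applies when the distribution is roughly bell shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals expected to hold about 68%, 95%, and 99.7% of the data."""
    return {
        label: (mean - k * sd, mean + k * sd)
        for k, label in [(1, "68%"), (2, "95%"), (3, "99.7%")]
    }


def is_unusual(value, mean, sd, k=3):
    """True when the value lies more than k standard deviations from the mean."""
    return abs(value - mean) > k * sd


for label, (low, high) in empirical_rule_intervals(17, 3.4).items():
    print(f"about {label} of values between {low:.1f} and {high:.1f}")

print(is_unusual(2.1, 17, 3.4))   # True
print(is_unusual(33.2, 17, 3.4))  # True
```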
Chapter 3 – Section 2 • Chebyshev’s inequality gives a lower bound on the percentage of observations that lie within k standard deviations of the mean (where k > 1) • This lower bound is • A conservative estimate of the percentage • The actual percentage for any variable cannot be lower than this number • Therefore the actual percentage must be this value or higher
Chapter 3 – Section 2 • Chebyshev’s inequality • For any data set, at least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the observations will lie within k standard deviations of the mean, where k is any number greater than 1
Chapter 3 – Section 2 • How much of the data lies within 1.5 standard deviations of the mean? • From Chebyshev’s inequality, $\left(1 - \frac{1}{1.5^2}\right) \cdot 100\% \approx 55.6\%$, so at least 55.6% of the data will lie within 1.5 standard deviations of the mean
Chapter 3 – Section 2 • If the mean is equal to 20 and the standard deviation is equal to 4, how much of the data lies between 14 and 26? • 14 and 26 are each 1.5 standard deviations from 20, since (26 – 20) / 4 = 1.5, so at least 55.6% of the data will lie between 14 and 26
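Both Chebyshev examples reduce to evaluating 1 – 1/k² for the right k. A minimal sketch, with `chebyshev_lower_bound` and `standard_deviations_from_mean` as illustrative names:

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's inequality is only informative for k > 1")
    return 1 - 1 / k ** 2


def standard_deviations_from_mean(value, mean, sd):
    """How many standard deviations the value lies from the mean."""
    return abs(value - mean) / sd


print(f"{chebyshev_lower_bound(1.5):.1%}")  # 55.6%

# Mean 20, standard deviation 4: 14 and 26 are both k = 1.5 away.
k = standard_deviations_from_mean(26, 20, 4)
print(k, f"{chebyshev_lower_bound(k):.1%}")  # 1.5 55.6%
```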
Summary: Chapter 3 – Section 2 • Range • The maximum minus the minimum • Not a resistant measurement • Variance and standard deviation • Measures deviations from the mean • Not a resistant measurement • Empirical rule • About 68% of the data is within 1 standard deviation • About 95% of the data is within 2 standard deviations
Chapter 3 Section 3 Measures of Central Tendency and Dispersion from Grouped Data
Chapter 3 – Section 3 • Learning objectives • 1. The mean from grouped data • 2. The weighted mean • 3. The variance and standard deviation for grouped data
Chapter 3 – Section 3 • Data may come in groups rather than individually • The values may have been summarized in frequency distributions • Ranges of ages (20 – 29, 30 – 39, ...) • Ranges of incomes ($10,000 – $19,999, $20,000 – $39,999, $40,000 – $79,999, ...) • The exact values for the mean, variance, and standard deviation cannot be calculated
Chapter 3 – Section 3 • To compute the mean for grouped data • Assume that, within each class, the mean of the data is equal to the class midpoint • Use the class midpoint in the formula for the mean • The number of times the class midpoint value is used is equal to the frequency of the class • If 6 values are in the interval [ 8, 10 ], then we assume that all 6 values are equal to 9 (the midpoint of [ 8, 10 ])
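The midpoint assumption turns the grouped mean into a frequency-weighted average of class midpoints. In the sketch below, the function name `grouped_mean` and the second class, [ 10, 12 ] with frequency 4, are invented for illustration; only the class [ 8, 10 ] with 6 values comes from the slide.

```python
def grouped_mean(classes):
    """Approximate mean from (lower, upper, frequency) classes,
    treating every value in a class as equal to the class midpoint."""
    total = 0.0
    count = 0
    for lower, upper, frequency in classes:
        midpoint = (lower + upper) / 2
        total += midpoint * frequency   # midpoint used once per observation
        count += frequency
    return total / count


classes = [
    (8, 10, 6),    # from the slide: 6 values, all treated as 9
    (10, 12, 4),   # hypothetical class added for illustration
]
print(grouped_mean(classes))  # 9.8
```

The result is an approximation, since the exact values inside each class are unknown.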