Topic 16 – Numerically Summarizing Data: Averages and Five-Number Summary
Topic 16 - Averages • The arithmetic mean of a variable is often what people mean by the “average”: add up all the values and divide by how many there are • Compute the arithmetic mean of 6, 1, 5 • Add up the three numbers and divide by 3: (6 + 1 + 5) / 3 = 4.0 • The arithmetic mean is 4.0
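The calculation on this slide can be written as a short Python sketch. The function name `arithmetic_mean` is an illustrative choice, not something from the slides.

```python
def arithmetic_mean(values):
    """Add up all the values and divide by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# The slide's example: (6 + 1 + 5) / 3
print(arithmetic_mean([6, 1, 5]))  # 4.0
```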
Topic 16 – Median • The median of a variable is the “center” • When the data is sorted in order, the median is the middle value • The calculation of the median differs slightly depending on whether there is an odd or an even number of data points
Topic 16 – Median • To calculate the median (M) of a data set • Arrange the data in order • Count the number of observations, n • If n is odd • There is a value that’s exactly in the middle • That value is the median M • If n is even • There are two values on either side of the exact middle • Take their mean to be the median M
Topic 16 – Median • An example with an odd number of observations (5 observations) • Compute the median of 6, 1, 11, 2, 11 • Sort them in order: 1, 2, 6, 11, 11 • The middle number is 6, so the median is 6
Topic 16 – Median • An example with an even number of observations (4 observations) • Compute the median of 6, 1, 11, 2 • Sort them in order: 1, 2, 6, 11 • Take the mean of the two middle values: (2 + 6) / 2 = 4 • The median is 4
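The odd and even cases of the median can be handled in one small function. This is a minimal sketch; the name `median` and the error handling are illustrative assumptions.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)          # arrange the data in order
    n = len(ordered)                  # count the observations
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                    # n is odd: one value is exactly in the middle
        return ordered[mid]
    # n is even: average the two values on either side of the middle
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([6, 1, 11, 2, 11]))  # 6
print(median([6, 1, 11, 2]))      # 4.0
```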
Topic 16 – Median • One interpretation • The median splits the data into halves • Data: 62, 68, 71, 74, 77, 82, 84, 88, 90, 94 • M = (77 + 82) / 2 = 79.5 • 5 values on the left: 62, 68, 71, 74, 77 • 5 values on the right: 82, 84, 88, 90, 94
Topic 16 – Mode • The mode of a variable is the most frequently occurring value • Find the mode of 6, 1, 2, 6, 11, 7, 3 • The distinct values are 1, 2, 3, 6, 7, 11 • The value 6 occurs twice; all the other values occur only once • The mode is 6
Topic 16 – Mode • Qualitative data • Values are one of a set of categories • The categories cannot be added or ordered, so the mean and median do not exist • The mode is the only one of these three measurements that exists • Find the mode of blue, blue, blue, red, green • The mode is “blue” because it is the value that occurs the most often
Topic 16 – Mode • Quantitative data • The mode can be computed but sometimes it is not meaningful • Sometimes each value will only occur once (which can often happen with precise measurements) • Find the mode of 5.1, 6.6, 6.8, 9.3, 1.9 • Each value occurs only once • The mode is not a meaningful measurement
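The three mode examples can be checked with a short sketch built on `collections.Counter`. Returning an empty list when no value repeats is an assumption made here to mirror the "not meaningful" case; the slides do not prescribe a return value.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # every value occurs once: the mode is not meaningful
    return [value for value, count in counts.items() if count == top]


print(modes([6, 1, 2, 6, 11, 7, 3]))                    # [6]
print(modes(["blue", "blue", "blue", "red", "green"]))  # ['blue']
print(modes([5.1, 6.6, 6.8, 9.3, 1.9]))                 # []
```

Because the function only counts occurrences, it works for qualitative data as well as quantitative data.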
Topic 16 – mean, median & mode • The mean and the median are often different • This difference gives us clues about the shape of the distribution • Is it symmetric? • Is it skewed left? • Is it skewed right? • Are there any extreme values?
Topic 16 – mean, median & mode • Symmetric – the mean will usually be close to the median • Skewed left – the mean will usually be smaller than the median • Skewed right – the mean will usually be larger than the median
Topic 16 – mean, median & mode • If a distribution is symmetric, the data values above and below the mean will balance • The mean will be in the “middle” • The median will be in the “middle” • Thus the mean will be close to the median, in general, for a distribution that is symmetric
Topic 16 – mean, median & mode • If a distribution is skewed left, there will be some data values that are smaller than the others • These values pull the mean down • The median will not decrease as much • Thus the mean will be smaller than the median, in general, for a distribution that is skewed left
Topic 16 – mean, median & mode • If a distribution is skewed right, there will be some data values that are larger than the others • These values pull the mean up • The median will not increase as much • Thus the mean will be larger than the median, in general, for a distribution that is skewed right
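The relationship between skew and the mean/median gap can be illustrated with three tiny data sets. These numbers are hypothetical, made up only for this sketch, and are not taken from the slides.

```python
from statistics import mean, median

# Hypothetical data sets, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
skewed_right = [1, 2, 3, 4, 20]    # one unusually large value
skewed_left = [1, 17, 18, 19, 20]  # one unusually small value

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")

# symmetric:    mean and median are both 3
# skewed right: mean is 6, larger than the median of 3
# skewed left:  mean is 15, smaller than the median of 18
```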
Topic 16 – mean, median & mode • For a mostly symmetric distribution, the mean and the median will be roughly equal • Many variables, such as birth weights below, are approximately symmetric
Topic 16 – mean, median & mode • What if one value is extremely different from the others (a so-called outlier)? • What if we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2? • The mean is now (6000 + 1 + 2) / 3 = 2001 • The median is still 2 • The median is “resistant to extreme values”
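The recording mistake on this slide is easy to reproduce with Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean, median

correct = [6, 1, 2]
mistyped = [6000, 1, 2]  # the 6 was recorded as 6000

# The mean jumps from 3 to 2001, while the median stays at 2.
print("mean:  ", mean(correct), "->", mean(mistyped))
print("median:", median(correct), "->", median(mistyped))
```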
Topic 16 – Summary for the Measure of Center • Mean • The center of gravity • Useful for roughly symmetric quantitative data • Median • Splits the data into halves • Useful for highly skewed quantitative data • Mode • The most frequent value • Useful for qualitative data
Topic 16 – Measure of Spread/Dispersion • Comparing two sets of data • The measures of central tendency (mean, median, mode) capture differences in the “average” or “typical” values of two sets of data • The measures of dispersion in this section capture differences in how far “spread out” the data values are
Topic 16 – Range • The range of a variable is the largest data value minus the smallest data value • Compute the range of 6, 1, 2, 6, 11, 7, 3, 3 • The largest value is 11 • The smallest value is 1 • Subtracting the two: 11 – 1 = 10, so the range is 10
Topic 16 – Range • The range only uses two values in the data set: the largest value and the smallest value • The range is not resistant • If we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2 • The range is now 6000 – 1 = 5999, instead of 6 – 1 = 5
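A minimal sketch of the range, applied to the examples from the slides. The name `data_range` is an illustrative choice (it avoids shadowing Python's built-in `range`).

```python
def data_range(values):
    """Largest data value minus the smallest data value."""
    return max(values) - min(values)


print(data_range([6, 1, 2, 6, 11, 7, 3, 3]))  # 10
print(data_range([6, 1, 2]))                  # 5
print(data_range([6000, 1, 2]))               # 5999: one bad value changes the range
```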
Topic 16 – Percentiles • The median divides the lower 50% of the data from the upper 50% • The median is the 50th percentile • If a number divides the lower 34% of the data from the upper 66%, that number is the 34th percentile
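One way to check the "divides the lower p% from the upper part" idea is to count the share of data values below a candidate number. The helper `percent_below` is a hypothetical name, and the data are the ten values from the median slide.

```python
def percent_below(values, x):
    """Percentage of data values strictly below x."""
    return 100 * sum(v < x for v in values) / len(values)


scores = [62, 68, 71, 74, 77, 82, 84, 88, 90, 94]

# The median 79.5 has half of the data below it, so it is the 50th percentile.
print(percent_below(scores, 79.5))  # 50.0
```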
Topic 16 - Quartiles • The quartiles are the 25th, 50th, and 75th percentiles • Q1 = 25th percentile • Q2 = 50th percentile = median • Q3 = 75th percentile • Quartiles are the most commonly used percentiles • The 50th percentile and the second quartile Q2 are both other ways of defining the median
Topic 16 - Quartiles • Quartiles divide the data set into four equal parts • The top quarter are the values between Q3 and the maximum • The bottom quarter are the values between the minimum and Q1
Topic 16 - Quartiles • Quartiles divide the data set into four equal parts • The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 – Q1 • The IQR is a resistant measurement of dispersion
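The sketch below computes quartiles and the IQR under one stated assumption: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data (leaving out the middle value when n is odd). Other textbooks and software use slightly different conventions and can give slightly different quartiles. The data are the ten values from the median slide.

```python
def _median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return _median(lower), _median(ordered), _median(upper)


def iqr(values):
    """Interquartile range: IQR = Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


scores = [62, 68, 71, 74, 77, 82, 84, 88, 90, 94]
print(quartiles(scores))  # (71, 79.5, 88)
print(iqr(scores))        # 17

# Resistance: mistyping the largest value leaves the IQR unchanged.
print(iqr([62, 68, 71, 74, 77, 82, 84, 88, 90, 9400]))  # 17
```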
Topic 16 – Five Number Summary • The five-number summary is the collection of • The smallest value • The first quartile (Q1 or P25) • The median (M or Q2 or P50) • The third quartile (Q3 or P75) • The largest value • These five numbers give a concise description of the distribution of a variable
Topic 16 – Five Number Summary • The median • Information about the center of the data • Resistant • The first quartile and the third quartile • Information about the spread of the data • Resistant • The smallest value and the largest value • Information about the tails of the data • Not resistant
Topic 16 – Five Number Summary • Compute the five-number summary for 1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54 • Calculations • The minimum = 1 • Q1 = P25 = 7 • M = Q2 = P50 = (16 + 19) / 2 = 17.5 • Q3 = P75 = 27 • The maximum = 54 • The five-number summary is 1, 7, 17.5, 27, 54
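The following sketch reproduces the slide's five numbers. It assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data; the slides do not say which quartile rule they use, and other rules (including those in statistical software) may give slightly different quartiles for other data sets. The function name is illustrative.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the middle value left out of both halves when n is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


data = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54]
print(five_number_summary(data))  # (1, 7, 17.5, 27, 54)
```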
Topic 16 – Boxplot • The five-number summary can be illustrated using a graph called the boxplot • An example of a (basic) boxplot is • The middle box shows Q1, Q2, and Q3 • The horizontal lines (sometimes called “whiskers”) show the minimum and maximum
Topic 16 – Boxplot • To draw a (basic) boxplot: • Calculate the five-number summary • Draw a horizontal line that will cover all the data from the minimum to the maximum • Draw a box with the left edge at Q1 and the right edge at Q3 • Draw a line inside the box at M = Q2 • Draw a horizontal line from the Q1 edge of the box to the minimum and one from the Q3 edge of the box to the maximum
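The drawing steps can be mimicked with a rough text rendering. This is only a sketch: the function `text_boxplot`, the character choices, and the default width of 60 columns are all assumptions made for illustration, and a plotting library would normally be used instead.

```python
def text_boxplot(summary, width=60):
    """Render a five-number summary as one line of text.

    '|' marks the minimum and maximum, '-' the whiskers,
    '[' and ']' the box edges at Q1 and Q3, and 'M' the median.
    """
    lo, q1, med, q3, hi = summary
    if hi <= lo:
        raise ValueError("the maximum must be larger than the minimum")

    def col(x):
        # Map a data value to a column between 0 and width - 1.
        return round((x - lo) / (hi - lo) * (width - 1))

    chars = ["-"] * width                 # line covering minimum to maximum
    for i in range(col(q1), col(q3) + 1):
        chars[i] = "="                    # body of the box from Q1 to Q3
    chars[col(q1)] = "["                  # left edge at Q1
    chars[col(q3)] = "]"                  # right edge at Q3
    chars[col(med)] = "M"                 # line inside the box at the median
    chars[0] = chars[-1] = "|"            # minimum and maximum
    return "".join(chars)


line = text_boxplot((1, 7, 17.5, 27, 54))
print(line)
```

For the summary 1, 7, 17.5, 27, 54, the right whisker comes out much longer than the left one, which hints at a right-skewed distribution.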
Topic 16 – Boxplot • To draw a (basic) boxplot • Draw the middle box • Draw in the median • Draw the minimum and maximum • Voila!
Topic 16 – Boxplot • Symmetric distributions [Figure: boxplots with Min, Q1, M, Q3, and Max labeled]
Topic 16 – Boxplot • Skewed left distributions [Figure: boxplots with Min, Q1, M, Q3, and Max labeled]
Topic 16 – Boxplot • Skewed right distributions [Figure: boxplots with Min, Q1, M, Q3, and Max labeled]
Topic 16 – Boxplot • Comparing the “flight” with the “control” samples [Figure: boxplots of the two samples, annotated “Center” and “Spread”]