The Practice of Statistics, 4 th edition - For AP* STARNES, YATES, MOORE

Chapter 1: Exploring Data Section 1.3 Describing Quantitative Data with Numbers The Practice of Statistics, 4th edition - For AP* STARNES, YATES, MOORE

Measuring Center: The Mean The most common measure of center is the ordinary arithmetic average, or mean. Definition: To find the mean(pronounced “x-bar”) of a set of observations, add their values and divide by the number of observations. If the n observations are x1, x2, x3, …, xn, their mean is: In mathematics, the capital Greek letter Σ is short for “add them all up.” Therefore, the formula for the mean can be written in more compact notation:

Measuring Center: The Median In Section 1.2, we introduced the median as an informal measure of center that describes the “midpoint” of a distribution. Now it’s time to offer an official “rule” for calculating the median. The median,M,is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger. . Definition: The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger. • To find the median of a distribution: • *Arrange all observations from smallest to largest. • *If the number of observations n is odd, the median M is the center observation in the ordered list. • *If the number of observations n is even, the median M is the average of the two center observations in the ordered list. • *Medians require little arithmetic, so they are easy to find by hand for small sets of data. Arranging even a moderate number of values in order is tedious, however, so finding the median by hand for larger sets of data is unpleasant.

Example 1: Use the data below to calculate the mean and median of the commuting times (in minutes) of 20 randomly selected New York workers. 0 5 1 005555 2 0005 3 00 4 005 5 6 005 7 8 5 Key: 4|5 represents a New York worker who reported a 45-minute travel time to work.

The previous example illustrates an important weakness of the mean as a measure of center: the mean is sensitive to the influence of extreme observations. These may be outliers, but a skewed distribution that has no outliers will also pull the mean toward its long tail. Because the mean cannot resist the influence of extreme observations, we say that it is not a resistant measure of center.

Comparing the Mean and the Median The mean and median measure center in different ways, and both are useful. Don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we might describe by the median. Comparing the Mean and the Median The mean and median of a roughly symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is usually farther out in the long tail than is the median.

Measuring Spread: The Interquartile Range (IQR) *A measure of center alone can be misleading. *A useful numerical description of a distribution requires both a measure of center and a measure of spread. How to Calculate the Quartiles and the Interquartile Range To calculate the quartiles: 1. Arrange the observations in increasing order and locate the median M. 2. The first quartile Q1is the median of the observations located to the left of the median in the ordered list. 3. The third quartile Q3is the median of the observations located to the right of the median in the ordered list. The interquartile range (IQR) is defined as: IQR = Q3 – Q1 Be careful in locating the quartiles when several observations take the same numerical value. Write down all the observations, arrange them in order, and apply the rules just as if they all had distinct values.

Example 2: Below are the travel times to work for 20 randomly selected New Yorkers. Find and interpret the interquartile range (IQR). M = 22.5 Q3= 42.5 Q1= 15 IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.

Identifying Outliers In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers. Definition: The 1.5 x IQR Rule for Outliers Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.

Example 3: Does the 1.5 × IQR rule identify any outliers for the New York travel time data? In the previous example, we found that Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes. For these data, 1.5 x IQR = 1.5(27.5) = 41.25 Q1–1.5 x IQR = 15 – 41.25 = –26.25 Q3+ 1.5 x IQR = 42.5 + 41.25 = 83.75 Any travel time shorter than –26.25 minutes or longer than 83.75 minutes is considered an outlier. 0 5 1 005555 2 0005 3 00 4 005 5 6 005 7 8 5 AP EXAM TIP: You may be asked to determine whether a quantitative data set has any outliers. Be prepared to state and use the rule for identifying outliers.

The Five-Number Summary *The minimum and maximum values alone tell us little about the distribution as a whole. Likewise, the median and quartiles tell us little about the tails of a distribution. *To get a quick summary of both center and spread, combine all five numbers. Definition: The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. Minimum Q1 M Q3 Maximum These five numbers divide each distribution roughly into quarters. About 25% of the data values fall between the minimum and Q1, about 25% are between Q1 and the median, about 25% are between the median and Q3, and about 25% are between Q3 and the maximum.

Boxplots (Box-and-Whisker Plots) The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot. How to Make a Boxplot *Draw and label a number line that includes the range of the distribution. *Draw a central box from Q1to Q3. *Note the median M inside the box. *Extend lines (whiskers) from the box out to the minimum and maximum values that are not outliers.

Example 4: Use the data from Example 2 to construct a boxplot. Max=85 Recall, this is an outlier by the 1.5 x IQR rule M = 22.5 Q1= 15 Min=5 Q3= 42.5

Measuring Spread: The Standard Deviation The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation. Let’s explore it! Consider the following data on the number of pets owned by a group of 9 children. 1 3 4 4 4 5 7 8 9 deviation: 1 – 5 = –4 deviation: 8 –5 = 3 • 1)Calculate the mean. • 2) Calculate each deviation. • deviation = observation – mean = 5

3) Square each deviation. 4) Find the “average” squared deviation. Calculate the sum of the squared deviations divided by (n-1)…this is called the variance. 5) Calculate the square root of the variance…this is the standard deviation. This value, 6.5, is called the variance. Standard deviation = square root of variance.

Definition: The standard deviationsxmeasures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. This average squared distance is called the variance.

Here’s a brief summary of the process for calculating the standard deviation. Many calculators report two standard deviations, giving you a choice of dividing by n or by n − 1. The former is usually labeled σx, the symbol for the standard deviation of a population. If your data set consists of the entire population, then it’s appropriate to use σx. More often, the data we’re examining come from a sample. In that case, we should use sx. More important than the details of calculating sx are the properties that determine the usefulness of the standard deviation. sx measures spread about the mean and should be used only when the mean is chosen as the measure of center. sx is always greater than or equal to 0. sx = 0 only when there is no variability. This happens only when all observations have the same value. Otherwise, sx > 0. As the observations become more spread out about their mean, sx gets larger. sx has the same units of measurement as the original observations. For example, if you measure metabolic rates in calories, both the mean and the standard deviation sx are also in calories. This is one reason to prefer sx to the variance , which is in squared calories.

Like the mean, , sx is not resistant. A few outliers can make sx very large. The use of squared deviations makes sx even more sensitive thanto a few extreme observations. For example, the standard deviation of the travel times for the 15 North Carolina workers is 15.23 minutes. If we omit the maximum value of 60 minutes, the standard deviation drops to 11.56 minutes.

Choosing Measures of Center and Spread We now have a choice between two descriptions for center and spread Mean and Standard Deviation Median and Interquartile Range Choosing Measures of Center and Spread • The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers. • Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers. • NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA! AP EXAM TIP: Use statistical terms carefully and correctly on the AP exam. Don’t say “mean” if you really mean “median.” Range is a single number; so are Q1, Q3, and IQR. Avoid colloquial use of language, like “the outlier skews the mean.” Skewed is a shape. If you misuse a term, expect to lose some credit.

The Practice of Statistics, 4 th edition - For AP* STARNES, YATES, MOORE