Overview • 3.1 Measures of Center • 3.2 Measures of Variability • 3.4 Measures of Position • 3.6 Robust Measures
3.1 Measures of Center Objectives: By the end of this section, I will be able to… • Calculate the mean for a given data set. • Find the median, and describe why the median is sometimes preferable to the mean. • Find the mode of a data set. • Describe how skewness affects these measures of center.
The Mean • Most well known and widely used measure of center • Simply add up all the numbers and divide by how many numbers you have.
Notation Statisticians like to use specialized notation. • The sample size, that is, how many observations you have in your sample data set, is always denoted by n • The ith data value is denoted by x_i, where i is simply an index or counter indicating a data point • “Add them together” is denoted by Σ (capital sigma) • The sample mean is called x̄ (pronounced “x-bar”)
The sample mean Written as x̄ = Σx / n. In plain English, this just means that, in order to find the mean x̄, we 1. Add up all the data values, giving us Σx. 2. Divide by how many observations are in the data set, giving us Σx / n.
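The two steps above (add up the values, then divide by n) can be sketched in a few lines of Python. The function name `sample_mean` and the data values are made up for illustration; they are not from the slides.

```python
def sample_mean(values):
    """Return the sample mean: the sum of the data values divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / n  # step 1: sum of x, step 2: divide by n


# Hypothetical data, chosen only for illustration.
scores = [4, 8, 15, 16, 23, 42]
x_bar = sample_mean(scores)
print(x_bar)  # 18.0
```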
The Population Mean μ • The mean value of a population is usually unknown • Use x̄ to estimate μ • Denote the population mean with μ (mu) • Population size is denoted by N. • The mean is sensitive to the presence of extreme values
The Median • The middle data value when the data are put into ascending order • Half of the data values lie below the median and half lie above • If the sample size n is odd, then the median is a unique middle value. • That is, the ((n + 1)/2)th observation when the data are put in ascending order. • If the sample size n is even, then the median is the mean of the two data values in the middle. • That is, the median is the mean of the two data values that lie on either side of the (n + 1)/2 position.
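The odd and even cases can be sketched as follows. This is a minimal illustration, not a definitive implementation; the function name `median` and the sample values are assumptions made for the example.

```python
def median(values):
    """Return the median: the middle value of the sorted data."""
    data = sorted(values)  # put the data in ascending order
    n = len(data)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the unique middle value (0-based index n // 2).
        return data[mid]
    # Even n: the mean of the two values on either side of the middle.
    return (data[mid - 1] + data[mid]) / 2


# Hypothetical data, chosen only for illustration.
print(median([3, 1, 7]))     # 3
print(median([3, 1, 7, 5]))  # 4.0
```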
The Mode • French speakers will recognize that the term mode in French refers to fashion • The popularity of clothing often depends on just which style is in fashion • In a data set, the value that is most “in fashion” is the value that occurs the most • The mode of a data set is the data value that occurs with the greatest frequency
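Finding the value with the greatest frequency can be sketched with `collections.Counter` from the Python standard library. The function name `modes` is hypothetical, and returning every tied value is a design choice of this sketch (a data set may have more than one mode). Because only counting is needed, it also works for categorical data.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Hypothetical data, chosen only for illustration.
print(modes(["red", "blue", "red", "green"]))  # ['red']
print(modes([1, 1, 2, 2, 3]))                  # [1, 2]
```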
Example 3.5 - Cost of mathematical journals The rising cost of research journals has been taking an increasing bite out of library and research budgets. Table 3.3 contains the annual subscription cost of ten research journals in mathematics and statistics for 2006. Find the following. a. The mean journal subscription cost b. The median journal subscription cost c. The mode journal subscription cost
Example 3.5 continued Table 3.3 Annual subscription cost for ten research journals
Example 3.5 continued Solution a. The sample mean journal cost is
Example 3.5 continued b. Since we have n = 10 journals, the median is the mean of the two data values that lie on either side of the (n + 1)/2 = (10 + 1)/2 = 5.5th position. The median is the mean of the 5th and 6th data values, $850 and $1022: median journal cost = ($850 + $1022)/2 = $936
Example 3.5 continued c. • The mode is the data value that occurs with the greatest frequency. • Only two journals share the same cost, $250 each. • No other cost occurs more than once. • Mode = $250. • The mode is not a very good measure of center for this data set because it is the minimum value. • This illustrates a weakness of using the mode as a measure of center.
How Skewness Affects the Mean and Median • For a right-skewed distribution, the mean is larger than the median. • For a left-skewed distribution, the median is larger than the mean. • For a symmetric unimodal distribution, the mean, median, and mode are fairly close to one another. FIGURE 3.5 How skewness affects the mean and median.
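A quick numeric check of the first two statements, using the standard library `statistics` module. Both data sets are made up for illustration: one has a single large value pulling the mean to the right, the other a single small value pulling it to the left.

```python
from statistics import mean, median

# Hypothetical right-skewed data: one large value stretches the right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Hypothetical left-skewed data: one small value stretches the left tail.
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]

print(mean(right_skewed), median(right_skewed))  # 4.75 3.0
print(mean(left_skewed), median(left_skewed))    # 16.25 18.0
```

In both cases the extreme value moves the mean much more than the median, which is why the median is sometimes preferred for skewed data.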
Exploratory Data Analysis • Using graphical methods to compare numerical statistics FIGURE 3.6 Dotplots of the percentage net price change for the Dow Jones Industrial Average, the randomly selected darts portfolio, and the professionally selected portfolio.
Summary • The sample mean represents the sum of the data values in the sample divided by the sample size (n). • The population mean (μ) represents the sum of the data values in the population divided by the population size (N). • The mean is sensitive to the presence of extreme values.
Summary • The median occupies the middle position when the data are put in ascending order and is not sensitive to extreme values. • The mode is the data value that occurs with the greatest frequency. • Modes can be applied to categorical data as well as numerical data but are not always reliable as measures of center.
Summary • The skewness of a distribution can often tell us something about the relative values of the mean and the median.
3.2 Measures of Variability Objectives: By the end of this section, I will be able to… • Understand and calculate the range of a data set. • Explain in my own words what a deviation is. • Calculate the variance and the standard deviation for a population or a sample.
The Range • The difference between the largest value and the smallest value in the data set: range = largest value – smallest value • Simplest measure of variability • Larger range is an indication of greater variability
Example 3.8 - Range of the volleyball teams’ heights Calculate the range of player heights for each of the WMU and NCU teams. FIGURE 3.11 Comparative dotplots of the heights of two volleyball teams.
Example 3.8 continued Solution • Figure 3.11 shows that the WMU heights are more spread out than the NCU heights. • The range of the WMU team should therefore be larger than the range of the NCU team, reflecting greater variability. range_WMU = 75 - 60 = 15 inches range_NCU = 72 - 66 = 6 inches
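The range calculation can be sketched as below. The individual heights are made up, since the dotplot data are not reproduced here; only the smallest and largest value of each team match the example. The function name `data_range` is hypothetical.

```python
def data_range(values):
    """Return the range: largest value minus smallest value."""
    return max(values) - min(values)


# Hypothetical heights in inches; only the minimum and maximum of each
# list are taken from the example.
wmu_heights = [60, 66, 69, 70, 72, 75]
ncu_heights = [66, 68, 69, 70, 71, 72]

print(data_range(wmu_heights))  # 15
print(data_range(ncu_heights))  # 6
```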
What Is a Deviation? • A deviation for a given data value x is the difference between the data value and the mean of the data set. • For a sample, the deviation equals x - x̄. • For a population, the deviation equals x - μ. • If the data value is larger than the mean, the deviation will be positive
Deviation • If the data value is smaller than the mean, the deviation will be negative • If the data value equals the mean, the deviation will be zero • A deviation can roughly be thought of as the distance between a data value and the mean • The deviation can be negative, while distance is always positive • The mean deviation is not a useful measure of spread because the sum of the deviations is always zero.
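The claim that the deviations always sum to zero can be checked on a small example. The data values below are made up for illustration.

```python
# Hypothetical data, chosen only for illustration.
data = [2, 4, 6, 8, 10]
x_bar = sum(data) / len(data)  # 6.0

# Each deviation is the data value minus the mean.
deviations = [x - x_bar for x in data]
print(deviations)       # [-4.0, -2.0, 0.0, 2.0, 4.0]
print(sum(deviations))  # 0.0
```

The negative and positive deviations cancel, which is the motivation for squaring them before averaging.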
Population Variance σ² • Symbolized by the lowercase Greek letter sigma squared, σ² • Is the mean of the squared deviations in the population and is found by σ² = Σ(x - μ)² / N
The Population Standard Deviation σ • The square root of the variance • Represents a distance from the mean that is representative for that data set • Not the mean deviation, which is always zero
Sample Variance s² • Based on the idea of finding the sum of the squared deviations Σ(x - x̄)² and then dividing by the sample size to get the mean squared deviation • Statisticians found that dividing by n - 1 instead gives a better estimate: s² = Σ(x - x̄)² / (n - 1)
Sample Standard Deviation s • The square root of the sample variance s² • Second most important statistic • The value of s may be interpreted as the typical difference between a data value and the sample mean
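Computational Formulas Population variance: σ² = [Σx² - (Σx)²/N] / N Population standard deviation: σ = √([Σx² - (Σx)²/N] / N) Sample variance: s² = [Σx² - (Σx)²/n] / (n - 1) Sample standard deviation: s = √([Σx² - (Σx)²/n] / (n - 1))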
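A minimal sketch comparing the definitional form (mean of squared deviations) with the usual computational shortcut, which works from Σx and Σx² without computing each deviation. The function names and the data are hypothetical; the results are cross-checked against `pvariance` and `variance` from the standard library `statistics` module.

```python
import math
import statistics


def population_variance(values):
    """Definitional form: the mean of the squared deviations from mu."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n


def sample_variance(values):
    """Definitional form, dividing by n - 1 instead of n."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def population_variance_shortcut(values):
    """Computational form: (sum of x^2 - (sum of x)^2 / N) / N."""
    big_n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / big_n) / big_n


# Hypothetical data, chosen only for illustration.
data = [2, 4, 6, 8, 10]
print(population_variance(data))           # 8.0
print(population_variance_shortcut(data))  # 8.0
print(sample_variance(data))               # 10.0
print(math.sqrt(population_variance(data)))  # population standard deviation
print(math.sqrt(sample_variance(data)))      # sample standard deviation

# Cross-check against the standard library.
assert math.isclose(population_variance(data), statistics.pvariance(data))
assert math.isclose(sample_variance(data), statistics.variance(data))
```

For the same data, the sample variance is always larger than the population variance because it divides by n - 1 rather than n.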
Example 3.15 - Calculating the population variance and population standard deviation using the calculator. Table 3.13 lists the amount of farmland (in 1000s of acres) in each county in the state of Connecticut. Since the data set contains all N = 8 counties in Connecticut, it can be considered a population. Calculate the population variance and population standard deviation using the calculator.
Example 3.15 continued Table 3.13 Farmland in Connecticut
Example 3.15 continued • The population standard deviation is therefore: • The standard deviation of farmland for all counties in Connecticut is almost 25,100 acres.
Summary • The simplest measure of variability, or measure of spread, is the range. • The range is simply the difference between the maximum and minimum values in a data set • The range has drawbacks because it relies on the two most extreme data values. • A deviation is the difference between a data value and the mean of the data values.
Summary • The variance and standard deviation are measures of spread that utilize all available data values. • The population variance can be thought of as the mean squared deviation. • The standard deviation is the square root of the variance. • Standard deviation is a typical deviation, that is, the typical difference between a data value and the mean.
3.4 Measures of Position Objectives: By the end of this section, I will be able to… • Find percentiles for both small and large data sets.
Percentile • Let p be any integer between 0 and 100. • The pth percentile of a data set is the data value at which p percent of the values in the data set are less than or equal to this value.
Example 3.24 - Finding percentiles of a small data set Yolanda would like to go to a prestigious graduate school of the arts. She knows that this school accepts only those students who score at the 75th percentile or higher in a grueling dance audition. The following data represent the dance audition scores of Yolanda’s group. Yolanda scored 85. Find the 75th percentile of the data set. Will Yolanda be accepted at the prestigious graduate school of the arts? 78 56 89 44 65 94 81 62 75 85 30 68
Example 3.24 continued Solution Step 1: Sort the data into ascending order 30 44 56 62 65 68 75 78 81 85 89 94 Step 2: Since we want the 75th percentile, p = 75. There are 12 scores, so n = 12. Calculate i = (p/100) · n = (75/100) · 12 = 9. So, i = 9.
Example 3.24 continued Step 3: Here, since i is an integer, the 75th percentile is the mean of the data values in positions 9 and 10. • Data value in the ninth position is 81. • Data value in the tenth position is 85. • Mean of these values is 83. Thus, the 75th percentile is 83. Yolanda’s dance score of 85 is therefore above the 75th percentile. She will be accepted to the prestigious graduate school.
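The three steps can be sketched in Python using the audition scores from the example. The integer case (take the mean of positions i and i + 1) follows the worked solution. The non-integer case is not shown in these slides, so the rule used here (round i up and take the value in that position) is an assumption based on a common textbook convention. The function name `percentile` is hypothetical.

```python
def percentile(values, p):
    """Return the pth percentile (p an integer with 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("this sketch only handles 0 < p < 100")
    data = sorted(values)  # Step 1: ascending order
    n = len(data)
    # Step 2: i = (p / 100) * n, kept in integer arithmetic so that
    # "is i an integer?" can be tested exactly.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # Step 3, integer i: mean of positions i and i + 1 (1-based).
        return (data[whole - 1] + data[whole]) / 2
    # ASSUMPTION: non-integer i is rounded up to the next position.
    return data[whole]


scores = [78, 56, 89, 44, 65, 94, 81, 62, 75, 85, 30, 68]
p75 = percentile(scores, 75)
print(p75)        # 83.0
print(85 >= p75)  # True: Yolanda's score is at or above the 75th percentile
```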
Outliers • An outlier is an extremely large or extremely small data value relative to the rest of the data set • It may represent a data entry error, or it may be genuine data • One criterion: a data value that lies farther than three standard deviations from the mean
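The three-standard-deviation criterion can be sketched as follows. The slide does not say which standard deviation is meant, so using the sample standard deviation (`statistics.stdev`) is an assumption; the function name `three_sd_outliers` and the data are made up for illustration.

```python
import statistics


def three_sd_outliers(values, threshold=3.0):
    """Return values farther than `threshold` standard deviations from the mean."""
    center = statistics.mean(values)
    # ASSUMPTION: the sample standard deviation is used here.
    s = statistics.stdev(values)
    return [x for x in values if abs(x - center) > threshold * s]


# Hypothetical data: twenty values near 50 plus one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [120]
print(three_sd_outliers(data))  # [120]
```

Note that the extreme value itself inflates both the mean and the standard deviation, so in very small data sets no value can reach three standard deviations from the mean.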
Summary • Measures of position tell us the position that a particular data value holds relative to the rest of the data set. • The pth percentile of a data set is the value at which p percent of the values in the data set are less than or equal to this value.
3.6 Robust Measures Objectives: By the end of this section, I will be able to… • Find quartiles and the interquartile range. • Calculate the five-number summary of a data set. • Construct a boxplot for a given data set. • Apply robust detection of outliers.
Quartiles • Divide the data set into quarters FIGURE 3.31
Quartiles • Each part contains 25% of the data. • The first quartile (Q1) is the 25th percentile. • The second quartile (Q2) is the 50th percentile, that is, the median. • The third quartile (Q3) is the 75th percentile. • For small data sets, the division may be into four parts of only approximately equal size.
Example 3.35 - Finding the quartiles for a small data set: the dance audition scores In Example 3.24 (page 126) we examined the dance scores of 12 students auditioning for admission into a prestigious graduate school of the arts. Recall that we found the 75th percentile of the dance audition scores to be 83. By definition, the 75th percentile is the third quartile Q3. Therefore, this score of 83 is also the third quartile (Q3) of the audition scores. Now we will find the first quartile and the median (second quartile).
Example 3.35 continued FIGURE 3.34 The quartiles for the dance audition data.
Interquartile Range • Interquartile range (IQR) is a robust measure of variability. • IQR = Q3 - Q1 • The interquartile range is interpreted to be the spread of the middle 50% of the data.
Example 3.37 - Finding the interquartile range for the dance audition scores In Example 3.35, we found that, for the dance audition score data, Q1 = 59 and Q3 = 83. Find the IQR for the dance score data and explain what it means.
Example 3.37 continued Solution • IQR = Q3 - Q1 = 83 - 59 = 24 • We would say that the middle 50%, or middle half, of the dance audition scores ranged over 24 points (see Figure 3.38). FIGURE 3.38 The IQR for the dance audition data.
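The quartiles and the IQR for the dance audition scores can be reproduced with a short sketch that treats Q1, Q2, and Q3 as the 25th, 50th, and 75th percentiles. The helper `percentile` is hypothetical, and its handling of a non-integer position (rounding up) is an assumption; for these 12 scores every position is an integer, so only the rule from the worked example is exercised.

```python
def percentile(values, p):
    """Return the pth percentile (p an integer with 0 < p < 100)."""
    data = sorted(values)
    n = len(data)
    whole, remainder = divmod(p * n, 100)  # i = (p / 100) * n
    if remainder == 0:
        # Integer i: mean of the values in positions i and i + 1 (1-based).
        return (data[whole - 1] + data[whole]) / 2
    # ASSUMPTION: non-integer i is rounded up to the next position.
    return data[whole]


scores = [78, 56, 89, 44, 65, 94, 81, 62, 75, 85, 30, 68]
q1 = percentile(scores, 25)
q2 = percentile(scores, 50)  # the median
q3 = percentile(scores, 75)
iqr = q3 - q1

print(q1, q2, q3)  # 59.0 71.5 83.0
print(iqr)         # 24.0
```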