360 likes | 542 Views
Chapter 4: Summary Statistics. In Chapter 4:. The prior chapter used stemplots and histograms to look at the shape, location, and spread of a distribution. This chapter uses numerical summaries for similar purposes. Summary Statistics. Central location Mean Median Mode Spread
E N D
In Chapter 4: The prior chapter used stemplots and histograms to look at the shape, location, and spread of a distribution. This chapter uses numerical summaries for similar purposes.
Summary Statistics • Central location • Mean • Median • Mode • Spread • Range and interquartile range (IQR) • Variance and standard deviation • Shape summaries • seldom used in practice • not covered
Notation • n sample size • X the variable (e.g., ages of subjects) • xi the value of individual i for variable X • sum all values (capital sigma) • Illustrative data (ages of participants): 21 42 5 11 30 50 28 27 24 52 n= 10 X = AGE variable x1= 21, x2= 42, …, x10= 52 xi = x1 + x2 + … + x10= 21 + 42 + … + 52 = 290
§4.1: Central Location: Sample Mean • “Arithmetic average” • Traditional measure of central location • Sum the values and divide by n • “xbar” refers to the sample mean
Example: Sample Mean Ten individuals selected at random have the following ages: 21 42 5 11 30 50 28 27 24 52 Note that n = 10, xi = 21 + 42 + … + 52 = 290, and The sample mean is the gravitational center of a distribution
Uses of the Sample Mean The sample mean can be used to predict: • The value of an observation drawn at random from the sample • The value of an observation drawn at random from the population • The population mean
Population Mean • Same operation as sample mean except based on entire population (N ≡ population size) • Conceptually important • Usually not available in practice • Sometimes referred to as the expected value
§4.2 Central Location: Median The median is the value with a depthof (n+1)/2 When n is even, average the two values that straddle a depth of (n+1)/2 For the 10 values listed below, the median has depth (10+1) / 2 = 5.5, placing it between 27 and 28. Average these two values to get median = 27.5 05 11 21 24 27 28 30 42 50 52median Average the adjacent values: M = 27.5
More Examples of Medians • Example A: 2 4 6 Median = 4 • Example B: 2 4 6 8 Median = 5 (average of 4 and 6) • Example C: 6 2 4 Median 2 (Values must be ordered first)
The Median is Robust The median is more resistant to skews and outliers than the mean; it is more robust. This data set has a mean of 1636: 1362 1439 1460 1614 1666 1792 1867 Here’s the same data set with a data entry error “outlier” (highlighted). This data set has a mean of 2743: 1362 1439 1460 1614 1666 1792 9867 The median is 1614 in both instances, demonstrating its robustness in the face of outliers.
§4.3: Mode • The mode is the most commonly encountered value in the dataset • This data set has a mode of 7 {4, 7, 7, 7, 8, 8, 9} • This data set has no mode {4, 6, 7, 8} (each point appears only once) • The mode is useful only in large data sets with repeating values
§4.4: Comparison of Mean, Median, Mode Note how the mean gets pulled toward the longer tail more than the median mean = median → symmetrical distrib mean > median → positive skew mean < median → negative skew
§4.5 Spread: Quartiles • Two distributions can be quite different yet can have the same mean • This data compares particulate matter in air samples (μg/m3) at two sites. Both sites have a mean of 36, but Site 1 exhibits much greater variability. We would miss the high pollution days if we relied solely on the mean. Site 1| |Site 2---------------- 42|2| 8|2| 2|3|234 86|3|6689 2|4|0 |4| |5| |5| |6| 8|6| ×10
Spread: Range • Range = maximum – minimum • Illustrative example: Site 1 range = 86 – 22 = 64 Site 2 range = 40 – 32 = 8 • Beware: the sample range will tend to underestimate the population range. • Always supplement the range with at least one addition measure of spread Site 1| |Site 2---------------- 42|2| 8|2| 2|3|234 86|3|6689 2|4|0 |4| |5| |5| |6| 8|6| ×10
Spread: Quartiles • Quartile 1 (Q1): cuts off bottom quarter of data = median of the lower half of the data set • Quartile 3 (Q3): cuts off top quarter of data = median of the to half of the data set • Interquartile Range (IQR) = Q3 – Q1 covers the middle 50% of the distribution 05 11 21 24 27 28 30 42 50 52 Q1 median Q3 Q1 = 21, Q3 = 42, and IQR = 42 – 21 = 21
Quartiles (Tukey’s Hinges) – Example 2Data are metabolic rates (cal/day), n = 7 When n is odd, include the median in both halves of the data set. Bottom half: 1362 1439 1460 1614 which has a median of 1449.5 (Q1) Top half: 1614 1666 1792 1867which has a median of 1729 (Q3) 1362 1439 1460 1614 1666 1792 1867 median
Five-Point Summary • Q0 (the minimum) • Q1 (25th percentile) • Q2 (median) • Q3 (75th percentile) • Q4 (the maximum)
§4.6 Boxplots • Calculate 5-point summary. Draw box from Q1 to Q3 w/ line at median • Calculate IQR and fences as follows:FenceLower = Q1 – 1.5(IQR)FenceUpper = Q3 + 1.5(IQR)Do not draw fences • Determine if any values lie outside the fences (outside values). If so, plot these separately. • Determine values inside the fences (inside values)Draw whisker from Q3 to upper inside value.Draw whisker from Q1 to lower inside value
60 Upper inside = 52 50 Q3 = 42 40 30 Q2 = 27.5 Q1 = 21 20 10 Lower inside = 5 0 Illustrative Example: Boxplot Data: 05 11 21 24 27 28 30 42 50 52 • 5 pt summary: {5, 21, 27.5, 42, 52};box from 21 to 42 with line @ 27.5 • IQR = 42 – 21 = 21.FU = Q3 + 1.5(IQR) = 42 + (1.5)(21) = 73.5FL = Q1 – 1.5(IQR) = 21 – (1.5)(21) = –10.5 • None values above upper fence None values below lower fence • Upper inside value = 52Lower inside value = 5Draws whiskers
Illustrative Example: Boxplot 2 Data: 3 21 22 24 25 26 28 29 31 51 5-point summary: 3, 22, 25.5, 29, 51: draw box IQR = 29 – 22 = 7FU = Q3 + 1.5(IQR) = 28 + (1.5)(7) = 39.5FL = Q1 – 1.5(IQR) = 22 – (1.5)(7) = 11.6 One above top fence (51)One below bottom fence (3) Upper inside value is 31Lower inside value is 21Draw whiskers
Illustrative Example: Boxplot 3 Seven metabolic rates: 1362 1439 1460 1614 1666 1792 1867 5-point summary: 1362, 1449.5, 1614, 1729, 1867 IQR = 1729 – 1449.5 = 279.5 FU = Q3 + 1.5(IQR) = 1729 + (1.5)(279.5) = 2148.25 FL = Q1 – 1.5(IQR) = 1449.5 – (1.5)(279.5) = 1030.25 3. None outside 4. Whiskers end @ 1867 and 1362
Boxplots: Interpretation • Location • Position of median • Position of box • Spread • Hinge-spread (IQR) • Whisker-to-whisker spread • Range • Shape • Symmetry or direction of skew • Long whiskers (tails) indicate leptokurtosis
Side-by-side boxplots Boxplots are especially useful when comparing groups
This data set has a mean of 36. The data point 33 has a deviation of 33 – 36 = −3. The data point 40 has a deviation of 40 – 36 = 4. §4.7 Spread: Standard Deviation • Most common descriptive measures of spread • Based on deviations around the mean. • This figure demonstrates the deviations of two of its values
Deviation = Sum of squared deviations = Sample variance = Sample standard deviation = Variance and Standard Deviation
Sum of Squares Standard deviation (formula) Sample standard deviation s is the estimator of population standard deviation . See “Facts About the Standard Deviation” p. 80.
Illustrative Example: Standard Deviation (p. 79) * Sum of deviations always equals zero
Illustrative Example (cont.) Sample variance (s2) Standard deviation (s)
Interpretation of Standard Deviation • Measure spread (e.g., if group was s1 = 15 and group 2 s2 = 10, group 1 has more spread, i.e., variability) • 68-95-99.7 rule (next slide) • Chebychev’s rule (two slides hence)
68-95-99.7 Rule Normal Distributions Only! • 68% of data in the range μ ± σ • 95% of data in the range μ ± 2σ • 99.7% of data the range μ ± 3σ • Example. Suppose a variable has a Normal distribution with = 30 and σ = 10. Then: 68% of values are between 30 ± 10 = 20 to 40 95% are between 30 ± (2)(10) = 30 ± 20 = 10 to 50 99.7% are between 30 ± (3)(10) = 30 ± 30 = 0 to 60
Chebychev’s Rule All Distributions • Chebychev’s rule says that at least 75% of the values will fall in the range μ ± 2σ (for any shaped distribution) • Example: A distribution with μ= 30 and σ = 10 has at least 75% of the values in the range 30 ± (2)(10) = 10 to 50
Rules for Rounding • Carry at least four significant digits during calculations. (Click here to learn about significant digits.) • Round as last step of operation • Avoid pseudo-precision • When in doubt, use the APA Publication Manual • Always report units Always use common sense and good judgment.
Choosing Summary Statistics • Always report a measure of central location, a measure of spread, and the sample size • Symmetrical mound-shaped distributions report mean and standard deviation • Odd shaped distributions report 5-point summaries (or median and IQR)
Software and Calculators Use software and calculators to check work.