
Week 2 September 8-12

Week 2 September 8-12. Five Mini-Lectures QMM 510 Fall 2014 . Chapter Contents 4.1 Numerical Description 4.2 Measures of Center 4.3 Measures of Variability 4.4 Standardized Data 4.5 Percentiles, Quartiles, and Box Plots 4.6 Correlation and Covariance 4.7 Grouped Data

daryl

Presentation Transcript


  1. Week 2 September 8-12 Five Mini-Lectures QMM 510 Fall 2014

  2. Chapter Contents 4.1 Numerical Description 4.2 Measures of Center 4.3 Measures of Variability 4.4 Standardized Data 4.5 Percentiles, Quartiles, and Box Plots 4.6 Correlation and Covariance 4.7 Grouped Data 4.8 Skewness and Kurtosis Describing Data Numerically ML 2.1 Chapter 4 So many topics, so little time …

  3. Center, Variability, Shape Chapter 4 Three key characteristics of numerical data:

  4. Visual Description Chapter 4

  5. Measures of Center Chapter 4 Mean • A familiar measure of center • Excel function =AVERAGE(Data) where Data is an array of data values.

  6. Measures of Center Chapter 4 Median • The median (M) is the 50th percentile or midpoint of the sorted sample data. • M separates the upper and lower halves of the sorted observations. • If n is odd, the median is the middle observation in the data array. • If n is even, the median is the average of the middle two observations in the data array.
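
The odd/even rule for the median can be sketched in a few lines of Python. This is only an illustration; the function name `median` and the sample values are made up here and are not part of the lecture.

```python
def median(values):
    """Return the median using the odd/even rule described on the slide."""
    data = sorted(values)          # the rule applies to the sorted data array
    n = len(data)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:                 # n odd: the middle observation
        return data[mid]
    # n even: average of the two middle observations
    return (data[mid - 1] + data[mid]) / 2


odd_result = median([7, 1, 5])       # sorted: 1, 5, 7
even_result = median([7, 1, 5, 3])   # sorted: 1, 3, 5, 7
```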

  7. Measures of Center Chapter 4 Mode • The most frequently occurring data value. • Familiar and easy to understand. • But data may have multiple modes or no mode. • Most useful for discrete or categorical data with only a few values. Rarely useful for continuous data or data with a wide range. Example: Revenue growth in 32 bio-tech companies last year. Caution: In decimal data, some data values may occur more than once, but this is likely due to chance (not central tendency). Excel’s =MODE(Data) returns only the first mode (1.71 in this example).

  8. Measures of Center Chapter 4 • Compare mean and median or look at the histogram to determine degree of skewness. • Figure 4.10 shows prototype population shapes showing varying degrees of skewness.

  9. Measures of Center Chapter 4 Geometric Mean • The geometric mean (G) is a multiplicative average. In Excel =GEOMEAN(Data) or =(2*3*7*9*10*12)^(1/6) Growth Rates A variation on the geometric mean used to find the average growth rate for a time series.
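
As a rough cross-check of the Excel formula above, the geometric mean can be computed in Python as the n-th root of the product, or equivalently through logarithms. The helper name `geometric_mean` is an assumption for this sketch; the six values are the ones in the slide's Excel example.

```python
import math


def geometric_mean(values):
    """Multiplicative average: n-th root of the product of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    # Averaging the logs avoids overflow when the product is very large.
    return math.exp(sum(math.log(v) for v in values) / len(values))


g = geometric_mean([2, 3, 7, 9, 10, 12])
direct = (2 * 3 * 7 * 9 * 10 * 12) ** (1 / 6)   # same form as the Excel formula
```

Both expressions give roughly 5.97 for these six values.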

  10. Measures of Center Chapter 4 Growth Rates • For example, from 2006 to 2010, JetBlue Airlines revenues (figures not reproduced here) give an average growth rate of 12.5% per year.
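
The revenue figures and the formula did not survive in this transcript. A common form of the average growth rate over n observations is (last / first)^(1/(n − 1)) − 1, and the sketch below assumes that form. The function name and the index values are hypothetical and are not JetBlue's revenues.

```python
def average_growth_rate(first, last, n_observations):
    """Geometric average growth per period: (last / first) ** (1 / (n - 1)) - 1."""
    if n_observations < 2:
        raise ValueError("need at least two observations")
    if first <= 0 or last <= 0:
        raise ValueError("first and last values must be positive")
    return (last / first) ** (1 / (n_observations - 1)) - 1


# Hypothetical index values (NOT JetBlue data): 5 yearly observations,
# so there are 4 growth periods between the first and the last.
rate = average_growth_rate(100.0, 160.18, 5)
```

With these made-up values the rate comes out at about 0.125, that is, 12.5 percent per year.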

  11. Measures of Center Chapter 4 Midrange • The midrange is the point halfway between the lowest and highest values of X. • Easy to use but sensitive to extreme data values. • For the J.D. Power quality data: • Here, the midrange (126.5) is higher than the mean (114.70) or median (113).

  12. Measures of Center Chapter 4 Trimmed Mean • To calculate the trimmed mean, first remove the highest and lowest k percent of the observations. • For example, for the n = 33 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05). • To determine how many observations to trim, multiply k by n, which is 0.05 x 33 = 1.65 or 2 observations. • So, we would remove the two smallest and two largest observations before averaging the remaining values.
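
The trimming steps above can be sketched as follows. The function name `trimmed_mean` and the 33 sample values are hypothetical (they are not the P/E ratios from the lecture), and rounding k × n to the nearest integer follows the slide's "1.65 or 2" step; other tools may round differently.

```python
def trimmed_mean(values, k):
    """Mean after removing round(k * n) observations from each end of the sorted data."""
    data = sorted(values)
    n = len(data)
    trim = round(k * n)            # e.g. 0.05 * 33 = 1.65, rounded to 2
    if 2 * trim >= n:
        raise ValueError("k is too large: nothing would remain after trimming")
    kept = data[trim:n - trim]     # drop the `trim` smallest and `trim` largest
    return sum(kept) / len(kept)


# 33 hypothetical values with two very high observations
sample = list(range(1, 32)) + [200, 500]
plain = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.05)
```

Here the plain mean is pulled up to about 36 by the two large values, while the 5 percent trimmed mean is 17.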

  13. Measures of Center Chapter 4 Trimmed Mean • Here is a summary of all the measures of central tendency for the J.D. Power data, along with Excel functions. • The trimmed mean mitigates the effects of very high values.

  14. Measures of Variability Chapter 4 Variability is the “spread” of data points about the center of the distribution in a sample. Measures of Variability

  15. Measures of Variability Chapter 4 Population standard deviation Population variance
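
The formulas on this slide are not reproduced in the transcript. The standard definitions are given below, assuming the usual notation of N for the population size and μ for the population mean.

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
\qquad\text{(population variance)}

\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
\qquad\text{(population standard deviation)}
```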

  16. Measures of Variability Chapter 4

  17. Measures of Variability Chapter 4 Coefficient of Variation • Useful for comparing variables measured in different units or with different means. • A unit-free measure of dispersion. • Expressed as a percent of the mean. • Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.

  18. Measures of Variability Chapter 4 Example: Class scores on 16-point quiz on first day of class and after students had an opportunity to review the material. Caution: Only appropriate for nonnegative data. CV is undefined if the mean is zero or negative (this could happen, for example, if stocks in a portfolio had negative rates of return).
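
A minimal sketch of the coefficient of variation, assuming the sample form CV = 100 × s / x̄. The function name and the data are made up for illustration, and the guard reflects the caution that CV is undefined for a zero or negative mean.

```python
import statistics


def coefficient_of_variation(values):
    """CV as a percent of the mean: 100 * s / mean (sample standard deviation)."""
    mean = statistics.mean(values)
    if mean <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * statistics.stdev(values) / mean


# Hypothetical data: mean 20, sample standard deviation 10
cv = coefficient_of_variation([10, 20, 30])
```

For these three values the result is 50, meaning the standard deviation is half the size of the mean.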

  19. Standardized Data ML 2.2 Chapter 4 • Topics • sorting, standardizing, z-scores • normal distribution as a benchmark • Empirical Rule (MegaStat) • outliers and unusual observations • Excel functions (Appendix J) • examples: birth weight, voting • using MegaStat and Minitab

  20. The Empirical Rule Chapter 4 • The normal distribution is symmetric and is also known as the bell-shaped curve. • The Empirical Rule states that for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of observed data: k = 1: 68.26% will lie within μ ± 1σ; k = 2: 95.44% will lie within μ ± 2σ; k = 3: 99.73% will lie within μ ± 3σ.

  21. Standardized Data Chapter 4 The Empirical Rule Note: No upper bound is given. Data values outside μ ± 3σ are rare.
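
The Empirical Rule percentages can be checked against the standard normal distribution with Python's standard library. This is only a sketch, and the helper name `share_within` is an assumption. The exact shares round to 68.27%, 95.45%, and 99.73%; the slightly different last digits quoted in the lecture (68.26% and 95.44%) are what one gets from a four-decimal normal table.

```python
from statistics import NormalDist


def share_within(k):
    """Probability that a normal observation lies within k standard deviations of the mean."""
    z = NormalDist()               # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, p in shares.items():
    print(f"k = {k}: {100 * p:.2f}% within k standard deviations")
```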

  22. Standardized Data Chapter 4 • A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean. Standardization formula for a population: z = (x − μ)/σ. Standardization formula for a sample (for n > 30): z = (x − x̄)/s. A negative z value means the observation is to the left of the mean; a positive z value means the observation is to the right of the mean.

  23. Standardized Data Chapter 4

  24. Standardized Data Chapter 4 Example: Birth Weights (n = 1429) Resembles a normal distribution except for the low tail (a few extremely tiny babies). Source: Birth records from the North Carolina State Center for Health and Environmental Statistics and the Institute for Research in Social Science at University of North Carolina at Chapel Hill. • 5 pound baby’s z-score: z = (80 - 116.14)/21.96 = -1.65 • 9 pound baby’s z-score: z = (144 - 116.14)/21.96 = 1.27 • 11 pound baby’s z-score: z = (176 - 116.14)/21.96 = 2.73
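
The three z-scores above can be reproduced with a short sketch. It uses the mean (116.14 oz) and standard deviation (21.96 oz) from the slide and the weights in ounces (80, 144, and 176); the function and variable names are assumptions for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


MEAN_OZ = 116.14   # mean birth weight from the slide, in ounces
SD_OZ = 21.96      # standard deviation from the slide, in ounces

# Weights in ounces taken from the slide's three calculations
z_by_weight = {oz: round(z_score(oz, MEAN_OZ, SD_OZ), 2) for oz in (80, 144, 176)}
```

Excel's =STANDARDIZE(x, μ, σ) performs the same calculation for a single value.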

  25. Standardized Data Chapter 4 Example: Voting in 2004 Presidential Election. Use Excel’s function =STANDARDIZE(x, μ, σ). Only two states stand out as unusual. Note: Sorting the data values allows you to see the extremes. Values within μ ± 1σ are not less interesting.

  26. Excel Chapter 4 Voting percent in 50 states Note: In Excel’s Descriptive Statistics, you can’t choose the statistics displayed.

  27. MegaStat Chapter 4 Note: You can choose the statistics displayed (e.g., Empirical Rule). Voting percent in 50 states

  28. Appendix J: Excel Functions Chapter 4

  29. Appendix J: Excel Functions Chapter 4

  30. Quantiles ML 2.3 Chapter 4 • Topics • percentiles, quartiles, boxplots • fences, another view of outliers • examples: birth weight, city MPG

  31. Percentiles, Quartiles, and Box-Plots Chapter 4 Percentiles • Percentiles are data that have been divided into 100 groups. For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you. • Deciles are data that have been divided into 10 groups. • Quintiles are data that have been divided into 5 groups. • Quartiles are data that have been divided into 4 groups.

  32. Percentiles, Quartiles, and Box-Plots Chapter 4 Percentiles • Percentiles may be used to establish benchmarks for comparison purposes (e.g. health care, manufacturing, and banking industries use 5th, 25th, 50th, 75th and 90th percentiles). • Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios. • Percentiles can be used in employee merit evaluation and salary benchmarking.

  33. Percentiles, Quartiles, and Box-Plots Chapter 4 Quartiles • Quartiles are scale points that divide the sorted data into four groups of approximately equal size. The three values that separate the four groups are called Q1, Q2, and Q3.

  34. Percentiles, Quartiles, and Box-Plots Chapter 4 Quartiles • The second quartile Q2 is the median, a measure of central tendency.

  35. Percentiles, Quartiles, and Box-Plots Chapter 4 Method of Medians • For small data sets, find quartiles using the method of medians: Step 1: Sort the observations. Step 2: Find the median Q2. Step 3: Find the median of the data values that lie below Q2. Step 4: Find the median of the data values that lie above Q2.

  36. For first half of data, 50% above, 50% below Q1. For second half of data, 50% above, 50% below Q3. Percentiles, Quartiles, and Box-Plots Chapter 4 Quartiles – The method of medians • The first quartile Q1 is the median of the data values below Q2 • The third quartile Q3 is the median of the data values above Q2.
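
The method of medians can be sketched as below. The function name and the test data are made up, and one detail is an assumption: the data are split by position, so when n is odd the middle observation itself is left out of both halves.

```python
import statistics


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)                      # Step 1: sort the observations
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    q2 = statistics.median(data)               # Step 2: the median Q2
    lower = data[:half]                        # values below Q2 (by position)
    # For odd n the middle observation is excluded from the upper half too.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    q1 = statistics.median(lower)              # Step 3: median of the lower half
    q3 = statistics.median(upper)              # Step 4: median of the upper half
    return q1, q2, q3


even_case = quartiles_by_medians([1, 2, 3, 4, 5, 6, 7, 8])
odd_case = quartiles_by_medians([1, 2, 3, 4, 5, 6, 7])
```

Software packages often use interpolation formulas instead, so their quartiles can differ slightly from the method of medians on small samples.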

  37. Percentiles, Quartiles, and Box-Plots Chapter 4 Method of Medians Example:

  38. Percentiles, Quartiles, and Box-Plots Chapter 4 Box Plots • A useful tool of exploratory data analysis (EDA). • Also called a box-and-whisker plot. • Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax • For the previous P/E ratios example: Xmin = 7, Q1 = 27, Q2 = 35.5, Q3 = 40.5, Xmax = 49

  39. Percentiles, Quartiles, and Box-Plots Chapter 4 Box Plots • The box plot is displayed visually, like this.

  40. Percentiles, Quartiles, and Box-Plots Chapter 4 Box Plots

  41. Percentiles, Quartiles, and Box-Plots Chapter 4 Box Plots: Midhinge • The average of the first and third quartiles. The name midhinge derives from the idea that, if the “box” were folded in half, it would resemble a “hinge”.

  42. Percentiles, Quartiles, and Box-Plots Chapter 4 Box Plots: Fences and Unusual Data Values • Use quartiles to detect unusual data points. • These points are called fences and can be found using the following formulas: • Values outside the inner fences are unusual while those outside the outer fences are outliers.
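
The fence formulas did not survive in this transcript. The sketch below assumes the common convention of placing the inner fences 1.5 interquartile ranges and the outer fences 3.0 interquartile ranges beyond Q1 and Q3; the multipliers, function names, and the two extra test values (65 and 90) are assumptions. The quartiles Q1 = 27 and Q3 = 40.5 are the P/E values from the five-number summary.

```python
def fences(q1, q3, inner=1.5, outer=3.0):
    """Inner and outer fences as (lower, upper) pairs, based on the interquartile range."""
    iqr = q3 - q1
    return {
        "inner": (q1 - inner * iqr, q3 + inner * iqr),
        "outer": (q1 - outer * iqr, q3 + outer * iqr),
    }


def classify(x, f):
    """Label a value using the slide's terms: ordinary, unusual, or outlier."""
    lo_in, hi_in = f["inner"]
    lo_out, hi_out = f["outer"]
    if x < lo_out or x > hi_out:
        return "outlier"           # outside the outer fences
    if x < lo_in or x > hi_in:
        return "unusual"           # outside the inner fences only
    return "ordinary"


pe_fences = fences(27, 40.5)       # Q1 and Q3 from the P/E five-number summary
labels = {x: classify(x, pe_fences) for x in (7, 49, 65, 90)}
```

Under these assumptions the inner fences are 6.75 and 60.75, so the P/E minimum (7) and maximum (49) both fall inside them.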

  43. Box-Plots with Fences Chapter 4 Example: Birth Weights (n = 1429) Note: The middle 50% of birth weights lie within a small range (105 to 130 oz, or about 6.56 lb to 8.13 lb). But there are extremes on the low end. Source: Birth records from the North Carolina State Center for Health and Environmental Statistics and the Institute for Research in Social Science at University of North Carolina at Chapel Hill.

  44. Box-Plots with Fences Chapter 4 Fences Visualized: Fences Example: Interpretation: There are three outliers (beyond the inner upper fence). One is on the border of the upper outer fence, so is almost an extreme outlier. Lower fences are not displayed since they are irrelevant for this sample.

  45. Box-Plots with Fences Chapter 4 Example: Fences and Unusual Data Values Outlier Interpretation: Based on the fences, there is only one outlier and no extreme outliers. Lower fences are not displayed since they are not needed for this sample.

  46. Correlation, Grouped Data, Shape ML 2.4 Chapter 4 • Topics • scatter plots • correlation coefficient • covariance – population, sample • mean from grouped data • skewness, kurtosis (Excel)

  47. Correlation and Covariance Chapter 4 Correlation Coefficient The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y. Note: -1 ≤ r ≤ +1, where r = -1 is perfect negative correlation and r = +1 is perfect positive correlation.
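
The formula for r is not reproduced in this transcript. The sketch below assumes the usual definition: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name and the paired data are hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data
r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])      # points on a rising line
r_down = correlation([1, 2, 3, 4], [8, 6, 4, 2])    # points on a falling line
r_mixed = correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

The first two cases sit at the extremes of +1 and −1, and the mixed case falls strictly between them.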

  48. Correlation and Covariance Chapter 4 Illustration of Correlation Coefficients

  49. Correlation and Covariance Chapter 4 Correlation Coefficient: Examples Note: -1 ≤ r ≤ +1 The sample correlation coefficient describes the degree of linearity between paired observations on two quantitative variables X and Y. X = gestation (months), Y = birth weight (oz) X = car weight (lbs), Y = city MPG

  50. Correlation and Covariance Chapter 4 Correlation Coefficient: Example Note: -1 ≤ r ≤ +1 The sample correlation coefficient describes the degree of linearity between paired observations on two quantitative variables X and Y.
