
Those who don’t know statistics are condemned to reinvent it… David Freedman


lazzaro

Presentation Transcript


  1. Those who don’t know statistics are condemned to reinvent it… David Freedman

  2. All you ever wanted to know about the histogram and more ...

  3. [Histogram: Distribution of number of graphics on web pages (N = 1873). Mean = 17.93, Median = 16.00, Std. Dev = 17.92. Horizontal axis: Graphic Count.]

  4. Horizontal Scale

  5. [Histogram: Distribution of Redundant Link % on web pages (N = 1861). Mean = 22.1, Median = 14, Std. Dev = 37.33.]

  6. Plotting a histogram: endpoint convention, plot frequencies, make equal intervals, etc.

  7. Frequency Table convention: include the left endpoint in the class interval
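The left-endpoint convention can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table`, the interval width, and the sample values are hypothetical and do not come from the slides.

```python
from collections import Counter


def frequency_table(values, width, start=0.0):
    """Count values into equal-width class intervals [left, right).

    Each interval includes its left endpoint and excludes its right one.
    """
    counts = Counter()
    for v in values:
        # Floor division sends a value that sits exactly on a boundary
        # into the interval that starts at that boundary.
        index = int((v - start) // width)
        left = start + index * width
        counts[left] += 1
    return dict(sorted(counts.items()))


# Hypothetical data: 10 and 20 both sit exactly on a class boundary.
table = frequency_table([0, 5, 10, 10, 19.9, 20], width=10)
for left, count in table.items():
    print(f"[{left:g}, {left + 10:g}): {count}")
```

With this convention, a value of 10 is counted in the interval 10 to 20 and never in 0 to 10, so no observation is counted twice.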

  8. Frequency/Probability

  9. Frequency / probability: No of fonts used on a web-page
      No of fonts   Frequency   Probability
      1             110         .06
      3             430         .22
      5             860         .45
      7             280         .15
      9             180         .09
      11            40          .02
      13            20          .01
      15            10          .01
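The probability values on this slide appear to be the frequencies divided by their total. A short Python sketch of that conversion, using the slide's frequency values (the variable names are assumptions for illustration):

```python
# Frequencies as listed on the slide, one per font-count class.
fonts = [1, 3, 5, 7, 9, 11, 13, 15]
frequency = [110, 430, 860, 280, 180, 40, 20, 10]

# Relative frequency: each count divided by the total number of pages.
total = sum(frequency)
probability = [f / total for f in frequency]

for n_fonts, f, p in zip(fonts, frequency, probability):
    print(f"{n_fonts:>2} fonts: frequency {f:>3}, probability {p:.2f}")
```

Because the two rows differ only by a constant factor, the histogram has the same shape on either vertical scale.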

  10. Cleaning up a histogram: getting rid of outliers

  11. [Histogram: Distribution of word count (N = 1903). Mean = 393.2, Median = 223, Std. Dev = 725.24, Minimum = 0, Maximum = 20,357.]

  12. [Histogram: Distribution of word count (N = 1897), top six removed. Mean = 368.0, Median = 223, Std. Dev = 474.04, Minimum = 0, Maximum = 4132.]

  13. [Histogram: Distribution of word count (N = 1873). Mean = 333.4, Median = 220, Std. Dev = 360.30, Minimum = 0, Maximum = 4132. Horizontal axis: WORDCNT2.]

  14. What can histograms tell you

  15. [Histogram: Distribution of link count on good & bad web-pages. Series: Good Sites, Bad Sites.]

  16. Making inferences from histograms: Incidence of riots and temperature [Histogram; horizontal axis: temperature, 30 to 110.]

  17. Mean and Median
      Mean is the arithmetic average; median is the 50% point. The mean is the point where the graph balances.
      • The mean shifts around
      • The median does not shift much and is more stable
      • Computing the median (with the numbers sorted):
        • for odd-numbered N, find the middle number
        • for even-numbered N, interpolate between the middle two, e.g. if they are 7 and 9, then 8 is the median
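The median rule above can be sketched in Python as follows. The function name `median` and the example lists are illustrative assumptions; only the 7-and-9 case comes from the slide.

```python
def median(values):
    """Return the 50% point of a list of numbers."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)  # the rule only works on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the single middle number.
        return ordered[mid]
    # Even N: halfway between the two middle numbers.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([9, 1, 7]))      # odd N: middle number
print(median([3, 7, 9, 11]))  # even N: halfway between 7 and 9
```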

  18. The instability of means and standard deviations

  19. Add two numbers: watch the mean, median, & SD

  20. Add one outlier...

  21. Standard Deviation: a measure of spread

  22. Same mean, different spread [Figure: two distributions, each labeled SD.]

  23. The Standard Deviation

  24. The SD says how far away numbers on a list are from their average. Most entries on the list will be somewhere around one SD away from the average. Very few will be more than two or three SDs away.

  25. Understanding the standard deviation
      Let's start with a list: 1, 2, 2, 3 [Histogram; vertical axis: 0%, 25%, 50%.]
      The histogram is symmetric about 2: 2 is the mean, with 50% to the left of 2 and 50% to the right.

  26. [Three histograms; vertical axis: 0%, 25%, 50%.]
      List: 1, 2, 2, 3   Average = 2     SD = .8
      List: 1, 2, 2, 5   Average = 2.5   SD = 1.73
      List: 1, 2, 2, 7   Average = 3     SD = 2.71
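The three lists on this slide show how one changing value moves the mean and SD while the median stays put. A hedged Python check using the standard library's `statistics` module; it assumes the slide's SD divides by N-1, which matches the figures shown, and the helper name `summarize` is an illustrative assumption:

```python
import statistics


def summarize(values):
    """Return (mean, median, sample SD) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),  # divides by N-1
    )


# The three lists from the slide: only the last entry changes.
for values in ([1, 2, 2, 3], [1, 2, 2, 5], [1, 2, 2, 7]):
    mean, med, sd = summarize(values)
    print(f"{values}: mean={mean:g}, median={med:g}, SD={sd:.2f}")
```

The median is 2 for all three lists, while the SD more than triples as the last value moves from 3 to 7.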

  27. Computing the standard deviation
      List: 20, 10, 15, 15
      Average = 15
      Find deviations from average = 5, -5, 0, 0
      Square the deviations and add them: $(5)^2 + (-5)^2 + (0)^2 + (0)^2 = 50$
      Divide by N-1: $50/3 = 16.67$
      Take the square root: $\sqrt{16.67} = 4.08$
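The same steps can be written out in Python. This is a sketch that mirrors the slide's N-1 recipe; the function name `sample_sd` is an assumption made for illustration.

```python
import math


def sample_sd(values):
    """Standard deviation computed step by step, dividing by N-1."""
    n = len(values)
    average = sum(values) / n
    deviations = [v - average for v in values]   # 5, -5, 0, 0 for the slide's list
    sum_of_squares = sum(d * d for d in deviations)
    variance = sum_of_squares / (n - 1)
    return math.sqrt(variance)


print(round(sample_sd([20, 10, 15, 15]), 2))  # 4.08
```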

  28. Properties of the standard deviation
      • The standard deviation is in the same units as the mean
      • The sample standard deviation is a biased measure of spread: it tends to underestimate the population SD, and the bias shrinks as the sample size grows
      • In normally distributed data, about 68% of the sample lies within 1 SD of the mean

  29. Properties of the Normal Probability Curve
      • The graph is symmetric about the mean (the part to the right is a mirror image of the part to the left)
      • The total area under the curve equals 100%
      • The curve is always above the horizontal axis
      • It appears to stop after a certain point (the curve gets really low)

  30. [Normal curve: 1 SD = 68%, 2 SD = 95%, 3 SD = 99.7%]
      • The graph is symmetric about the mean
      • The total area under the curve equals 100%
      • Mean ± 1 SD covers about 68% of the area
      • Mean ± 2 SD covers about 95%
      • Mean ± 3 SD covers about 99.7%
      • You can disregard the rest of the curve
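The 68%, 95%, and 99.7% figures can be checked numerically. For a normal curve, the area within k SDs of the mean equals erf(k / sqrt(2)); the sketch below uses Python's `math.erf`, and the function name `coverage_within` is an illustrative assumption.

```python
import math


def coverage_within(k):
    """Fraction of a normal curve's area within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"mean ± {k} SD: {coverage_within(k):.1%}")
```

The slide's figures are these values rounded to convenient percentages.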

  31. [Histogram: Distribution of judges' ratings for the Webby Awards. Mean = 6.3, Median = 6.3, Std. Dev = 1.98, N = 1867, Skewness = -.43, Kurtosis = -.201.]

  32. It is a remarkable fact that many histograms in real life tend to follow the Normal Curve. For such histograms, the mean and SD are good summary statistics. The average pins down the center, while the SD gives the spread. For histograms which do not follow the Normal Curve, the mean and SD are not good summary statistics. What happens when the histogram is not normal?

  33. [Histogram: Distribution of word count on web pages. Mean = 348.3, Std. Dev = 384.83.]
      Mean ± 3 SD: 3 SD = (384 * 3) = 1152
      Mean - 1152 is negative, so under the normal curve about 30% of the sample would have a negative number of links

  34. When the SD is influenced by outliers, use the interquartile range: 75th percentile - 25th percentile. Note: a percentile is a score below which a certain % of the sample falls.
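A small Python sketch of why the interquartile range holds up better than the SD when an outlier appears. The two lists are hypothetical, and `statistics.quantiles` with `method="inclusive"` is only one of several common percentile conventions, so other software may give slightly different quartiles.

```python
import statistics


def interquartile_range(values):
    """75th percentile minus 25th percentile."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical lists: identical except for one extreme value.
base = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

for values in (base, with_outlier):
    print(
        f"IQR={interquartile_range(values):g}, "
        f"SD={statistics.stdev(values):.2f}"
    )
```

The IQR is the same for both lists because the outlier lies outside the middle half of the data, while the SD grows sharply.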

  35. Measures of Normality
      • Visual examination
      • Skewness: measure of symmetry
      [Figure panels: Symmetric, Positively Skewed, Negatively Skewed]

  36. Kurtosis: Does it cluster in the middle?
      • Kurtosis is based on a distribution's tails
      • Distributions with a large tail: leptokurtic
      • Distributions with a small tail: platykurtic
      • Distributions with a normal tail: mesokurtic
      [Figure panels: Large tail, Small tail, Normal tail]
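Skewness and kurtosis can both be computed from central moments. The sketch below uses the simple moment-based definitions, with kurtosis reported as excess kurtosis so that a normal curve scores 0. This is an assumption: the statistics on the slides may come from a package that applies small-sample corrections, so its numbers need not match these formulas exactly. The function names are illustrative.

```python
def central_moment(values, k):
    """Average of (value - mean) ** k over the list."""
    average = sum(values) / len(values)
    return sum((v - average) ** k for v in values) / len(values)


def skewness(values):
    """Third central moment scaled by SD cubed; 0 for a symmetric list."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment scaled by variance squared, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


print(skewness([1, 2, 2, 3]))  # symmetric list
print(skewness([1, 2, 2, 7]))  # long right tail gives a positive value
print(excess_kurtosis([1, 2, 2, 3]))
```

Positive skewness indicates a longer right tail and negative skewness a longer left tail; positive excess kurtosis corresponds to the leptokurtic case and negative to the platykurtic case.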

  37. [Histogram: Positively Skewed and Leptokurtic: Word Count. Mean = 393.2, Median = 223, Std. Dev = 725.24, Skewness = 13.62, Kurtosis = 321.84, N = 1903.]

  38. [Histogram: Distribution of word count (N = 1897), top six removed. Mean = 368.0, Median = 223, Std. Dev = 474.04, Skewness = 3.49, Kurtosis = 16.40.]

  39. Degrees of Freedom
      • The number of independent pieces of information remaining after estimating one or more parameters
      • Example: List = 1, 2, 3, 4; Average = 2.5
      • For the average to remain the same, three of the numbers can be anything you want; the fourth is fixed
      • New List = 1, 5, 2.5, __; Average = 2.5
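The idea that the last number is fixed once the average is known can be sketched in Python. The function name `forced_last_value` is a hypothetical helper; the inputs are the slide's example.

```python
def forced_last_value(free_values, target_average):
    """Value the final entry must take so the list keeps its average.

    With N entries and a fixed average, the total must equal N * average,
    so only N-1 entries can be chosen freely.
    """
    n = len(free_values) + 1
    return n * target_average - sum(free_values)


# Three freely chosen numbers from the slide, average held at 2.5.
print(forced_last_value([1, 5, 2.5], 2.5))  # 1.5
```

This is why estimating the mean from N numbers leaves N-1 degrees of freedom, the same N-1 used as the divisor when computing the SD.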
