
45-733: lecture 1




  1. 45-733: lecture 1 William B. Vogt

  2. Topics • Administrative matters • What is statistics and why should you care (chapter 1) • Presenting data (chapter 2)

  3. Administrative • Instructor • Bill Vogt • wilibear@andrew.cmu.edu • Office: Hamburg Hall, 2116D • Office phone: (412) 268-1843 • Office hours: • Tuesday, Thursday 5-6pm • By appointment

  4. Administrative • Grading • Homework, midterm, final equal weight • Cooperation • Unlimited cooperation on homeworks • Exams are open book, notes, etc • No cooperation on exams

  5. Administrative • Software • Everything may be done with excel • You may use any software you like • Office hours • Tuesday 5-6pm • Thursday 5-6pm • Others by appointment

  6. Administrative • Web site • http://www.andrew.cmu.edu/course/45-733/index.htm • Lecture PowerPoint slides available by clicking on relevant date’s topic • Homework • Distributed via web site, solutions also • Due in class according to syllabus • Returned in class next meeting

  7. Administrative • Special class meeting • February 19, 8-9:50pm, GSIA 152 • Sections M,F meet simultaneously • Office hours • Tuesday 5-6pm • Thursday 5-6pm • Others by appointment

  8. What is statistics? • Systematic methods to analyze and present numerical information • OR, a systematic way of discussing both our knowledge and our ignorance arising from numerical information

  9. Who cares about statistics? • Increasing relevance of numerical data • Importance of correctly assessing information • Use numerical data to construct good estimates • Know the limitations of the estimates

  10. What is statistics: Systematic • “Most Americans like our product” vs. “59% ± 0.6% prefer our product to our leading competitor’s” • “The economy will contract” vs. “GDP will contract by 2.2% ± 0.4%” • “Women are more likely to buy our product” vs. “Women are 15% ± 3% more likely to buy our product than are men”

  11. What is statistics: Analyze • We want to know about a population • Real populations • Population of people in the US • Population of our customers • Population of units off of our production line • Imaginary populations • Population of ways economy might work • Population of ways consumers might react to our new product

  12. What is statistics: Analyze • What we want to know about population • How big/small is some quantity • Average income • % who approve of G.W. Bush • Market share of Intel in x86 PC processors

  13. What is statistics: Analyze • What we want to know about population • Does quantity differ in different groups • Average income of Northern vs. Southern households • Average income of Target vs. Walmart shoppers • Market share of Intel in desktop vs. mobile x86 PC processors

  14. What is statistics: Analyze • What we want to know about population • How are two/more variables related • As income rises, how much does consumption of Starbucks coffee rise? • As family size rises, how do sensitivities to price and advertising change? • As people age, how does their sensitivity to advertising change?

  15. What is statistics: Analyze • How can we know these things? • Collect a census • Accurate information on the whole population • Ask everyone in the US their income • Audit their answers carefully • This is always expensive and often impossible • Imaginary populations? • Parallel universes?

  16. What is statistics: Analyze • How can we know these things? • Collect a sample • Sample: A few members of a population. • Accurate information on the sample • Ask 100 people in the US their income • Audit their answers carefully? • But knowing all about the sample ≠ knowing all about the population!

  17. What is statistics: Analyze • Going from sample to population, we hope for: • A good description of the sample • An estimate (“informed guess”) of what we want to know about the population • A statement about how far off our estimate might be

  18. What is statistics: Analyze • Population and sample, an example • What is avg household income in US? • Completed phone survey of 100 households, asking their incomes • Sample = {$50K, $23K, … , $180K} • Average sampled income is, say, $53K • My estimate of US avg household income is $53K and I am 95% sure that it is in the range $53K ± $3K

  19. What is statistics? • Description of a sample • Analysis: estimation of quantities of interest for a population • Levels • Differences • Relationships • Analysis: statement of how far off the estimates might be

  20. Data description • Topic of chapter 2 is describing data • This is the part of the definition of statistics in which we describe our data • This is also the part of our goals in sampling in which we describe our sample accurately • Topic of the rest of the book/course will be analysis

  21. Data Description: Population and Sample • Population • All of the relevant people/units you are interested in • For example • Population of people in the US • Population of our customers • Population of light bulbs from our production facility

  22. Data Description: Population and Sample • Sample • A subset of a population. Only a few of the units we are interested in. • For example: • Survey of 100 people in the US • Survey of 5 of our customers • 1 in 100,000 of the light bulbs from our production facility

  23. Data Description: Dataset • A dataset is just a group of numbers measuring something • Could be a population • Could be a sample • Examples • A list of all the incomes of all the households in the US • A list of the incomes of 5 of our customers • A list of the time to failure of 500 light bulbs from our production facility

  24. Data Description: Dataset • Notation • Dataset = {4,5,12,3,…,0} • Dataset = {$x_1$, $x_2$, $x_3$, … , $x_N$} • $x_i$ = any arbitrary one of the elements in our dataset • $N$ = the number of elements (observations) in our dataset

  25. Data Description: Dataset • Notation • Example: • Dataset = {4,5,12,3} • $x_1 = 4$ • $x_3 = 12$ • $N = 4$

  26. Data Description: Measures of central tendency • Measures of central tendency • Measure where the “middle” of the data are • Useful if you want to know what an average or typical member of your sample/population looks like

  27. Data Description: Measures of central tendency • Mean • Also known as average • Calculated by adding up all the observations and dividing by the number of observations

  28. Data Description: Measures of central tendency • Mean • Can also be written: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$

  29. Data Description: Measures of central tendency • Mean • Example: • Dataset = {53,45,23,19,87} • Mean = (53+45+23+19+87)/5 = 227/5 = 45.4
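
The mean example above can be checked in a few lines. This is a minimal sketch in Python (the lecture itself only assumes Excel); the variable names are illustrative, not from the slides.

```python
from statistics import mean

# Dataset from the slide-29 example.
dataset = [53, 45, 23, 19, 87]

# Add up all the observations and divide by the number of observations.
manual_mean = sum(dataset) / len(dataset)

# The standard library computes the same quantity.
library_mean = mean(dataset)

print(manual_mean, library_mean)
```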

  30. Data Description: Measures of central tendency • Median • Also known as the 50th percentile • Is the point in the data where half of the observations are greater and half are lesser. • Calculated by sorting the data and choosing the middle value

  31. Data Description: Measures of central tendency • Median • Example: • Dataset = {53,45,23,19,87} • Dataset sorted = {19,23,45,53,87} • Median = 45

  32. Data Description: Measures of central tendency • Median • Example: • Dataset = {53,45,23,19,87,100} • Dataset sorted = {19,23,45,53,87,100} • Median = (45+53)/2 = 49
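
The two median examples (odd and even N) can be sketched as follows. The function name `slide_median` is hypothetical; averaging the two middle values for even N follows the slide-32 example.

```python
def slide_median(data):
    """Median by sorting and taking the middle value (or the mean of the two middle values)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: a single middle observation exists.
        return ordered[mid]
    # Even N: average the two observations that straddle the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = slide_median([53, 45, 23, 19, 87])
even_example = slide_median([53, 45, 23, 19, 87, 100])
print(odd_example, even_example)
```

Python's built-in `statistics.median` applies the same rule, so it can be used instead of a hand-written function.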

  33. Data Description: Measures of central tendency • Percentiles • The 25th percentile is the point at which 25% of the data are lesser and 75% of the data are greater • The 75th percentile is the point at which 75% of the data are lesser and 25% of the data are greater • The Yth percentile is the point at which Y% of the data are lesser and (100-Y)% of the data are greater • The median is the 50th percentile

  34. Data Description: Measures of central tendency • Percentiles • Calculation • Sort the dataset • The 25th percentile is observation (N+1)/4 or 0.25*(N+1) • The 75th percentile is observation 3*(N+1)/4 or 0.75*(N+1) • The Yth percentile is observation (Y/100)*(N+1) • Use interpolation if (Y/100)*(N+1) is not a whole number

  35. Data Description: Measures of central tendency • Percentiles • Calculation • Use interpolation if (Y/100)*(N+1) is not a whole number • If there are 10 observations, and you want the 44th percentile • 0.44*(10+1)=4.84 • So, the 44th percentile will be the number 84% of the way between observations 4 and 5 • 44th percentile = $0.16\,x_4 + 0.84\,x_5$

  36. Data Description: Measures of central tendency • Percentiles • Example • Dataset = {45,23,110,19,87,36,100} • Sorted dataset = {19,23,36,45,87,100,110} • 25th percentile=23 • 75th percentile=100 • 50th percentile (median)=45

  37. Data Description: Measures of central tendency • Percentiles • Example • Dataset = {45,23,110,19,87,36,100} • Sorted dataset = {19,23,36,45,87,100,110} • 82nd percentile? • 0.82*(7+1)=6.56 • 82nd percentile = $0.44\,x_6 + 0.56\,x_7$ • 82nd percentile = 0.44*100 + 0.56*110 = 105.6
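
The (Y/100)*(N+1) rule with interpolation can be sketched as a small function. The name `percentile` and the clamping of positions that fall outside 1..N are assumptions for illustration; the slides do not say what to do in that case. Other software may use different percentile conventions and give slightly different answers.

```python
def percentile(data, y):
    """Yth percentile via the (Y/100)*(N+1) position rule with linear interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    # 1-based position of the percentile within the sorted data.
    position = (y / 100) * (n + 1)
    # Assumption (not from the slides): clamp positions outside 1..N to the ends.
    position = min(max(position, 1), n)
    lower = int(position)          # observation just below (or at) the position
    fraction = position - lower    # how far to move toward the next observation
    if lower == n:
        return ordered[-1]
    # Convert the 1-based observation numbers to Python's 0-based indices.
    return (1 - fraction) * ordered[lower - 1] + fraction * ordered[lower]

dataset = [45, 23, 110, 19, 87, 36, 100]
q25 = percentile(dataset, 25)
q50 = percentile(dataset, 50)
q75 = percentile(dataset, 75)
p82 = percentile(dataset, 82)
print(q25, q50, q75, round(p82, 2))
```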

  38. Data Description: Measures of central tendency • Mode • The most common value in the dataset • Might think of as the most typical value • Example • Dataset = {53,45,45,23,19,87,100} • Mode = 45 • Example • Dataset = {53,45,45,23,19,87,87} • Modes = 45 and 87 (the data are bimodal)
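
Both mode examples can be reproduced with the standard library. This sketch assumes Python 3.8 or later, where `statistics.multimode` returns every most-common value rather than just one.

```python
from statistics import multimode

# One value occurs most often: a single mode.
unimodal = multimode([53, 45, 45, 23, 19, 87, 100])

# Two values tie for most common: the data are bimodal.
bimodal = multimode([53, 45, 45, 23, 19, 87, 87])

print(unimodal, bimodal)
```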

  39. Data Description: Measures of dispersion • Measures of dispersion tell us how “spread out” our data are: • Compare: • Dataset 1 = {53,45,23,19,87} • Dataset 2 = {44,47,43,45,48} • Both have a mean of 45.4 • Dataset 1 is more spread out, however

  40. Data Description: Measures of dispersion • Let’s display the datasets graphically: [Figure: DS1 and DS2 plotted on a number line; axis labels 19, 45, 87]

  41. Data Description: Measures of dispersion • A good measure of dispersion will, for example, be bigger for DS1 than for DS2 [Figure: DS1 and DS2 plotted on a number line; axis labels 19, 45, 87]

  42. Data Description: Measures of dispersion • “Average deviation” • One way to think about dispersion is to ask how far, on average, points are from the mean • Call $d_i$ the deviation from the mean: $d_i = x_i - \bar{x}$

  43. Data Description: Measures of dispersion • “Average deviation” • Dataset 1 = {53,45,23,19,87} • Deviations 1 = {7.6,-0.4,-22.4,-26.4,41.6} • Maybe the average of the $d_i$ will be a good measure of dispersion • (7.6-0.4-22.4-26.4+41.6)/5 = 0 • Hmmm.

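  44. Data Description: Measures of dispersion • “Average deviation” • Does this always happen? • Yes: $\frac{1}{N}\sum_{i=1}^{N} d_i = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x}) = \frac{1}{N}\sum_{i=1}^{N} x_i - \bar{x} = \bar{x} - \bar{x} = 0$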

  45. Data Description: Measures of dispersion • “Average deviation” • What happened? • Average deviation offsets negative deviations against positive deviations. • So observations below the mean make this measure smaller, while observations above the mean make it bigger. • Both kinds should count as positive
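
A quick numerical check of the cancellation described above, as a sketch. The function name `average_deviation` is illustrative. Because of floating-point rounding the result may be a tiny number rather than exactly 0, so the check uses a tolerance.

```python
def average_deviation(data):
    """Average of the signed deviations from the mean."""
    mean = sum(data) / len(data)
    deviations = [x - mean for x in data]
    return sum(deviations) / len(deviations)

dataset_1 = [53, 45, 23, 19, 87]
dataset_2 = [44, 47, 43, 45, 48]

avg_dev_1 = average_deviation(dataset_1)
avg_dev_2 = average_deviation(dataset_2)

# Both are zero up to rounding error, even though dataset_1 is far more spread out.
print(abs(avg_dev_1) < 1e-9, abs(avg_dev_2) < 1e-9)
```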

  46. Data Description: Measures of dispersion • Mean absolute deviation • $\text{MAD} = \frac{1}{N}\sum_{i=1}^{N} |d_i| = \frac{1}{N}\sum_{i=1}^{N} |x_i - \bar{x}|$

  47. Data Description: Measures of dispersion • Mean absolute deviation • Example • Dataset 1 = {53,45,23,19,87} • Deviations 1 = {7.6,-0.4,-22.4,-26.4,41.6} • Absolute Dev 1 = {7.6,0.4,22.4,26.4,41.6} • MAD=(7.6+0.4+22.4+26.4+41.6)/5=19.68

  48. Data Description: Measures of dispersion • Mean absolute deviation • Example • Dataset 2 = {44,47,43,45,48} • Deviations 2 = {-1.4,1.6,-2.4,-0.4,2.6} • Absolute Dev 2 = {1.4,1.6,2.4,0.4,2.6} • MAD=(1.4+1.6+2.4+0.4+2.6)/5=1.68

  49. Data Description: Measures of dispersion • Mean absolute deviation • DS1: MAD = 19.68 • DS2: MAD = 1.68 [Figure: DS1 and DS2 plotted on a number line; axis labels 19, 45, 87]
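
The two MAD examples can be reproduced with a short function. This is a sketch only; the name `mean_absolute_deviation` is illustrative and the values are rounded for display.

```python
def mean_absolute_deviation(data):
    """Average distance of the observations from their mean, ignoring sign."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

mad_1 = mean_absolute_deviation([53, 45, 23, 19, 87])
mad_2 = mean_absolute_deviation([44, 47, 43, 45, 48])

# The more spread-out dataset gets the larger value, as a dispersion measure should.
print(round(mad_1, 2), round(mad_2, 2))
```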

  50. Data Description: Measures of dispersion • Variance • Solves the problem of negative deviations in a different way • Variance calls for the deviations to be squared:
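
The variance formula from this slide did not survive in the transcript, so the divisor is an assumption in the sketch below. It computes the average squared deviation both ways that are in common use: dividing by N (often called the population variance) and by N-1 (often called the sample variance). The function name and the `divisor_offset` parameter are hypothetical.

```python
def squared_deviation_average(data, divisor_offset=0):
    """Sum of squared deviations from the mean, divided by (N - divisor_offset).

    divisor_offset=0 divides by N; divisor_offset=1 divides by N-1.
    Which one the lecture uses is not recoverable from the transcript.
    """
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - divisor_offset)

dataset_1 = [53, 45, 23, 19, 87]
dataset_2 = [44, 47, 43, 45, 48]

var_n_1 = squared_deviation_average(dataset_1)        # divide by N
var_n1_1 = squared_deviation_average(dataset_1, 1)    # divide by N-1
var_n_2 = squared_deviation_average(dataset_2)
var_n1_2 = squared_deviation_average(dataset_2, 1)

print(round(var_n_1, 2), round(var_n1_1, 2), round(var_n_2, 2), round(var_n1_2, 2))
```

Under either divisor, Dataset 1 comes out far larger than Dataset 2, matching the ordering given by the mean absolute deviation.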