45-733: lecture 1 William B. Vogt
Topics • Administrative matters • What is statistics and why should you care (chapter 1) • Presenting data (chapter 2)
Administrative • Instructor • Bill Vogt • wilibear@andrew.cmu.edu • Office: Hamburg Hall, 2116D • Office phone: (412) 268-1843 • Office hours: • Tuesday, Thursday 5-6pm • By appointment
Administrative • Grading • Homework, midterm, final equal weight • Cooperation • Unlimited cooperation on homeworks • Exams are open book, notes, etc • No cooperation on exams
Administrative • Software • Everything may be done with Excel • You may use any software you like • Office hours • Tuesday 5-6pm • Thursday 5-6pm • Others by appointment
Administrative • Web site • http://www.andrew.cmu.edu/course/45-733/index.htm • Lecture PowerPoint slides available by clicking on relevant date’s topic • Homework • Distributed via web site, solutions also • Due in class according to syllabus • Returned in class next meeting
Administrative • Special class meeting • February 19, 8-9:50pm, GSIA 152 • Sections M,F meet simultaneously
What is statistics? • Systematic methods to analyze and present numerical information • OR, a systematic way of discussing both our knowledge and our ignorance arising from numerical information
Who cares about statistics? • Increasing relevance of numerical data • Importance of correctly assessing information • Use numerical data to construct good estimates • Know the limitations of the estimates
What is statistics: Systematic • “Most Americans like our product” vs. “59% ± 0.6% prefer our product to our leading competitor’s” • “The economy will contract” vs. “GDP will contract by 2.2% ± 0.4%” • “Women are more likely to buy our product” vs. “Women are 15% ± 3% more likely to buy our product than are men”
What is statistics: Analyze • We want to know about a population • Real populations • Population of people in the US • Population of our customers • Population of units off of our production line • Imaginary populations • Population of ways economy might work • Population of ways consumers might react to our new product
What is statistics: Analyze • What we want to know about population • How big/small is some quantity • Average income • % who approve of G.W. Bush • Market share of Intel in x86 PC processors
What is statistics: Analyze • What we want to know about population • Does quantity differ in different groups • Average income of Northern vs. Southern households • Average income of Target vs. Walmart shoppers • Market share of Intel in desktop vs. mobile x86 PC processors
What is statistics: Analyze • What we want to know about population • How are two/more variables related • As income rises, how much does consumption of Starbucks coffee rise? • As family size rises, how do sensitivities to price and advertising change? • As people age, how does their sensitivity to advertising change?
What is statistics: Analyze • How can we know these things? • Collect a census • Accurate information on the whole population • Ask everyone in the US their income • Audit their answers carefully • This is always expensive and often impossible • Imaginary populations? • Parallel universes?
What is statistics: Analyze • How can we know these things? • Collect a sample • Sample: A few members of a population. • Accurate information on the sample • Ask 100 people in the US their income • Audit their answers carefully? • But knowing all about the sample ≠ knowing all about the population!
What is statistics: Analyze • Going from sample to population, we hope for: • A good description of the sample • An estimate (“informed guess”) of what we want to know about the population • A statement about how far off our estimate might be
What is statistics: Analyze • Population and sample, an example • What is avg household income in US? • Phone survey of 100 households (completed interviews), asking their incomes • Sample = {$50K, $23K, … , $180K} • Average sampled income is, say, $53K • My estimate of US avg household income is $53K and I am 95% sure that it is in the range $53K ± $3K
What is statistics? • Description of a sample • Analysis: estimation of quantities of interest for a population • Levels • Differences • Relationships • Analysis: statement of how far off the estimates might be
Data description • Topic of chapter 2 is describing data • This is the part of the definition of statistics in which we describe our data • This is also the part of our goals in sampling in which we describe our sample accurately • Topic of the rest of the book/course will be analysis
Data Description: Population and Sample • Population • All of the relevant people/units you are interested in • For example • Population of people in the US • Population of our customers • Population of light bulbs from our production facility
Data Description: Population and Sample • Sample • A subset of a population. Only a few of the units we are interested in. • For example: • Survey of 100 people in the US • Survey of 5 of our customers • 1 in 100,000 of the light bulbs from our production facility
Data Description: Dataset • A dataset is just a group of numbers measuring something • Could be a population • Could be a sample • Examples • A list of all the incomes of all the households in the US • A list of the incomes of 5 of our customers • A list of the time to failure of 500 light bulbs from our production facility
Data Description: Dataset • Notation • Dataset = {4,5,12,3,…,0} • Dataset = {x_1, x_2, x_3, … , x_N} • x_i = any arbitrary one of the elements in our dataset • N = the number of elements (observations) in our dataset
Data Description: Dataset • Notation • Example: • Dataset = {4,5,12,3} • x_1 = 4 • x_3 = 12 • N = 4
Data Description: Measures of central tendency • Measures of central tendency • Measure where the “middle” of the data is • Useful if you want to know what an average or typical member of your sample/population looks like
Data Description: Measures of central tendency • Mean • Also known as average • Calculated by adding up all the observations and dividing by the number of observations: $\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N}$
Data Description: Measures of central tendency • Mean • Can also be written: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$, where $\bar{x}$ denotes the mean
Data Description: Measures of central tendency • Mean • Example: • Dataset = {53,45,23,19,87} • Mean = (53 + 45 + 23 + 19 + 87)/5 = 227/5 = 45.4
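As a minimal sketch of the mean calculation in Python (the function name `mean` is chosen here for illustration and does not come from the slides):

```python
def mean(data):
    """Add up all the observations and divide by the number of observations."""
    return sum(data) / len(data)


dataset = [53, 45, 23, 19, 87]
print(mean(dataset))  # the slide's example dataset
```

Python's standard library offers `statistics.mean` for the same purpose; the hand-written version is shown only because it mirrors the definition directly.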
Data Description: Measures of central tendency • Median • Also known as the 50th percentile • Is the point in the data where half of the observations are greater and half are lesser. • Calculated by sorting the data and choosing the middle value
Data Description: Measures of central tendency • Median • Example: • Dataset = {53,45,23,19,87} • Dataset sorted = {19,23,45,53,87} • Median = 45
Data Description: Measures of central tendency • Median • Example: • Dataset = {53,45,23,19,87,100} • Dataset sorted = {19,23,45,53,87,100} • Median = (45+53)/2 = 49
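The two median examples (odd and even number of observations) can be sketched in Python as follows. The function name `median` is illustrative; the even-length case averages the two middle values, as in the second example:

```python
def median(data):
    """Sort the data and choose the middle value.

    With an even number of observations there is no single middle value,
    so the two middle values are averaged.
    """
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([53, 45, 23, 19, 87]))       # odd number of observations
print(median([53, 45, 23, 19, 87, 100]))  # even number of observations
```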
Data Description: Measures of central tendency • Percentiles • The 25th percentile is the point at which 25% of the data are lesser and 75% of the data are greater • The 75th percentile is the point at which 75% of the data are lesser and 25% of the data are greater • The Yth percentile is the point at which Y% of the data are lesser and (100-Y)% of the data are greater • The median is the 50th percentile
Data Description: Measures of central tendency • Percentiles • Calculation • Sort the dataset • The 25th percentile is observation (N+1)/4 or 0.25*(N+1) • The 75th percentile is observation 3*(N+1)/4 or 0.75*(N+1) • The Yth percentile is observation (Y/100)*(N+1) • Use interpolation if (Y/100)*(N+1) is not a whole number
Data Description: Measures of central tendency • Percentiles • Calculation • Use interpolation if (Y/100)*(N+1) is not a whole number • If there are 10 observations, and you want the 44th percentile • 0.44*(10+1)=4.84 • So, the 44th percentile will be the number 84% of the way between observations 4 and 5 • 44th percentile = 0.16*x_4 + 0.84*x_5
Data Description: Measures of central tendency • Percentiles • Example • Dataset = {45,23,110,19,87,36,100} • Sorted dataset = {19,23,36,45,87,100,110} • 25th percentile=23 • 75th percentile=100 • 50th percentile (median)=45
Data Description: Measures of central tendency • Percentiles • Example • Dataset = {45,23,110,19,87,36,100} • Sorted dataset = {19,23,36,45,87,100,110} • 82nd percentile? • 0.82*(7+1)=6.56 • 82nd percentile = 0.44*x_6 + 0.56*x_7 • 82nd percentile = 0.44*100 + 0.56*110 = 105.6
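The (Y/100)*(N+1) rule with interpolation can be sketched in Python as below. The function name `percentile` is illustrative, and clamping to the smallest or largest observation when the position falls outside 1..N is an assumption added here; the slides do not say what to do in that case.

```python
def percentile(data, y):
    """Y-th percentile via the (Y/100)*(N+1) position rule.

    Positions are 1-based, as on the slides. A fractional position is
    resolved by linear interpolation between the two neighbouring
    observations.
    """
    ordered = sorted(data)
    n = len(ordered)
    position = (y / 100) * (n + 1)
    lower = int(position)          # whole part of the position
    fraction = position - lower    # how far to move toward the next observation
    if lower < 1:                  # assumption: clamp below the first observation
        return ordered[0]
    if lower >= n:                 # assumption: clamp at or beyond the last observation
        return ordered[-1]
    return (1 - fraction) * ordered[lower - 1] + fraction * ordered[lower]


dataset = [45, 23, 110, 19, 87, 36, 100]
for y in (25, 50, 75, 82):
    print(y, round(percentile(dataset, y), 2))
```

Other software may use a different position rule, so results from a spreadsheet or statistics package can differ slightly from this one for the same data.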
Data Description: Measures of central tendency • Mode • The most common value in the dataset • Can be thought of as the most typical value • Example • Dataset = {53,45,45,23,19,87,100} • Mode = 45 • Example • Dataset = {53,45,45,23,19,87,87} • Modes = 45 and 87; the data are bimodal
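A short Python sketch of both mode examples, using `statistics.multimode` from the standard library (available from Python 3.8), which returns every value tied for most common:

```python
from statistics import multimode

unimodal = [53, 45, 45, 23, 19, 87, 100]
bimodal = [53, 45, 45, 23, 19, 87, 87]

print(multimode(unimodal))  # one most-common value
print(multimode(bimodal))   # two values tie, so the data are bimodal
```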
Data Description: Measures of dispersion • Measures of dispersion tell us how “spread out” our data are: • Compare: • Dataset 1 = {53,45,23,19,87} • Dataset 2 = {44,47,43,45,48} • Both have a mean of 45.4 • Dataset 1 is more spread out, however
Data Description: Measures of dispersion • Let’s display the datasets graphically: [Figure: plot of DS1 and DS2; DS1 labeled at 19, 45, 87]
Data Description: Measures of dispersion • A good measure of dispersion will, for example, be bigger for DS1 than for DS2 [Figure: plot of DS1 and DS2; DS1 labeled at 19, 45, 87]
Data Description: Measures of dispersion • “Average deviation” • One way to think about dispersion is to ask how far, on average, points are from the mean • Call d_i the deviation from the mean: $d_i = x_i - \bar{x}$, where $\bar{x}$ is the mean
Data Description: Measures of dispersion • “Average deviation” • Dataset 1 = {53,45,23,19,87} • Deviations 1 = {7.6,-0.4,-22.4,-26.4,41.6} • Maybe the average of the d_i will be a good measure of dispersion • (7.6-0.4-22.4-26.4+41.6)/5 = 0 • Hmmm.
Data Description: Measures of dispersion • “Average deviation” • Does this always happen?
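One way to see that the average deviation is zero for any dataset is the short derivation below. It is a sketch that writes $\bar{x}$ for the mean and $d_i = x_i - \bar{x}$ for the deviations (notation assumed here):

```latex
\frac{1}{N}\sum_{i=1}^{N} d_i
  = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \bar{x}\right)
  = \frac{1}{N}\sum_{i=1}^{N} x_i - \frac{1}{N}\, N\,\bar{x}
  = \bar{x} - \bar{x}
  = 0
```

The first term is the mean by definition, and the second is the mean added N times and divided by N, so the two always cancel.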
Data Description: Measures of dispersion • “Average deviation” • What happened? • Average deviation offsets negative deviations against positive deviations. • So observations below the mean make this measure smaller, while observations above the mean make it bigger. • Both kinds should count as positive
Data Description: Measures of dispersion • Mean absolute deviation: $\mathrm{MAD} = \frac{1}{N}\sum_{i=1}^{N} \left|x_i - \bar{x}\right|$, where $\bar{x}$ is the mean
Data Description: Measures of dispersion • Mean absolute deviation • Example • Dataset 1 = {53,45,23,19,87} • Deviations 1 = {7.6,-0.4,-22.4,-26.4,41.6} • Absolute Dev 1 = {7.6,0.4,22.4,26.4,41.6} • MAD=(7.6+0.4+22.4+26.4+41.6)/5=19.68
Data Description: Measures of dispersion • Mean absolute deviation • Example • Dataset 2 = {44,47,43,45,48} • Deviations 2 = {-1.4,1.6,-2.4,-0.4,2.6} • Absolute Dev 2 = {1.4,1.6,2.4,0.4,2.6} • MAD=(1.4+1.6+2.4+0.4+2.6)/5=1.68
Data Description: Measures of dispersion • Mean absolute deviation • DS1: MAD = 19.68 • DS2: MAD = 1.68 [Figure: plot of DS1 and DS2; DS1 labeled at 19, 45, 87]
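The mean absolute deviation for both datasets can be checked with a few lines of Python. The function name `mean_absolute_deviation` is chosen here for illustration; results are rounded for display because floating-point arithmetic may leave tiny trailing errors.

```python
def mean_absolute_deviation(data):
    """Average distance of the observations from their mean."""
    m = sum(data) / len(data)
    return sum(abs(x - m) for x in data) / len(data)


ds1 = [53, 45, 23, 19, 87]
ds2 = [44, 47, 43, 45, 48]

print(round(mean_absolute_deviation(ds1), 2))  # the more spread-out dataset
print(round(mean_absolute_deviation(ds2), 2))  # the tightly clustered dataset
```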
Data Description: Measures of dispersion • Variance • Solves the problem of negative deviations in a different way • Variance calls for the deviations to be squared:
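The sketch below illustrates the idea of squaring the deviations, under a stated assumption: the text here does not show which divisor the course uses, so the hypothetical parameter `ddof` switches between dividing by N (`ddof=0`) and by N-1 (`ddof=1`). Both conventions are common; neither is claimed to be the one on the slide.

```python
def variance(data, ddof=0):
    """Average of the squared deviations from the mean.

    ddof is a hypothetical parameter for this sketch:
    ddof=0 divides by N, ddof=1 divides by N-1.
    """
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** 2 for x in data) / (n - ddof)


ds1 = [53, 45, 23, 19, 87]
ds2 = [44, 47, 43, 45, 48]

for name, ds in (("DS1", ds1), ("DS2", ds2)):
    print(name, round(variance(ds), 2), round(variance(ds, ddof=1), 2))
```

Squaring makes every deviation non-negative, so, as with the mean absolute deviation, the measure comes out larger for DS1 than for DS2 under either divisor.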