QM222 Class 2 Section D1: Describing Data
Choose your seat. This will be your permanent seat.
QM222 Fall 2016 Section D1
Schedule
• Reminder: I have office hours today 2:15-3:30 (after class) or W 11-12:15 (before class)
• You should have been working on your topic. If you don’t have a clue, come to office hours.
• TA office hours:
  • Danika Guiley (dnkgly@bu.edu): T 7:00-8:00 pm (205A), Sun 1:00-3:00 (undergrad lounge)
  • Justin Padilla (jpadilla@bu.edu): W 5:00-6:00 (205A), Th 5:00-6:00 (205A)
• Assignment 1 is due next Monday, Sept 19.
• Sign up for an appointment (see signup; Sept 21 is the latest day)
• You should have bought Stata, or be planning to go regularly to the third floor computer lab.
• NEXT CLASS WILL MEET IN A DIFFERENT CLASSROOM: Room 314 (which has Stata on its computer)
Assignment 1
• What specific question or questions will your project address?
• What is the data set you plan to use?
• What is the main variable (or variables) in this data set that you plan to predict or explain?
• What company, governmental body or other organization would be interested in knowing the answer to this question?
Today’s Objectives
• Review data characteristics
• Describing a single variable: What statistics can be used
  • Mean and median
  • Standard deviation
  • Percentages
  • Distributions
• How to calculate them in Excel
• How to use them to answer questions
What data sets look like (movie data from IMDb and Metacritic)
How Data-sets are organized
• Each row is an observation.
  • One occurrence of the thing you are examining.
  • In the movie data set, the observation is one movie.
  • n is the number of observations.
• Each column is a variable.
  • Something you know about the observation.
  • Here: the name, year, metascore, budget etc.
• Notation: X is a variable, and X_i is the value of X for observation i
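The row/column layout and the X_i notation can be sketched in a few lines of Python. The movie titles and numbers below are invented placeholders, not the class data, and the field names (such as `budget_musd`) are assumptions made only for this illustration.

```python
# Hypothetical rows: titles and numbers are invented for illustration only.
movies = [
    {"name": "Movie A", "year": 2010, "metascore": 71, "budget_musd": 40.0},
    {"name": "Movie B", "year": 2012, "metascore": 55, "budget_musd": 120.0},
    {"name": "Movie C", "year": 2015, "metascore": 83, "budget_musd": 15.0},
]

n = len(movies)                      # number of observations (rows)
variables = list(movies[0].keys())   # variable names (columns)

# X is one variable (one column); X[i - 1] plays the role of X_i,
# because the slides count observations from 1 and Python counts from 0.
X = [row["metascore"] for row in movies]
i = 2
x_i = X[i - 1]

print(n, variables, x_i)
```

Here n is 3, there are four variables, and X_2 is the metascore of the second movie.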
Types of Data
• Numerical (also known as Quantitative):
  • These variables take on a number, and represent some kind of measurement.
  • Ex: income, number of products sold, price, age, weight, height, etc.
• Categorical: puts an observation into a category, but is not easily represented by a number.
  • Ex: gender (male, female), race (white, black, Asian, etc.), college major (business, literature, history, etc.), seasons
• We will learn how to apply statistical tools to categorical data
• You must not use categorical data as if it were numerical.
Cross sectional v. Time Series v. Panel Data Sets
• Cross Section: at one point of time, each observation is a different person, company etc.
• v. Time series: each observation is a different point of time
• v. Cross Section-Time Series: each observation is a different company etc. at a specific point of time. Time here is a variable like all others.
• v. Panel data (longitudinal data): the same people/companies etc. are observed at different points of time; each observation is a specific company etc. at a specific point of time
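As a rough sketch of these four definitions, the function below labels a data set from its (unit, period) pairs. The function name `classify`, the firm names, and the decision rule are assumptions made for illustration; real data sets can be messier than this rule allows.

```python
def classify(rows):
    """rows: list of (unit, period) pairs, one pair per observation."""
    units = {unit for unit, _ in rows}
    periods = {period for _, period in rows}
    if len(periods) == 1:
        return "cross-section"        # many units, one point of time
    if len(units) == 1:
        return "time series"          # one unit, many points of time
    # Several units and several periods: call it a panel only if
    # at least one unit is observed more than once.
    repeated = len(rows) > len(units)
    return "panel" if repeated else "cross section-time series"


# Hypothetical examples
print(classify([("Firm A", 2016), ("Firm B", 2016)]))
print(classify([("Firm A", 2014), ("Firm A", 2015)]))
print(classify([("Firm A", 2014), ("Firm B", 2015), ("Firm C", 2016)]))
print(classify([("Firm A", 2014), ("Firm A", 2015),
                ("Firm B", 2014), ("Firm B", 2015)]))
```

The four calls print cross-section, time series, cross section-time series, and panel, in that order.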
What kind of data?
A. Cross-section
B. Time-series
C. Cross section-Time series
D. Panel
Descriptive statistics (SM221 review)
Measuring the middle
• Mean
  • Add up all the values and divide by the number of observations
  • In Excel: =average(...)
• Median
  • Half the observations are greater than the median, half the observations are smaller than the median.
  • In Excel: =median(...)
• Are the Mean and the Median equal?
• Which of the two is affected by outliers?
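To see how one outlier moves the mean but barely moves the median, here is a minimal Python sketch using the standard `statistics` module. The minutes-late values are invented for illustration and are not the class data.

```python
from statistics import mean, median

# Hypothetical minutes-late values; 240 is the outlier.
arrivals = [10, 20, 30, 45, 50, 60, 240]

print(mean(arrivals))     # 65
print(median(arrivals))   # 45

# Drop the outlier and compare again.
without_outlier = arrivals[:-1]
print(round(mean(without_outlier), 1))   # 35.8
print(median(without_outlier))           # 37.5
```

Removing one value cuts the mean almost in half, while the median shifts by only a few minutes.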
How to estimate when people will arrive at a party:
Which do you think is larger, the median or the mean?
How to estimate when people will arrive at a party:
• Mean arrival: 67 minutes late
• Median arrival: 45 minutes late
• The mean is sensitive to outliers (those 180-240 mins late)
Percentiles
• Let’s say you ordered all the n observations of a variable, from lowest to highest (e.g. in a column of data)
  • For instance, from the lowest number of minutes late to the highest
• The one in the middle (of the column) is the person at the 50th percentile.
  • The one in the middle has the median number of minutes late (45)
• However, you can also talk about other percentiles.
You really DO NOT want to be in the worst 10%... So you do NOT want to be at or above the 90th percentile of lateness. When should you come?
You really DO NOT want to be in the worst 10%... So you do NOT want to be at or above the 90th percentile of lateness. When should you come?
The 90th percentile is at 2½ hours late. Come before that.
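A percentile lookup like this can be sketched in Python. The helper name `percentile` and the arrival times are assumptions for illustration (the numbers are not the party data). The `method="inclusive"` option of `statistics.quantiles` interpolates between sorted values, which to my understanding is the same convention Excel's =percentile uses.

```python
from statistics import quantiles


def percentile(data, p):
    """Value at the p-th percentile, for an integer p from 1 to 99."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    return quantiles(data, n=100, method="inclusive")[p - 1]


# Hypothetical minutes-late values for 11 guests.
late = [0, 5, 10, 15, 30, 45, 60, 90, 105, 150, 240]

print(percentile(late, 50))   # 45.0, the median
print(percentile(late, 90))   # 150.0
print(percentile(late, 25))   # 12.5
```

With these made-up numbers, arriving less than 150 minutes late keeps you out of the latest 10%.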
In-Class exercise Part 1
Suppose we want to compare how well off the average person is now versus in the past. We find the following statistic:
• Average (mean) income per capita rose from $7,787 in 1980 to $30,176 in 2014.
• What does this statistic imply about the change in per capita income between 1980 and 2014?
• What are the problems (there are more than 1!) with using the statistic, and what would be a better statistic to report?
In-Class exercise Part 1 answers
Suppose we want to compare how well off the average person is now versus in the past. We find the following statistic: Average (mean) income per capita rose from $7,787 in 1980 to $30,176 in 2014.
What does this statistic imply about the change in per capita income between 1980 and 2014?
• The mean (nominal) income per capita increased by almost 300% between 1980 and 2014: (30176-7787)/7787 = 2.875, an increase of 287.5%. Equivalently, 30176/7787 = 3.875, so 2014 income is 3.875 times 1980 income.
What are the problems (there are more than 1!) with using the statistic, and what would be a better statistic to report?
• Doesn’t adjust for inflation
• Growing inequality has changed the mean but not the median wage
• Giving more details about the distribution would be even better.
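The percent-change arithmetic can be checked with a short Python sketch. Only the two income figures come from the slide; the variable names are made up for this example, and no inflation adjustment is attempted because the slides give no price index.

```python
income_1980 = 7787
income_2014 = 30176

# Ratio: how many times larger 2014 income is than 1980 income.
ratio = income_2014 / income_1980

# Percent change: the increase as a share of the starting value.
pct_change = (income_2014 - income_1980) / income_1980 * 100

print(round(ratio, 3))        # 3.875
print(round(pct_change, 1))   # 287.5
```

The two numbers always differ by exactly 1 (or 100 percentage points), which is why a ratio of 3.875 corresponds to an increase of 287.5%, not 387.5%.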
Some real data
The mean was $76,836. Recall: the mean is sensitive to outliers!
To foreshadow what is coming later in lecture: it helps to look at the distribution of income.
Growing inequality has increased income at the higher percentiles, but not the median wage. What has happened to the MEAN?
Measures of the Spread
• Range: Max - Min
  • Simple, but it is highly affected by ONE unusual observation.
  • Excel: =max(a2:a64)-min(a2:a64)
• The 25th and 75th percentiles:
  • Excel: =percentile(a2:a64,0.25) gives the value at the 25th percentile of the data set
  • Excel: =percentile(a2:a64,0.75) gives the value at the 75th percentile of the data set
• Q: What share of observations are between the 25th and 75th percentiles?
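Here is a small Python sketch of both spread measures on invented data, with one deliberately unusual value. It uses `statistics.quantiles` with `method="inclusive"`, which I believe matches the interpolation behind Excel's =percentile.

```python
from statistics import quantiles

# Hypothetical data; 95 is the one unusual observation.
data = [2, 4, 4, 5, 7, 9, 10, 12, 95]

value_range = max(data) - min(data)

# n=4 returns the three quartile cut points: 25th, 50th, 75th percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(value_range)   # 93
print(q1, q3, iqr)   # 4.0 10.0 6.0
```

The single value of 95 stretches the range to 93, while the gap between the 25th and 75th percentiles stays at 6. By construction, about half of the observations lie between those two percentiles.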
Standard deviation
• How far is our data spread out around the mean?
• In Excel: =stdev(…)
• NOTE: The variance also measures how spread out our data is, and it equals the standard deviation squared!
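A minimal Python sketch of the standard deviation and its link to the variance, using invented exam scores. Both `statistics.stdev` and Excel's =stdev compute the sample version, which divides by n - 1 rather than n.

```python
from statistics import mean, stdev, variance

# Hypothetical scores, invented for illustration.
scores = [70, 74, 78, 82, 86]

m = mean(scores)        # 78
var = variance(scores)  # sum of squared deviations / (n - 1) = 160 / 4 = 40
sd = stdev(scores)      # square root of the variance

print(m, var, round(sd, 2))   # 78 40 6.32
```

Squaring `sd` recovers the variance, up to floating-point rounding.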
For Discussion
If I give you the bad news that you got 65 in the exam and the class average was 78, in which situation would you rather be:
(a) Std deviation of the class was 5
(b) Std deviation of the class was 13
(c) Or are you indifferent?
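One way to frame the comparison is to express the score as a number of standard deviations from the class mean (often called a z-score). The sketch below is only an aid to the discussion; the function name `z_score` is an assumption of this example.

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_a = z_score(65, 78, 5)    # situation (a)
z_b = z_score(65, 78, 13)   # situation (b)

print(z_a)   # -2.6
print(z_b)   # -1.0
```

The same 13-point gap is 2.6 standard deviations below the mean in (a) but only 1 standard deviation below in (b).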
Percentiles
• If we sort our variable from smallest to largest, the Yth percentile corresponds to the value where Y% of our observations lie below that value
• The 25th percentile and the 75th percentile can also provide a useful measure of dispersion (spread) in our data
• Excel example:
  • =percentile(a2:a64,.25) gives the value at the 25th percentile of the data set
  • =percentile(a2:a64,.75) gives the value at the 75th percentile of the data set
How to estimate when people will arrive at a party:
• 25th percentile: 15 minutes late
• 75th percentile: 1 hour 45 minutes late
In-Class exercise Part 2: What salary can you expect to get?
Use the dataset “Class 2ACS Business-Major Earnings” (available on our site under Other Materials→Data and other materials used in class). Fill in all but the last row.
Thinking about distributions
Sometimes plotting a histogram showing the distribution of our data can be very helpful. Histograms have the values of the variable on the X-axis, and the number of cases on the Y-axis.
Here is a histogram of the starting salaries of 195 Questrom BS graduates in 2016. What do we learn from it?
Distributions
• “Distributions” are similar to histograms, except that:
  • In distributions, the bins are tiny and are not separated (below I’ve drawn smaller but not tiny bins)
  • The Y-axis is the % of cases, not the number of cases
• Therefore the area beneath a distribution adds to 1 (100%).
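The step from a histogram (counts) to a distribution (shares of cases) can be sketched as follows. The salaries and the bin width are invented for illustration; they are not the Questrom data.

```python
from collections import Counter

# Hypothetical starting salaries in thousands of dollars.
salaries = [42, 48, 51, 55, 55, 58, 61, 63, 67, 90]
bin_width = 10

# Histogram: number of cases in each bin, keyed by the bin's lower edge.
counts = Counter((s // bin_width) * bin_width for s in salaries)

# Distribution: share of cases in each bin, so the shares add to 1.
shares = {b: c / len(salaries) for b, c in sorted(counts.items())}

for lower_edge, share in shares.items():
    print(f"{lower_edge}-{lower_edge + bin_width}: {share:.0%}")
```

The bin counts are 2, 4, 3 and 1, so the shares are 20%, 40%, 30% and 10%, which add to 100%.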
The most used distribution? The normal distribution
Normal Distribution
A “Normal distribution” looks like a symmetric bell curve
• Symmetric means that the right side of the mean is a mirror image of the left side
• Bell curves look like a bell.
• Notation here: μ is the mean, and σ is the standard deviation
• Approximately 68% (or around 2/3rds) of the observations are within one standard deviation of the mean.
• Approximately 95% of the observations are within two standard deviations of the mean.
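The 68% and 95% figures can be checked numerically with `NormalDist` from Python's standard `statistics` module. This sketch uses the standard normal (μ = 0, σ = 1); the shares are the same for any normal distribution.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)

# Share of observations within one and two standard deviations of the mean.
within_1 = z.cdf(1) - z.cdf(-1)
within_2 = z.cdf(2) - z.cdf(-2)

print(round(within_1, 4))   # 0.6827
print(round(within_2, 4))   # 0.9545
```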
The opposite of a symmetric distribution is a skewed distribution
• We saw that income was skewed to the right.
• Another well-known type of data that is skewed to the right is medical costs
  • Most patients are very cheap, but a few massively expensive patients drive a lot of our growing healthcare costs
  • Policymakers and insurance companies worry a lot about how to reduce costs for these most sick and expensive patients!
(Back to normal distributions) Weather: Climate change by graphs
• This plot shows temperature patterns in the Northern Hemisphere for each decade from the 1950s through the 2000s.
• What is happening to the mean of the distribution over time?
• What is happening to the standard deviation of the distribution?
• What does this suggest about climate change?
Misleading statistics
George W. Bush pushed a tax reform that would give 92 million Americans an average reduction of over $1000.
• Why might this be a “deceptive statistic”?
• The median tax cut was less than $100!
Imagine a new drug that increases median life expectancy by 2 weeks.
• Why might this still be a useful drug?
• Maybe a minority of patients (perhaps 30%) are cured entirely!
• Here, average would be better, but we might want even more info.
Bottom line:
• Use the median when outliers distort the facts, or when you care about what is typical.
• Use the mean when you should take account of outliers.
• It’s a judgment call!
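To see how a mean above $1000 and a median below $100 can describe the same tax cut, here is a sketch with ten invented households. The dollar amounts are hypothetical and are not the actual figures behind the tax reform.

```python
from statistics import mean, median

# Hypothetical tax cuts in dollars for ten households; one is very large.
cuts = [0, 20, 40, 50, 60, 80, 90, 100, 150, 11410]

print(mean(cuts))     # 1200
print(median(cuts))   # 70.0
```

Nine of the ten households get $150 or less, yet the average cut is $1200 because of the one large value. Which number to report depends on whether you care about the typical household or the total.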