Statistics 111 - Lecture 2 Collecting Data Surveys and Sampling/ Graphs of a Single Variable Stat 111 – Lecture 2 Sampling and Graphing

Administrative Notes
• Lecture notes on website
• Office hours today from 3-4:30pm
• Homework 1 available on website
  • Due at beginning of class on Monday, June 1
• JMP “how to” guide for the homework on website

Outline for First Half of Lecture
• Introduction to Sampling
  • Voluntary Response Samples
  • Simple Random Samples
  • Sources of Sampling Bias
  • More complicated sampling schemes
• Preview of Inference
  • Bias versus Variability
• Read: Section 3.3

Survey Definitions
• Population: entire group of objects or people about which information is sought
• Census: survey of an entire population
• Sample: survey that examines only a portion of the population
• Parameter: a numerical characteristic of the population
• Statistic: a numerical characteristic of the sample

Why Sample?
• Expense: cheaper than a census
  • Nielsen ratings: based on 5000 out of an estimated 105.5 million US households with TVs
• Time: quicker than a census
  • Exit polls: give news agencies valuable (?) information on election day in order to project the election before all votes (census) are counted
• Sampled units must sometimes be destroyed (or changed) to measure characteristics
  • Reliability studies: testing lifetime of light bulbs, strength of windshields, etc.

Sampling Bias
• Systematic errors that result in a sample that is not representative of the overall population of interest
• Just like in experiments, we must be cautious of potential sources of bias in our sampling results

Voluntary Response Samples
• People choose for themselves whether to be included in the sample by responding to a general appeal
  • E.g. Amazon consumer ratings
• Results are often biased because people with strong opinions (usually negative) are more likely to respond and be included in the sample
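
To see how strongly self-selection can distort a result, here is a small worked calculation. It is only a sketch: the shares and response probabilities are hypothetical numbers chosen for illustration, not figures from the lecture.

```python
# Hypothetical numbers for illustration only; none come from the lecture.
share_dissatisfied = 0.20        # assumed true share of dissatisfied customers
respond_if_dissatisfied = 0.30   # assumed chance a dissatisfied customer responds
respond_if_satisfied = 0.03      # assumed chance a satisfied customer responds


def voluntary_response_share(p_group, r_group, r_other):
    """Expected share of respondents who belong to the group of interest."""
    responders_in_group = p_group * r_group
    responders_other = (1 - p_group) * r_other
    return responders_in_group / (responders_in_group + responders_other)


observed_share = voluntary_response_share(
    share_dissatisfied, respond_if_dissatisfied, respond_if_satisfied
)
overall_response_rate = (
    share_dissatisfied * respond_if_dissatisfied
    + (1 - share_dissatisfied) * respond_if_satisfied
)

print(f"True share dissatisfied:        {share_dissatisfied:.1%}")
print(f"Share among those who responded: {observed_share:.1%}")
print(f"Overall response rate:           {overall_response_rate:.1%}")
```

Under these assumed numbers, a group that makes up 20% of the population makes up about 71% of the respondents, simply because its members are ten times more likely to answer.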

Hite Report: Women and Love (1987)
• Hite mailed 100,000 questionnaires to groups of women professionals, counseling centers, church societies, senior citizens centers. Only 4.5% were returned
• 84% of women are “not satisfied emotionally with their relationships” (p. 804)
• 70% of all women “married five or more years are having sex outside of their marriages” (p. 856)
• 95% of women “report forms of emotional and psychological harassment from men with whom they are in love relationships” (p. 810)
• 84% of women report forms of condescension from the men in their love relationships (p. 809)

Simple Random Sampling (SRS)
• Just as an experiment can be improved by randomization, so can sampling
• Each individual in the population has an equal chance of being included in the sample, and every possible sample of the same size is equally likely to be chosen
• Does not allow self-selection or the surveyors to influence the makeup of the sample (kind of like double-blinding in experiments)
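
A simple random sample is easy to draw in code. The sketch below uses a made-up population of heights (an assumption for illustration) and compares the population mean, a parameter, with the sample mean, a statistic.

```python
import random
from statistics import mean

random.seed(111)  # fixed seed so the illustration is reproducible

# Hypothetical population: 10,000 heights in inches (not real data).
population = [random.gauss(67, 4) for _ in range(10_000)]
parameter = mean(population)  # numerical characteristic of the population

# random.sample draws without replacement; every subset of size 100
# is equally likely, which is what makes this a simple random sample.
sample = random.sample(population, 100)
statistic = mean(sample)  # numerical characteristic of the sample

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
```

Rerunning with a different seed gives a different statistic each time, while the parameter stays fixed for a given population.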

Example: Presidential Elections
• In 1912 Literary Digest began using surveys to predict US presidential elections
  • “The poll represents 30 years constant evolution and perfection...”
• In the 1936 Roosevelt vs Landon election, they polled 10 million voters:
  • 1,293,669 said they would vote for Landon
  • 972,897 said they would vote for Roosevelt
• Reality: Landslide victory (61% to 37%) for FDR
• What went wrong?
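
The arithmetic behind the 1936 poll can be checked directly from the counts on the slide. This sketch assumes the two counts are the complete set of usable responses and that the 10 million figure is the number of voters contacted.

```python
# Counts taken from the slide; treating them as all usable responses
# is an assumption made for this calculation.
contacted = 10_000_000
landon = 1_293_669
roosevelt = 972_897

responses = landon + roosevelt
landon_share = landon / responses
roosevelt_share = roosevelt / responses
response_rate = responses / contacted

print(f"Responses:           {responses:,}")
print(f"Poll: Landon         {landon_share:.1%}")
print(f"Poll: Roosevelt      {roosevelt_share:.1%}")
print(f"Response rate:       {response_rate:.1%}")
```

The poll put Landon at about 57% among respondents, while Roosevelt actually won with 61%, and fewer than a quarter of those contacted replied.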

Biases in Random Samples
• Randomization doesn’t correct for certain problems with sampling
• Bias 1: Undercoverage: some groups in the population are left out of the process of choosing the sample
• Bias 2: Nonresponse: sampled individuals cannot be contacted or do not cooperate
• E.g. 1936 presidential polls
  • Low response rate: fewer than 25% of those polled responded
  • Undercoverage of poorer demographics: sample of voters relied heavily on lists of automobile and telephone owners, who were generally more affluent voters
• Well, at least we learned from those mistakes, right?
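
Undercoverage can be illustrated with a small simulation. Everything here is hypothetical: the group sizes and voting shares are assumptions picked so that the true Roosevelt share comes out to 61%, and the "list" stands in for the automobile and telephone lists mentioned above.

```python
import random

random.seed(1936)  # fixed seed so the illustration is reproducible


def make_group(size, share_fdr, on_list):
    """Build a group with an exact (assumed) share of Roosevelt voters."""
    n_fdr = round(size * share_fdr)
    return ([{"vote": "Roosevelt", "on_list": on_list}] * n_fdr
            + [{"vote": "Landon", "on_list": on_list}] * (size - n_fdr))


# Hypothetical electorate; all numbers below are illustrative assumptions.
population = (make_group(40_000, 0.40, on_list=True)      # car/phone owners
              + make_group(60_000, 0.75, on_list=False))  # everyone else


def roosevelt_share(voters):
    return sum(v["vote"] == "Roosevelt" for v in voters) / len(voters)


true_share = roosevelt_share(population)

# Sampling frame with undercoverage: only voters who appear on the lists.
frame = [v for v in population if v["on_list"]]
frame_estimate = roosevelt_share(random.sample(frame, 2_000))

# Simple random sample from the whole population, for comparison.
srs_estimate = roosevelt_share(random.sample(population, 2_000))

print(f"True Roosevelt share:       {true_share:.1%}")
print(f"Estimate from list frame:   {frame_estimate:.1%}")
print(f"Estimate from full-pop SRS: {srs_estimate:.1%}")
```

The sample drawn from the list is perfectly random within the list, yet it still misses badly, because the people left off the list vote differently from those on it.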

Recent Presidential Elections
• Using exit polls, several networks reported early that Gore would win Florida in the 2000 election
• Using exit polls, several pundits predicted Kerry would win Ohio in the 2004 election
• In general, we have gotten better, but we can still make mistakes (especially when the difference itself is so small)

More Potential Problems with Surveys
• Response Bias: respondents may not answer survey questions truthfully
  • Illegal or unpopular behavior such as drug usage
  • Controversial topics such as teen sexual activity
• Race or gender of interviewer can influence answers to race- or gender-related questions
• Respondents often have trouble remembering past events, e.g. yearly nutrition and health surveys

More Potential Problems with Surveys
• Wording of questions can be confusing or intentionally lead the respondent
  • Do you favor a ban on disposable diapers?
  • It is estimated that disposable diapers account for less than 2% of the trash in today’s landfills. In contrast, beverage containers, third-class mail and yard wastes account for 21% of the trash in landfills. Given this, would it be fair to ban disposable diapers?
• Complicated multi-part forms that require lots of skipped questions lead to a drop off in response

More Complicated Random Surveys
• Weakness of simple random sampling is that you cannot use extra information about the population (similar to blocking in experiments)
• What if you know a particular group is missing from your sample?
• Stratified random sampling: individuals are divided into groups called strata
  • Simple random sampling done within each stratum
• National surveys can be even more complicated by using multistage sampling (cheaper)
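
Stratified random sampling can be sketched in a few lines: split the units into strata, then take a simple random sample inside each one. The roster below is hypothetical, and sampling the same fraction from every stratum (proportional allocation) is an assumption of this sketch, not something the slides specify.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, rng):
    """Take a simple random sample of the given fraction within each stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))       # SRS within the stratum
    return sample


rng = random.Random(111)  # fixed seed so the illustration is reproducible

# Hypothetical roster: (class year, student id). Not real data.
students = ([("freshman", i) for i in range(300)]
            + [("senior", i) for i in range(100)])

sample = stratified_sample(students, lambda s: s[0], 0.10, rng)
n_seniors = sum(1 for year, _ in sample if year == "senior")

print(f"Sample size: {len(sample)}, seniors in sample: {n_seniors}")
```

Unlike a simple random sample of 40 students, this design guarantees that the smaller group (seniors) is represented, here by exactly 10 students.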

Dinner and Drugs Study
• Study by CASA that linked frequent family dining to reduced risk of substance abuse
  • “There is no more important thing that a parent can do”
• Some problems with the study that relate to what we know about surveys and observational studies
• Problem 1: undercoverage of minority groups
  • Survey not representative of teen population

Dinner and Drugs Study II
• Problem 2: high level of non-response in survey
  • Many households declined to answer, didn’t complete survey or denied permission to use
• Problem 3: observational study with lots of potential confounding variables
  • Drug use itself wasn’t measured, but rather a risk score for drug use
  • Study isn’t adjusted for age, which is also associated with drug use
• No proof of causation!

After Break
• Exploring Data: Graphical summaries of a single variable
• Moore, McCabe and Craig: Section 1.1

Break!
• 5 minutes
• More awesome statistics to come

Outline for Second Half of Lecture
• Characteristics of Distributions
  • Center, spread, shape, outliers
• Plotting Distributions of Data
  • Boxplots
  • Histograms (no stem and leaf plots)
  • Density Curves
• Read: Section 1.1

Definitions
• Variable: any characteristic that takes different values for different individuals
• Categorical variables place an individual into one of several groups
  • Examples: gender, race
• Quantitative variables take on numerical values that are usually considered as continuous
  • Examples: height, age, wages

Distributions
• A distribution describes what values a variable takes and how frequently these values occur.
• The distribution of a variable can be described graphically and numerically in terms of:
  • Center: where are most of the values located?
  • Spread: how variable are the values?
  • Shape: is the distribution symmetric or skewed? Are there multiple peaks or just one?
  • Outliers: are there certain values that seem surprisingly large or small?

Barplots and Pie Charts
• For categorical variables, we can graph the distribution using bar plots and pie charts

Barplots and Pie Charts
• Pie charts are generally not as useful as bar plots
• Need to have all categories to make a pie chart
  • Harder to compare subsets of categories
• Scale of pie charts can sometimes be misleading
  • Harder to see small differences

Boxplots
• Box plots are an effective tool for conveying information about continuous variables
• Box contains the central 50% of the data, with a line indicating the median
  • Median is the value with 50% of data on either side
• Whiskers contain most of the rest of the data, except for suspected outliers
  • Outliers are suspiciously large or small values
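
The pieces of a boxplot can be computed by hand. The sketch below uses made-up shoe sizes (not the class data) and flags suspected outliers with the common 1.5 × IQR rule; that rule is an assumption here, since the slides do not say how outliers are chosen, and different software may compute quartiles slightly differently.

```python
from statistics import quantiles

# Hypothetical shoe sizes for illustration; not the Stat 111 class data.
sizes = [5, 6.5, 7, 7.5, 8, 8, 8.5, 9, 9.5, 10, 10.5, 11, 16]

# Quartiles: the box runs from q1 to q3, with a line at the median.
q1, med, q3 = quantiles(sizes, n=4, method="inclusive")
iqr = q3 - q1  # width of the box (central 50% of the data)

# Assumed convention: points more than 1.5 * IQR beyond the box are
# suspected outliers; whiskers extend to the most extreme other points.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in sizes if x < low_fence or x > high_fence]
inside = [x for x in sizes if low_fence <= x <= high_fence]
whiskers = (min(inside), max(inside))

print(f"Box: {q1} to {q3}, median {med}")
print(f"Whiskers: {whiskers[0]} to {whiskers[1]}")
print(f"Suspected outliers: {outliers}")
```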

Boxplot: Shoe Size of Stat 111 Class
• Almost all values are between 5 and 13
• 50% of values are between 7.5 and 10
• Center (Median) is around 8.5
• Couple of suspected outliers: 14 and 14.5

Summary of Boxplots
• Useful for displaying center and spread of a distribution, as well as potential outliers
• However, a boxplot doesn’t really give us much of an idea of the shape of the distribution
  • Histograms are much better graphical summaries of shape
• We’ll see boxplots again in Chapter 2, for comparing distributions across groups

Histograms
• Histograms emphasize frequency of different values in the distribution
  • X-axis: Values are divided into bins
  • Y-axis: Height of each bin is the frequency with which values from that bin appear in the dataset

Another Example: Height in Stat 111
• Vertical axis is sometimes the relative frequency: the frequency of the bin divided by the total number of observations (a density scale divides this further by the bin width, so that the bar areas sum to 1)
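
Binning and relative frequency can be worked through on a small example. The heights, bin width, and starting point below are all hypothetical choices for illustration, not the class data shown on the slide.

```python
from collections import Counter

# Hypothetical heights in inches; not the Stat 111 class data.
heights = [61, 62, 64, 64, 65, 66, 66, 67, 67, 68, 69, 70, 70, 72, 74]
bin_start = 60  # assumed left edge of the first bin
bin_width = 3   # assumed bin width


def bin_left_edge(x):
    """Left edge of the bin [edge, edge + bin_width) that contains x."""
    return bin_start + ((x - bin_start) // bin_width) * bin_width


counts = Counter(bin_left_edge(h) for h in heights)
n = len(heights)
relative = {left: counts[left] / n for left in counts}

for left in sorted(counts):
    bar = "#" * counts[left]  # crude text histogram
    print(f"[{left}, {left + bin_width})  freq={counts[left]}  "
          f"rel={relative[left]:.3f}  {bar}")
```

The frequencies add up to the number of observations and the relative frequencies add up to 1, so the two vertical scales give histograms with the same shape.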

Histograms versus Boxplots
• Both graphs give a good idea of the spread
• Boxplots may be a little clearer in terms of the center and outliers in a distribution
[Figure: boxplot and histogram annotated with center, outliers, and spread of likely values]

Histograms versus Boxplots
• Histograms are much more effective at displaying the shape of a distribution
  • Skewness: departure from left-right symmetry
  • Multi-modality: presence of multiple high frequency values
[Figure annotations: “clearly not symmetric”, “not symmetric?”, “possible second peak?”]

Symmetry - Histograms vs. Boxplots

Density Curves
• Often easier to examine a distribution with a smooth curve instead of a histogram
  • Example: vocabulary scores from 947 seventh graders in Gary, Indiana

Example with Test Score Data
• Number of scores less than 6 in the population is 287 out of 947, so the relative frequency is 0.303
• Using a density curve (normal distribution), the approximate relative frequency is 0.293
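
The two numbers on this slide can be reproduced as follows. The exact relative frequency uses only the counts above. The normal-curve value needs a mean and standard deviation, which the slides do not give; the values 6.84 and 1.55 used here are assumptions for this sketch, chosen because they land close to the 0.293 quoted above.

```python
from statistics import NormalDist

# Counts from the slide.
below_six = 287
total = 947
relative_frequency = below_six / total

# ASSUMED parameters: the slides do not state the mean or standard
# deviation of the fitted normal curve; these are illustrative values.
assumed_mean = 6.84
assumed_sd = 1.55
curve = NormalDist(mu=assumed_mean, sigma=assumed_sd)

# Area under the density curve to the left of 6.
normal_approximation = curve.cdf(6)

print(f"Exact relative frequency: {relative_frequency:.3f}")
print(f"Normal approximation:     {normal_approximation:.3f}")
```

The area under the curve to the left of 6 plays the same role as the sum of the histogram bars to the left of 6, which is why the two values are close but not identical.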

Approximations
• Real data will never exactly fit a density curve, i.e. be exactly symmetric or normally distributed
• We will talk later in the course about how to fit these density curves, and we will use them to make probability calculations

Next Class - Lecture 3
• Using JMP
• Exploring Data: Numerical summaries of a single variable
• Moore and McCabe: Section 1.2