Collecting Data



  1. Statistics 111 - Lecture 2 Collecting Data Surveys and Sampling/ Graphs of a Single Variable Stat 111 – Lecture 2 Sampling and Graphing

  2. Administrative Notes • Lecture notes on website • Office hours today from 3-4:30pm • Homework 1 available on website • Due at beginning of class on Monday, June 1 • JMP “how to” guide for the homework on website

  3. Outline for First Half of Lecture • Introduction to Sampling • Voluntary Response Samples • Simple Random Samples • Sources of Sampling Bias • More complicated sampling schemes • Preview of Inference • Bias versus Variability • Read: Section 3.3

  4. Survey Definitions • Population: entire group of objects or people about which information is sought • Census: survey of an entire population • Sample: survey that examines only a portion of the population • Parameter: a numerical characteristic of the population • Statistic: a numerical characteristic of the sample
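The parameter/statistic distinction can be made concrete with a small simulation. This is only a sketch: the population of 10,000 heights, its mean of 170 cm, and the sample size of 100 are all made-up values chosen for illustration.

```python
import random
import statistics

random.seed(111)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 10,000 people.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameter: a number computed from the whole population (what a census gives).
parameter = statistics.mean(population)

# Statistic: the same quantity computed from a sample of 100 individuals.
sample = random.sample(population, 100)
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The statistic will usually land close to the parameter but not exactly on it, and it changes from sample to sample while the parameter stays fixed.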

  5. Why Sample? • Expense: cheaper than a census • Nielsen ratings: based on 5000 out of an estimated 105.5 million US households with TVs • Time: quicker than a census • Exit polls: give news agencies valuable (?) information on election day in order to project the election before all votes (census) are counted • Sampled units must sometimes be destroyed (or changed) to measure characteristics • Reliability studies: testing lifetime of light bulbs, strength of windshields, etc.

  6. Sampling Bias • Systematic errors that result in a sample that is not representative of the overall population of interest • Just like in experiments, we must be cautious of potential sources of bias in our sampling results

  7. Voluntary Response Samples • People choose to include themselves in the sample by responding to a general appeal • E.g., Amazon consumer ratings • Results are often biased because people with strong opinions (usually negative) are more likely to respond and be included in the sample

  8. Hite Report: Women and Love (1987) • Hite mailed 100,000 questionnaires to groups of women professionals, counseling centers, church societies, and senior citizens' centers. Only 4.5% were returned • 84% of women are “not satisfied emotionally with their relationships” (p. 804) • 70% of all women “married five or more years are having sex outside of their marriages” (p. 856) • 95% of women “report forms of emotional and psychological harassment from men with whom they are in love relationships” (p. 810) • 84% of women report forms of condescension from the men in their love relationships (p. 809)

  9. Simple Random Sampling (SRS) • Just as an experiment can be improved by randomization, so can sampling • Each individual in the population has an equal chance of being included in the sample • Does not allow self-response or evaluators to influence makeup of the survey (kinda like double-blinding in experiments)
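The "equal chance" property can be checked by simulation. The sketch below assumes a hypothetical sampling frame of 20 numbered individuals and a sample size of 5; it uses Python's standard `random.sample`, which draws without replacement, and counts how often each individual ends up in the sample over many repeated draws.

```python
import random
from collections import Counter

random.seed(2)  # fixed seed for reproducibility

frame = list(range(20))  # hypothetical sampling frame: individuals 0..19
n = 5                    # size of each simple random sample
draws = 20_000           # number of repeated samples

counts = Counter()
for _ in range(draws):
    # random.sample picks n distinct individuals, all subsets equally likely
    counts.update(random.sample(frame, n))

# Fraction of samples in which each individual was included.
inclusion = {i: counts[i] / draws for i in frame}

for i in frame:
    print(f"individual {i:2d}: included in {inclusion[i]:.3f} of samples")
```

Every individual's inclusion rate should come out near n/N = 5/20 = 0.25, which is what "equal chance" means here.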

  10. Example: Presidential Elections • In 1912 Literary Digest began using surveys to predict US presidential elections • “The poll represents 30 years constant evolution and perfection...” • In the 1936 Roosevelt vs Landon election, they polled 10 million voters: • 1,293,669 said they would vote for Landon • 972,897 said they would vote for Roosevelt • Reality: Landslide victory (61% to 37%) for FDR • What went wrong?
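The arithmetic behind the 1936 poll is worth doing explicitly. This sketch uses only the counts quoted on the slide; it assumes, for illustration, that the two candidates' counts together make up all responses and that "polled 10 million" means 10 million ballots sent out.

```python
# Counts quoted on the slide.
landon = 1_293_669
roosevelt = 972_897
mailed = 10_000_000  # assumption: number of ballots sent out

# Assumption: responses for other candidates are ignored.
responses = landon + roosevelt

landon_share = landon / responses
roosevelt_share = roosevelt / responses
response_rate = responses / mailed

print(f"poll share for Landon:    {landon_share:.1%}")
print(f"poll share for Roosevelt: {roosevelt_share:.1%}")
print(f"response rate:            {response_rate:.1%}")
```

Under these assumptions the poll gives Landon about 57% of responses, against an actual result of 61% for Roosevelt, and fewer than a quarter of those polled answered at all.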

  11. Biases in Random Samples • Randomization doesn’t correct for certain problems with sampling • Bias 1: Undercoverage: some groups in the population are left out of the process of choosing the sample • Bias 2: Nonresponse: sampled individuals cannot be contacted or do not cooperate • E.g., 1936 presidential polls • Low response rate: fewer than 25% of those polled responded • Undercoverage of poorer demographics: sample of voters relied heavily on lists of automobile and telephone owners, who were generally more affluent voters • Well, at least we learned from those mistakes, right?

  12. Recent Presidential Elections • Using exit polls, several networks reported early that Gore would win Florida in the 2000 election • Using exit polls, several pundits predicted Kerry would win Ohio in the 2004 election • In general, we have gotten better, but we can still make mistakes (especially when the margin itself is so small)

  13. More Potential Problems with Surveys • Response Bias: respondents may not answer survey questions truthfully • Illegal or unpopular behavior such as drug usage • Controversial topics such as teen sexual activity • Race or gender of interviewer can influence answers to race- or gender-related questions • Respondents often have trouble remembering past events, e.g., yearly nutrition and health surveys

  14. More Potential Problems with Surveys • Wording of questions can be confusing or intentionally lead the respondent • Do you favor a ban on disposable diapers? • It is estimated that disposable diapers account for less than 2% of the trash in today’s landfills. In contrast, beverage containers, third-class mail and yard wastes account for 21% of the trash in landfills. Given this, would it be fair to ban disposable diapers? • Complicated multi-part forms that require lots of skipped questions lead to a drop off in response

  15. More Complicated Random Surveys • Weakness of simple random sampling is that you cannot use extra information about the population (similar to blocking in experiments) • What if you know a particular group is missing from your sample? • Stratified random sampling: individuals are divided into groups called strata • Simple random sampling done within each stratum • National surveys can be even more complicated, using multistage sampling (cheaper)
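Stratified random sampling as described above can be sketched in a few lines. Everything specific here is hypothetical: the "urban"/"rural" strata, the population sizes, the helper name `stratified_sample`, and the choice to sample the same fraction from each stratum (proportional allocation, which is one option among several).

```python
import random

random.seed(3)  # fixed seed for reproducibility

# Hypothetical population: 800 urban and 200 rural individuals.
population = [("urban", i) for i in range(800)] + [("rural", i) for i in range(200)]

def stratified_sample(units, key, fraction):
    """Draw a simple random sample of the given fraction within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(key(unit), []).append(unit)

    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(random.sample(members, k))  # SRS inside the stratum
    return sample

sample = stratified_sample(population, key=lambda unit: unit[0], fraction=0.1)

n_rural = sum(1 for stratum, _ in sample if stratum == "rural")
print(f"sample size: {len(sample)}, rural: {n_rural}, urban: {len(sample) - n_rural}")
```

Unlike a plain simple random sample of 100, this design guarantees that exactly 20 rural individuals are included, so no known group can be missing by chance.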

  16. Dinner and Drugs Study • Study by CASA that linked frequent family dining to reduced risk of substance abuse: “There is no more important thing that a parent can do” • Some problems with the study that relate to what we know about surveys and observational studies • Problem 1: undercoverage of minority groups • Survey not representative of teen population

  17. Dinner and Drugs Study II • Problem 2: high level of non-response in survey • Many households declined to answer, didn’t complete survey or denied permission to use • Problem 3: observational study with lots of potential confounding variables • Drug use itself wasn’t measured, but rather a risk score for drug use • Study isn’t adjusted for age, which is also associated with drug use • No proof of causation!

  18. After Break • Exploring Data: Graphical summaries of a single variable • Moore, McCabe and Craig: Section 1.1

  19. Break! • 5 minutes • More awesome statistics to come

  20. Outline for Second Half of Lecture • Characteristics of Distributions • Center, spread, shape, outliers • Plotting Distributions of Data • Boxplots • Histograms (no stem and leaf plots) • Density Curves • Read: Section 1.1 Stat 111 - Lecture 4 - Graphing

  21. Definitions • Variable: any characteristic that takes different values for different individuals • Categorical variables place an individual into one of several groups • Examples: gender, race • Quantitative variables take on numerical values that are usually considered as continuous • Examples: height, age, wages

  22. Distributions • A distribution describes what values a variable takes and how frequently these values occur. • The distribution of a variable can be described graphically and numerically in terms of: • Center: where are most of the values located? • Spread: how variable are the values? • Shape: is the distribution symmetric or skewed? Are there multiple peaks or just one? • Outliers: are there certain values that seem surprisingly large or small?

  23. Barplots and Pie Charts • For categorical variables, we can graph the distribution using bar plots and pie charts

  24. Barplots and Pie Charts • Pie charts are generally not as useful as bar plots • Need to have all categories to make a pie chart • harder to compare subsets of categories • Scale of pie charts can sometimes be misleading • harder to see small differences

  25. Boxplots • Box plots are an effective tool for conveying information about continuous variables • Box contains the central 50% of the data, with a line indicating the median • Median is the value with 50% of data on either side • Whiskers contain most of the rest of the data, except for suspected outliers • Outliers are suspiciously large or small values
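The numbers a boxplot draws can be computed directly. The sketch below uses made-up shoe sizes (not the class data) and two assumptions the slide does not state: quartiles are computed with the "inclusive" method of Python's `statistics.quantiles`, and suspected outliers are flagged with the common 1.5 × IQR rule. Other software may use slightly different conventions for both.

```python
import statistics

# Hypothetical shoe sizes, for illustration only.
sizes = [5, 6.5, 7, 7.5, 8, 8.5, 8.5, 9, 10, 10, 11, 13, 14.5]

# Quartiles: the box spans q1 to q3, with a line at the median.
q1, median, q3 = statistics.quantiles(sizes, n=4, method="inclusive")
iqr = q3 - q1  # width of the box (central 50% of the data)

# Assumed rule: points more than 1.5 * IQR beyond the box are suspected outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sizes if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme values that are not outliers.
inside = [x for x in sizes if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"box: {q1} to {q3}, median {median}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"suspected outliers: {outliers}")
```

For this made-up data the box runs from 7.5 to 10 with the median at 8.5, and the single value 14.5 falls beyond the upper fence.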

  26. Boxplot: Shoe Size of Stat 111 Class • Almost all values are between 5 and 13 • 50% of values are between 7.5 and 10 • Center (Median) is around 8.5 • Couple of suspected outliers: 14 and 14.5

  27. Summary of Boxplots • Useful for displaying center and spread of a distribution, as well as potential outliers • However, boxplot doesn’t really give us much of an idea of the shape of the distribution • Histograms are much better graphical summaries of shape • We’ll see boxplots again in Chapter 2, for comparing distributions across groups

  28. Histograms • Histograms emphasize frequency of different values in the distribution • X-axis: Values are divided into bins • Y-axis: Height of each bin is the frequency that values from that bin appear in dataset

  29. Another Example: Height in Stat 111 • Vertical axis is sometimes the density (or relative frequency): equal to the frequency of the bin divided by the total number of observations
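The binning and the relative-frequency scaling described above can be sketched without any plotting library. The heights (in inches), the bin start, and the bin width are made-up values for illustration, and `histogram_counts` is a hypothetical helper name.

```python
# Hypothetical heights in inches, for illustration only.
heights = [60, 62, 63, 64, 64, 65, 66, 66, 67, 68, 68, 69, 70, 71, 72, 74]

def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

start, width, n_bins = 60, 4, 4
counts = histogram_counts(heights, start, width, n_bins)

# Relative frequency: bin frequency divided by the total number of observations.
rel_freq = [c / len(heights) for c in counts]

for k, (c, r) in enumerate(zip(counts, rel_freq)):
    lo = start + k * width
    print(f"[{lo}, {lo + width}): {'#' * c:<8} count={c} rel. freq.={r:.3f}")
```

Switching the vertical axis from counts to relative frequencies changes only the scale, not the shape: the relative frequencies always sum to 1.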

  30. Histograms versus Boxplots • Both graphs give a good idea of the spread • Boxplots may be a little clearer in terms of the center and outliers in a distribution [Figure labels: center, outliers, spread of likely values]

  31. Histograms versus Boxplots • Histograms much more effective at displaying the shape of a distribution • Skewness: departure from left-right symmetry • Multi-modality: presence of multiple high frequency values [Figure labels: clearly not symmetric; not symmetric?; possible second peak?]

  32. Symmetry - Histograms vs. Boxplots

  33. Density Curves • Often easier to examine a distribution with a smooth curve instead of a histogram • Example: vocabulary scores from 947 seventh graders in Gary, Indiana

  34. Example with Test Score Data • Number of scores less than 6 in population is 287 out of 947, so relative frequency is 0.303 • Using a density curve (normal distribution), the approximate frequency is 0.293
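The comparison above can be reproduced numerically. The slide does not give the parameters of the normal curve; the sketch assumes a mean of 6.84 and a standard deviation of 1.55, which are the values usually quoted for this textbook example, so treat them as an assumption rather than part of the lecture. The normal cumulative probability is computed from the standard library's `math.erf`.

```python
import math

# From the slide: 287 of the 947 scores are less than 6.
empirical = 287 / 947

# Assumed parameters of the fitted normal curve (not stated on the slide).
mean, sd = 6.84, 1.55

def normal_cdf(x, mu, sigma):
    """P(X < x) for a normal distribution, via the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

approx = normal_cdf(6, mean, sd)

print(f"relative frequency from data:  {empirical:.3f}")
print(f"area under normal curve (< 6): {approx:.3f}")
```

With these assumed parameters the area under the curve comes out close to the slide's 0.293; the small gap from the observed 0.303 is the price of replacing the histogram with a smooth curve.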

  35. Approximations • Real data will never exactly fit a density curve, i.e., be exactly symmetric or normally distributed • We will talk later in the course about how to fit these density curves, and we will use them to make probability calculations

  36. Next Class - Lecture 3 • Using JMP • Exploring Data: Numerical summaries of a single variable • Moore and McCabe: Section 1.2
