Learn about biases, anecdotes, random sampling, and statistical inference in data collection. Understand the dangers of anecdotal evidence and the importance of random sampling. By the end of this lecture, you will be able to identify sources of bias, differentiate between types of biases, and grasp key concepts in data collection and inference.
Learning Objectives By the end of this lecture, you should be able to: • List and describe two common methods by which we obtain data. • Define anecdote with example(s). • List two key goals in obtaining a reasonably good sample. • Recognize and identify sources of bias in a sample. • Define ‘population’ in terms of its relationship to ‘sample’. • Define: selection bias, response bias, non-response bias, wording bias. • Define statistical inference. • Identify and explain two key potential pitfalls in inference. Clearly this is not a “numbers” oriented lecture. For quizzes and exams, I would suggest you go through the lecture a few times until you can answer these objectives in your own words.
Where does data come from? Two main sources of data: • Available Data: As you would expect, this is data that is already available from some other source. For example, if you were trying to do an analysis of SAT scores, you could contact the testing service and ask for publicly available data. Similarly, the US government puts census data on the web. Other government agencies and many organizations make data available as well. In this day and age, there is more data available at the click of a mouse than one could possibly hope to make use of in a lifetime. • Samples: Sometimes we need to ask a question for which there is no available data. For example, suppose we wanted to find the average height of DePaul undergraduate women. Clearly it would be impossible (or at least time-consuming and costly) to measure every such student at DePaul. For this reason we take a random sample of DePaul undergraduate women, and hope that from there we can infer information about the population of all undergraduate female DePaul students. • The terms ‘infer’, ‘random sample’, and ‘population’ are important. We will discuss each of them in more detail.
What is one TERRIBLE way to obtain data? • Answer: “Anecdotes”. That is, things that “we have heard”, or “are widely known”, or “are common knowledge”, as opposed to data that comes from evidence.
A new 4-letter (o.k., 8-letter) Word: Anecdote Anecdotal evidence refers to people accepting as evidence individual stories/incidents that they have heard about. That is, the kinds of things we hear from friends, case-studies in the press, “crazy coincidences”, etc. Anecdotes are based on selected, individual cases. Yet we tend to remember them because they are often unusual in some way, while we don’t bother remembering the non-coincidences that happen tens of thousands of times every single day. Humans are really, really, REALLY good at “spotting” patterns when, in fact, no pattern actually exists. Key Point: Anecdotal “evidence” is NOT evidence!
Anecdote: • Smoking doesn't cause lung cancer. "My grandmother lived to 95 and smoked like a chimney, and didn't die of lung cancer." • It is certainly true that not all smokers die of lung cancer. However, the vast majority of people who get lung cancer are smokers.
Anecdote: • Homeopathy works! “My aunt had arthritis for 15 years. She went to 8 different specialists, none of whom could cure her. But then she tried bryonia 5ch and within 3 days it had improved.” • People’s medical conditions do change with time. For every person whose condition improves right around the time they start taking an alternative therapy, there are thousands who do not. • As with plane crashes, we humans naturally talk only about the cases that are interesting to us. We aren’t lying, we’re just… human. • Remember that as humans, we are great at finding patterns – even when they either do not exist or are not causal.
Anecdote: • People wearing top hats live longer. Back in the day, this “fact” was supported by a great deal of anecdotal evidence. “My grandpa wore a top hat and lived until 97 years old!” • People who wear top hats are usually richer, and can therefore afford better food, shelter, sanitation, and medical resources. A wider study that “controlled for” (important term!) people's income was easily able to show that the claim was false. • This is a great example of how people who do not know how to recognize a lack of causation are easily fooled! In fact, it is one of the most common ways in which statistics are abused. • For more information on this story, be sure to join us for the 6:00 news…
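The idea of “controlling for” income can be illustrated with a small simulation. This is a minimal sketch with made-up numbers, not data from any real study: the share of wealthy people, the hat-wearing rates, and the lifespans are all assumptions. Lifespan is deliberately generated from income alone, so top hats have no effect by construction.

```python
import random
from statistics import mean

def simulate_people(n=20000, seed=1):
    """Return (is_rich, wears_top_hat, lifespan) tuples.

    All rates and lifespans are invented for illustration. Lifespan depends
    on income only; the hat never enters the lifespan calculation.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        is_rich = rng.random() < 0.20                    # assumed share of wealthy people
        hat_rate = 0.60 if is_rich else 0.05             # assumed: the rich wear top hats more often
        wears_top_hat = rng.random() < hat_rate
        lifespan = rng.gauss(75 if is_rich else 65, 8)   # assumed mean lifespans by income
        people.append((is_rich, wears_top_hat, lifespan))
    return people

def hat_gap(people):
    """Mean lifespan of hat wearers minus mean lifespan of everyone else."""
    with_hat = [life for _, hat, life in people if hat]
    without_hat = [life for _, hat, life in people if not hat]
    return mean(with_hat) - mean(without_hat)

people = simulate_people()
raw_gap = hat_gap(people)                                   # income ignored
gap_among_rich = hat_gap([p for p in people if p[0]])       # income held fixed: rich only
gap_among_poor = hat_gap([p for p in people if not p[0]])   # income held fixed: everyone else

print(f"Raw gap:        {raw_gap:+.1f} years")
print(f"Gap among rich: {gap_among_rich:+.1f} years")
print(f"Gap among poor: {gap_among_poor:+.1f} years")
```

When income is ignored, hat wearers appear to live several years longer. Within each income group the gap shrinks to roughly zero, which is what comparing like with like ("controlling for" income) reveals.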
The plural of anecdote is anecdotes. It is not evidence! Think about the internet “echo chamber”. People with similar interests constantly reinforce their beliefs by restating them to people who typically hold similar views. When was the last time any of us spent real time on websites reading political journals/blogs/etc. from people we don’t agree with?!
Where does data come from? Two main sources are: Available Data, and Samples. SAMPLES: • Frequently, we are interested in analyzing a topic for which there is no available data. In this case, we take a “sample” of observations and hope that from that sample, we can infer information about the rest of the population. • Making sure you get a proper sample is a HUGELY important issue when it comes to setting up a study. There are many issues to think about. For now we will concern ourselves with two in particular: • Sample Size: More is typically better. However, it is not always feasible, and can often be very costly. If a sample size is very small, it can severely limit our ability to draw any meaningful conclusions. • Randomization: This is one of the most important and widely abused aspects of study design, and for this reason we will discuss it separately. For now, it is important to recognize that when choosing a sample, the people (or whichever observations) must be chosen at random and must be representative of the population you are interested in.
Sample Size • Researchers and statisticians love large sample sizes, as the larger the sample size (‘n’), the more confident we are in the results. • However, larger samples are not always practical or even possible. • Suppose you want to test a new cancer drug and you want to enroll 500 patients. Now suppose that the drug costs $200,000 a year (which could happen). This study would almost certainly not be feasible. • Suppose you wish to investigate a very rare form of cancer. It is so rare that there are only 173 cases in the entire country, and only 14 of them are even remotely in your geographic area. Unless you can somehow make the study work remotely (many cannot), you are stuck with an n of 14. • Suppose you were okay with the previous study of 14 people, only to find out that 2 are unwilling to join your study because they can’t commit to the time requirements, and another 4 have other illnesses that prevent them from participating. You are now down to an n of 8.
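To see why a larger n makes us more confident, here is a minimal sketch that draws many samples of each size and measures how much the sample mean bounces around from one sample to the next. The population (mean 64, standard deviation 3) is an assumption chosen for illustration. The sample sizes 14 and 500 come from the examples above, and 8 is the 14 patients minus the 6 who cannot take part.

```python
import random
from statistics import mean, pstdev

def spread_of_sample_means(n, trials=1000, seed=0):
    """Standard deviation of the sample mean across many repeated samples of size n."""
    rng = random.Random(seed)
    sample_means = []
    for _ in range(trials):
        sample = [rng.gauss(64, 3) for _ in range(n)]   # assumed population: mean 64, SD 3
        sample_means.append(mean(sample))
    return pstdev(sample_means)

for n in (8, 14, 100, 500):
    spread = spread_of_sample_means(n)
    print(f"n = {n:3d}: sample means typically vary by about {spread:.2f}")
```

The spread shrinks as n grows, so a single sample of 500 lands much closer to the true value, on average, than a single sample of 14 or 8.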
Making sure your sample is random • Imagine that we are doing a relatively simple study in order to determine the average height of DePaul undergraduate women. • How difficult would it be to get a large sample? • In this case, it would not be very difficult to obtain a large sample. • However, where would you obtain this sample? • At a basketball practice? • At a gymnastics meet? • Both of those places? Neither? • Answer: You would try to spread it out and avoid places where there is clearly a ‘bias’ in favor of a particular height pattern. So ideally, you would sample women from all campuses, at all times of day, and in all majors. At that point you might be reasonably confident that you have chosen a random and representative sample of undergraduate women at DePaul.
Random sampling: • Key point: Individuals are selected at random and no one group is over-represented. Random sampling avoids several potential sources of bias.
A sample that is not random is essentially useless. ‘Nuff said.
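The height example can be sketched in a few lines. Everything here is hypothetical: the population size, the number of basketball players, and the height distributions (in inches) are invented for illustration, not DePaul data.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical population of undergraduate women; all numbers are assumptions.
basketball_players = [rng.gauss(70, 2.5) for _ in range(60)]
other_students = [rng.gauss(64, 2.7) for _ in range(4940)]
population = basketball_players + other_students

population_mean = mean(population)   # the value we are trying to learn

# Random sample: every student is equally likely to be chosen.
random_sample_mean = mean(rng.sample(population, 100))

# Sample taken at basketball practice: only players can be chosen.
practice_sample_mean = mean(rng.sample(basketball_players, 30))

print(f"Population mean:          {population_mean:.1f}")
print(f"Random sample mean:       {random_sample_mean:.1f}")
print(f"Basketball practice mean: {practice_sample_mean:.1f}")
```

The random sample lands near the population mean, while the basketball-practice sample overshoots by several inches no matter how carefully the measurements are taken.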
Population versus Sample • Population: The entire group of individuals in which we are interested but can’t usually assess directly. • A parameter is a number describing a characteristic of the population. • Sample: The part of the population we actually examine. • A statistic is a number describing a characteristic of a sample.
Population versus Sample • Population: The entire group of individuals in which we are interested but can’t usually assess directly. Examples: • Income of all working-age people in California • Height of all DePaul undergraduate women • Length of all male crickets • Sample: The part of the population we actually examine. Examples: • We sample 200 working-age people in California • We sample 30 DePaul undergraduate women • We sample 150 male crickets
Population vs Sample • A political scientist wants to know what percentage of college students consider themselves conservatives. • An automaker hires a market research firm to learn what % of adults 18-35 recall seeing TV ads for a new SUV. • Government economists want to know about average household income in Chicago. • It would be impossible to ask these questions of every single college student / adult / household. Instead, we ask a sample of college students / adults / households. • The population refers to the entire group that we want information about • The sample is the small section of the population that we actually examine • The GOAL of a study is to take the information we derive from the sample, and to generalize it, i.e. to “infer” information about the entire population. • Identify the population for the three examples mentioned above: • All college students. • All adults aged 18-35 years old • All households in Chicago. • However, this is sloppy. Do we mean greater Chicago? Are we including both inner-city Chicago and the Gold Coast? If we do, we are basically looking at two very different groups! (Recall from our discussion on categorical variables in scatterplots: When you have different groups which are likely to have their own unique dataset, you should plot them separately).
BIAS • It may not always be intentional, but it’s always there!
If you’re biased and you know it… • Biases are everywhere • It is very important to be aware of the different types of bias and where they tend to show up.
Examples of bias seen in sampling methods 1. Convenience sampling: Just ask whoever is around. • Example: “Man on the street” survey (cheap, convenient, often quite opinionated or emotional, and so now very popular with TV “journalism”) • Which men, and on which street? • Ask about gun control or legalizing marijuana “on the street” in Berkeley vs. rural Texas and you would get wildly different results. • Even within an area, answers would probably differ if you did the survey outside a high school or a country western bar. • Bias: Opinions limited to individuals who are present.
2. Voluntary Response Sampling: • Individuals choose to be involved. These samples are very susceptible to bias because people differ in how motivated they are to respond. Often called “public opinion polls,” these are not considered valid or scientific. • Bias: Sample design systematically favors a particular outcome. Bias present? Ann Landers summarizing responses of her readers: “70% of (10,000) parents wrote in to say that having kids was not worth it: if they had to do it over again, they wouldn’t.” Bias: Most letters to newspapers are written by disgruntled people. A later sample found the exact opposite result! Incidentally, it turned out that this later sample was also very flawed.
Online surveys – Is there a bias? Answer: Voluntary response bias. People have to care enough about an issue to bother replying. This sample is probably a combination of people who hate “wasting the taxpayers’ money” and “animal lovers.”
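A write-in result like the Ann Landers one can arise even when most people feel the opposite way. The sketch below is a hypothetical simulation: the true regret rate (10%) and the chances that a regretful or a content parent bothers to write in are assumptions chosen for illustration, not estimates of the real figures.

```python
import random

def simulate_write_in(n_parents=200_000, true_regret_rate=0.10,
                      p_write_if_regret=0.05, p_write_if_content=0.002, seed=42):
    """Compare a voluntary write-in poll with a simple random sample.

    Returns (share of letters expressing regret, share of a random sample expressing regret).
    Every rate is an assumption made up for this illustration.
    """
    rng = random.Random(seed)
    regrets = [rng.random() < true_regret_rate for _ in range(n_parents)]

    # Voluntary response: each parent decides whether to write in,
    # and unhappy parents are assumed to be far more motivated.
    letters = [r for r in regrets
               if rng.random() < (p_write_if_regret if r else p_write_if_content)]

    # Simple random sample: 1,000 parents chosen without regard to their opinion.
    srs = rng.sample(regrets, 1000)

    return sum(letters) / len(letters), sum(srs) / len(srs)

write_in_share, srs_share = simulate_write_in()
print(f"Share regretting kids among letter writers: {write_in_share:.0%}")
print(f"Share regretting kids in a random sample:   {srs_share:.0%}")
```

Under these assumptions the letters are dominated by the unhappy minority, while the random sample stays close to the true rate.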
Common biases you should be able to identify: • Nonresponse Bias: People who feel they have something to hide or who don’t like their privacy being invaded probably won’t answer. Yet they are absolutely part of the population under study! • Remember that the most important objective of a good sample is for that sample to accurately represent the population. • Response Bias: Fancy term for lying. This is particularly important when the questions are very personal (e.g., “How much do you drink?”) • Wording Bias (wording effects): Questions worded like “Do you agree that it is awful that…” are prompting you to give a particular response. • Selection Bias: an important one – upcoming slide… • Etc., etc.: Bias can show up in all kinds of unexpected ways (and not all of them have names).
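Nonresponse bias can be put into simple arithmetic. The helper below is a sketch (the function name and the example rates are hypothetical): if one group answers a survey less often than everyone else, its share among the respondents is smaller than its share of the population.

```python
def observed_share(true_share, response_rate_in_group, response_rate_others):
    """Share of respondents who belong to a group, given unequal response rates."""
    responders_in_group = true_share * response_rate_in_group
    responders_others = (1 - true_share) * response_rate_others
    return responders_in_group / (responders_in_group + responders_others)

# Assumed: 30% of people are heavy drinkers; 20% of them answer, versus 60% of everyone else.
print(f"{observed_share(0.30, 0.20, 0.60):.3f}")

# With equal response rates the survey recovers the true share.
print(f"{observed_share(0.30, 0.50, 0.50):.3f}")
```

In the first case the survey would report about 12.5% heavy drinkers when the assumed true figure is 30%. Collecting more responses does not fix this, because the people who stay silent differ from the people who answer.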
Selection Bias This is a VERY common form of bias. It occurs when the group that is sampled has something in common that relates to the issue under consideration. • Example: You are conducting a poll to determine whether taxpayer dollars should be used to improve Wrigley Field. The pollsters randomly sample people from ‘The Cubby Bear’ (a popular Cubs bar), and outside the White Sox convention at the Palmer House. In both cases, there will be a selection bias, albeit likely with different results. • Example: The majority of people who are asked about their experiences with psychics report positive results. This is a selection bias, since the people asked are motivated to have their beliefs validated. • Example: People who review products they have purchased online tend to give positive reviews, for the same reason as in the example above.
Another sampling biggie: Undercoverage. This occurs when parts of the population are left out in the process of choosing the sample. Because the U.S. Census goes “house to house,” homeless people are not represented. Illegal immigrants also avoid being counted. Geographical districts with a lack of coverage tend to be poor. Representatives from wealthy areas typically oppose statistical adjustment of the census. Historically, many clinical trials avoided including women because of their periods and the chance of pregnancy. As a result, many medical treatments were not appropriately tested for women. This problem is slowly being recognized and addressed.
To assess the opinion of students at the Ohio State University about campus safety, a reporter interviews 15 students he meets walking on the campus late at night who are willing to give their opinion. What is the sample here? What is the population? Is there significant bias present? Options: • All those students walking on campus late at night • All students at universities with safety issues • The 15 students interviewed • All students approached by the reporter Answers: • Sample: The 15 students interviewed. • Target population: All Ohio State students. • Selection Bias: People who feel safe are more likely to walk out at night. People who don’t feel safe probably won’t do so as often, so they would be under-represented in the sample. • Possible Non-Response Bias: It is entirely possible that some people would hurry away or refuse to answer if someone approached them with a question at night. • Others?
Example: An SRS (simple random sample) of 1200 adult Americans is selected and asked: “In light of the huge national deficit, should the government at this time spend additional money to establish a national system of health insurance?” Thirty-nine percent of those responding answered yes. • What can you say about this survey? • If it is truly a random sample, then we are being told that the sampling process is relatively free from bias. However, in this case, the wording is biased. The results probably understate the percentage of people who do favor a system of national health insurance.
Assuming that this program truly did randomly sample ‘likely voters’ (as opposed to ‘likely voters who watch this particular program’), then this is a very reasonable poll. If you don’t like the results you find however…. Selection Bias. This is an egregious example, in that the selection bias was intentionally created by the pollsters.
Let’s play: Find the Bias. What toothpaste do people prefer? • Experiment: In order to determine which brand of toothpaste Americans prefer, researchers wait outside of Whole Foods Market and ask everyone who bought toothpaste which brand they preferred. What are some biases present (or potentially present) in this experiment? • Potential Biases? • Whole Foods is an upscale market. Many shoppers are from a higher income bracket and some will buy ‘boutique’ products (including toothpaste!) • Colgate is on sale • Crest just had an advertising blitz during the Super Bowl • Oprah mentioned in an interview that she likes Aquafresh