GATHERING AND PRODUCING DATA

How Data are Obtained • Census • Everyone is included • Observational Study • Observes individuals and measures variables but does not attempt to influence responses • Includes surveys and polls • Experiment • Deliberately imposes some treatment on individuals in order to observe their responses • In medicine, this is called a clinical trial

3 BIG ideas • Examine a part of the whole: take a sample from a population • Randomization ensures the sample is representative • The size of the sample is what’s important, not the size of the population

Big Idea #1: Examine Part of the Whole • We are studying an entire population of individuals (or subjects), but looking at everyone is practically impossible. • How many support the U.S. role in Iraq? • What percent of the tomato shipment is bad? • How many children are obese? • What’s the price of gas at the pump across Minnesota? • Settle for looking at a smaller group—a sample—selected from the population. • Sampling is natural! Think about cooking. You taste (sample) a small part to get an idea about the dish as a whole.

Populations and parameters, samples and statistics (This stuff is important!) • A parameter is a numerical quantity that describes a population. • A statistic is a numerical quantity that describes the sample. • We study a population by looking at a sample. We infer about a parameter by using statistics from the sample. • Notation: use Greek letters for parameters and Latin letters for statistics

Example: Polling Minneapolis Star Tribune: “A Gallup Poll, conducted Aug. 16-18, 1999, asked, ‘Do you consider pro-wrestling to be a sport, or not?’ Of the people polled, 19% said, ‘Yes.’ (Results were based on telephone interviews with a randomly selected national sample of 1,028 adults, 18 years and older.)” • What’s the population, parameter, sample, statistic? • Population: Americans, 18 years and older • Sample: The 1,028 people who were polled • Parameter: The proportion of American adults who believe pro-wrestling is a sport. (Called the population proportion.) p = ? • Statistic: The proportion of people in the sample who said they believe pro-wrestling is a sport. (Called the sample proportion.) p̂ = 0.19

Example: Surveying a lot shipment A carload of ball bearings has an average diameter of 2.502 centimeters. This is within the specifications for acceptance of the lot by the purchaser. An inspector happens to inspect 100 bearings from the lot and finds the average diameter of these to be 2.499 cm. This is within the specified limits, so the entire lot is accepted. • What’s the population, parameter, sample, statistic? • Population: The carload of ball bearings • Sample: The 100 ball bearings that were inspected • Parameter: The average diameter of the ball bearings in the carload. µ = 2.502 cm (The population mean.) • Statistic: The average diameter of the 100 ball bearings in the sample. x̄ = 2.499 cm (The sample mean.)
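The parameter/statistic distinction in the two examples above can be sketched in Python. This is only an illustration with synthetic data: the 10,000 diameters, the 0.01 cm spread, and the seed are assumptions invented for the sketch, not figures from the inspection.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical carload: 10,000 bearings centered near 2.502 cm (invented data).
population = [rng.gauss(2.502, 0.01) for _ in range(10_000)]

# Parameter: describes the whole population (usually unknown in practice).
mu = statistics.mean(population)

# Statistic: describes only the 100 bearings the inspector looks at.
sample = rng.sample(population, 100)
x_bar = statistics.mean(sample)

print(f"population mean (parameter) = {mu:.4f} cm")
print(f"sample mean (statistic)     = {x_bar:.4f} cm")
```

The two numbers will be close but not identical, which is the point: the statistic estimates the parameter.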

Big Idea #2: Randomization • Randomization makes sure that on average the sample looks like the rest of the population. • Randomization makes it possible to use quantitative tools (probability) to draw inferences about the population when we see only a sample. • Randomization protects against bias.

“Who will you vote for in 2008?” Some examples of biased samples • 100 people at the Mall of America • 100 people in front of the Metrodome after a Twins game • 100 friends, family and relatives • 100 people who volunteered to answer a survey question on your web site • 100 people who answered their phone during supper time • The first 100 people you see after you wake up in the morning

Bias – the bane of sampling • Samples that systematically misrepresent individuals in the population are said to be biased. • Bias is the systematic failure of a sample to represent its population • There is usually no way to fix a biased sample and no way to salvage useful information from it. • The best way to avoid bias is to select individuals for the sample at random. The value of deliberately introducing randomness is one of the great insights of Statistics.

Simple Random Sample (SRS) • Suppose we want to draw a sample of size n from some population • For a simple random sample, every possible subset of size n has an equal chance to be selected and to become the sample. • Such samples guarantee that each individual has an equal chance of being selected. • Each combination of people also has an equal chance of being selected. • The sampling frame is a list of the population from which the sample is drawn. From the sampling frame, we can choose a SRS using random numbers.
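Choosing an SRS from a sampling frame using random numbers might look like the following. It is a minimal sketch: the frame of 500 student IDs and the helper name `simple_random_sample` are hypothetical, made up for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct individuals so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)  # sampling without replacement

# Hypothetical sampling frame: a list of 500 student IDs.
frame = [f"student_{i:03d}" for i in range(1, 501)]

srs = simple_random_sample(frame, 10, seed=42)
print(srs)
```

`random.sample` draws without replacement, so no individual can appear in the sample twice.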

SRS and Sampling Variability • Samples drawn at random generally differ from one another. • These differences lead to different values for the variables we measure. • Sample-to-sample differences are called sampling variability • This is different from bias! • Example: Everyone pick 10 Skittles at random from “The Bowl” and count how many reds. • The variability of the different sample counts is sampling variability. • If half the class peeked and tried to get more reds the differences would reflect bias.
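The Skittles exercise can be simulated to see sampling variability directly. The sketch below assumes a bowl of 1,000 Skittles of which 20% are red, a class of 30, and that each handful is returned to the bowl; all of these numbers are invented.

```python
import random

rng = random.Random(1)

# Hypothetical bowl: 1,000 Skittles, 200 of them red (invented composition).
bowl = ["red"] * 200 + ["other"] * 800

def reds_in_handful(rng, bowl, size=10):
    """Pick `size` Skittles at random and count the reds."""
    handful = rng.sample(bowl, size)
    return handful.count("red")

# 30 students each take a random handful of 10 from the full bowl.
counts = [reds_in_handful(rng, bowl) for _ in range(30)]
print(counts)
```

Every student samples from the same bowl in the same way, yet the counts differ from student to student. That spread is sampling variability, not bias.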

Sources of sampling error • In the context of using a sample to estimate a population parameter, sampling variability is sometimes called “sampling error.” • Taking a SRS of 3 students to estimate the average height of all students will have a large sampling error, but it is not biased. • Taking a sample of 300 basketball players to estimate the average height of all students will produce less variability but the sample is biased.
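The contrast between a small unbiased sample and a large biased one can be sketched with simulated heights. The population (10,000 students, mean 170 cm, standard deviation 10 cm) and the rule that "basketball players" are the students taller than 185 cm are assumptions invented for this illustration.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical student heights in cm (invented population).
students = [rng.gauss(170, 10) for _ in range(10_000)]
mu = statistics.mean(students)

# Stand-in for basketball players: the tall students only (an assumption).
players = [h for h in students if h > 185]

# Repeat each sampling scheme 500 times and record the sample means.
srs_means = [statistics.mean(rng.sample(students, 3)) for _ in range(500)]
player_means = [statistics.mean(rng.sample(players, 300)) for _ in range(500)]

print(f"true mean:             {mu:.1f}")
print(f"SRS of 3:      center {statistics.mean(srs_means):.1f}, "
      f"spread {statistics.stdev(srs_means):.1f}")
print(f"300 players:   center {statistics.mean(player_means):.1f}, "
      f"spread {statistics.stdev(player_means):.1f}")
```

The SRS estimates scatter widely but center on the true mean. The basketball estimates agree closely with one another but center on the wrong value.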

More complex sampling designs • Simple random sampling is not the only way to sample. • More complicated designs may save time or money or help avoid sampling problems. • Stratified sampling • Cluster sampling • Systematic sampling • Multi-stage sampling • All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample.

Stratified sampling • Suppose we want a sample of 240 Carleton students • We also want to ensure every discipline is represented • The student body divides as • Arts and Literature 20% • Humanities 15% • Social Sciences 30% • Mathematics and Natural Sciences 35% • For the sample, select • 240 x .20 = 48 Arts and Literature students • 240 x .15 = 36 Humanities students • 240 x .30 = 72 Social Sciences students • 240 x .35 = 84 Mathematics and Natural Sciences students • Within each discipline, choose a SRS
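The proportional allocation above, followed by an SRS within each discipline, can be sketched as follows. The percentages and the sample size of 240 come from the slide; the roster of 2,000 students is hypothetical and exists only so there is something to sample from.

```python
import random

shares = {
    "Arts and Literature": 0.20,
    "Humanities": 0.15,
    "Social Sciences": 0.30,
    "Mathematics and Natural Sciences": 0.35,
}
n = 240

# Proportional allocation: each stratum gets its share of the sample.
allocation = {name: round(n * share) for name, share in shares.items()}
print(allocation)

# Hypothetical roster of 2,000 students split by the same shares (invented).
roster = {
    name: [f"{name} #{i}" for i in range(round(2000 * share))]
    for name, share in shares.items()
}

# SRS within each stratum, then combine.
rng = random.Random(3)
sample = {name: rng.sample(roster[name], allocation[name]) for name in shares}
total = sum(len(group) for group in sample.values())
print(total)
```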

Stratified Sampling • The population is divided into homogeneous groups, called strata, before the sample is selected. • Then simple random sampling is used within each stratum before the results are combined. • Advantages • Sample will be representative for the strata • Reduces sampling variability • Disadvantages • May be logistically difficult if even possible to implement • Must have information about the population • Note: a stratified sample is not a SRS

Cluster sampling • Sometimes stratifying isn’t practical and simple random sampling is difficult. Splitting the population into clusters can make sampling more practical. • Suppose you want to do a face-to-face survey of attitudes in Minnesota based on a sample of size 600. • Choosing 600 people at random, finding their addresses, and meeting them in person is costly and time-consuming. • Another idea: Choose some cities at random. Then some streets at random, and then some blocks at random. Interview everyone on the selected blocks. • The blocks are the clusters. • If you know there are about 20 people per block, choose a random sample of 30 blocks to reach about 600 people.
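The block-level step of this design can be sketched in a few lines. For simplicity the sketch skips the city and street stages and assumes a hypothetical list of 1,000 blocks with exactly 20 residents each; both numbers are invented.

```python
import random

rng = random.Random(4)

# Hypothetical frame of clusters: 1,000 blocks, 20 residents on each (invented).
blocks = {
    b: [f"block{b}_person{p}" for p in range(20)]
    for b in range(1000)
}

# Randomly choose 30 blocks (the clusters), not 600 individual people.
chosen = rng.sample(sorted(blocks), 30)

# Interview everyone on the chosen blocks.
interviewed = [person for b in chosen for person in blocks[b]]
print(len(chosen), len(interviewed))
```

The randomness is applied to blocks rather than to people, which is what makes the fieldwork cheaper.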

Cluster sampling in the news: The Lancet study on Iraq casualties • In October 2006, The Lancet published “Iraq mortality after the 2003 invasion: a cross-sectional cluster sample survey” • The study was controversial because of its findings that hundreds of thousands of Iraqis (most likely about 650,000) had been killed since the U.S. invasion. • Earlier reports, including those from the U.S. and British governments, had put the number at about 30,000. • The study was based on cluster sampling, a common methodology in public health and human rights work • The clusters were groups of 40 houses in close proximity whose locations were chosen based on population demographics.

Cluster Sampling • If each cluster fairly represents the population, cluster sampling will give an unbiased sample. • Advantage • Easier to implement depending on context • Disadvantage • Greater sampling variability, so less statistical accuracy

Multistage Sampling • Most surveys conducted by the government or professional polling organizations use some combination of stratified and cluster sampling as well as simple random sampling. • The Current Population Survey is how the government estimates the unemployment rate • Counties are divided into 2,007 Primary Sampling Units (PSUs) • PSUs are divided into smaller census blocks, and the blocks are grouped into strata. Households in each block are grouped into clusters of about 4 households each • The final sample consists of these clusters, and interviewers go to all households in the chosen clusters.

Systematic Samples • Sometimes we draw a sample by selecting individuals systematically. • For example, you might survey every 10th person on an alphabetical list of students. • To make it random, you must still start the systematic selection from a randomly selected individual. • When there is no reason to believe that the order of the list could be associated in any way with the responses sought, systematic sampling can give a representative sample. • Systematic sampling can be much less expensive than true random sampling.
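A systematic sample with a random starting point can be sketched like this. The alphabetical list of 200 students and the helper name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Take every k-th individual, starting from a random position in the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random start: 0, 1, ..., k-1
    return frame[start::k]

# Hypothetical alphabetical list of 200 students.
students = sorted(f"student_{i:03d}" for i in range(200))

picked = systematic_sample(students, 10, seed=5)
print(picked)
```

Only the starting point is random. Once it is chosen, the rest of the sample is fully determined, which is why the order of the list must be unrelated to the responses.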

Sampling Example Hospital administrators are concerned about the possibility of drug abuse among employees. They plan to pick a sample of 40 from 800 employees, and administer a drug test. What’s the sampling strategy? • Randomly select 10 doctors, 10 nurses, 10 office staff, and 10 support staff for the test. • Each employee has a 4-digit ID number. Randomly choose 40 numbers. • At the start of each shift, choose every 20th person who arrives for work. • There are 40 departments of 20 employees each. Randomly choose two departments (say radiology and ER) and test all the people who work in those departments.

Big Idea #3: Sample size is key, not population size • How large a sample size do we need for the sample to be reasonably representative of the population? • In general, it’s the size of the sample, not the size of the population, that makes the difference in sampling. • The fraction of the population that you’ve sampled doesn’t matter. It’s the sample size itself that’s important • Back to cooking: If the soup is mixed enough a tablespoon will suffice, whether you’re “sampling” from a saucepan or from a barrel.

How big a sample? • Most professional polls choose a sample size of about 1,000 people. • These polls report a “margin of error” of about 3%. That means that with “high confidence” their estimates are within 3% of the true population parameter value. • The margin of error for a sample of 1,000 people is the same for Minneapolis (pop. 400,000), Minnesota (pop. 5 million), and the U.S. (pop. 290 million) • But the bad news is that if you want similar accuracy at Carleton, you need to poll over half the student body. • Coming Attractions: formulas for the margin of error. But you’ll have to wait until we get to Statistical Inference to learn why.
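The claim that a sample of 1,000 gives the same margin of error for all three populations can be checked numerically. The sketch below uses a standard textbook approximation, which is an assumption here rather than the slide's own formula: a 95% margin of error for a proportion with the worst case p = 0.5 and a finite population correction.

```python
import math

def margin_of_error(n, N, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from an SRS of n out of N."""
    se = math.sqrt(p * (1 - p) / n)      # standard error, ignoring population size
    fpc = math.sqrt((N - n) / (N - 1))   # finite population correction
    return z * se * fpc

populations = {
    "Minneapolis": 400_000,
    "Minnesota": 5_000_000,
    "U.S.": 290_000_000,
}
moes = {place: margin_of_error(1000, N) for place, N in populations.items()}

for place, moe in moes.items():
    print(f"{place}: {moe:.3f}")
```

All three come out at about 0.031, roughly 3%. The correction factor is essentially 1 whenever the sample is a tiny fraction of the population, so the population size drops out.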

How to Sample Badly • Advice columnist Ann Landers once asked parents “If you had it to do over again, would you have children?” • Do you think responses were representative of public opinion? • Over 100,000 people responded, and 70% answered “No”! • A later survey, more carefully designed, showed 90% of parents are happy with their decision to have children. • In a voluntary response sample, a large group of individuals is invited to respond, and all who do respond are counted. But such samples are almost always biased toward those with strong opinions or those who are strongly motivated. • Since the sample is not representative, the resulting voluntary response bias invalidates the survey.

What Can Go Wrong? Or, How to Sample Badly • In convenience sampling, we simply include the individuals who are convenient. But they may not be representative of the population. • A psychology professor performs an experiment using his classroom. • A company samples opinions by using its own customers. • Sampling mice from a large cage to study how a drug affects physical activity: The lab assistant reaches into the cage to select the mice one at a time until 10 are chosen. But which mice will likely be chosen?

Other problems • Under-coverage: • In some survey designs a portion of the population is not sampled or has a smaller representation in the sample than it has in the population. • Using telephone directories for a phone survey. • Half the households in large cities are unlisted. • About 5% of households don’t have phones. • Random digit dialing only partially addresses this problem • Misses students in dorms, inmates in prison, soldiers in the military, homeless people. And it’s too expensive to call Hawaii or Alaska. • Non-response • No survey succeeds in getting responses from everyone. • The problem is that those who don’t respond may differ from those who do. • The Bureau of Labor Statistics gets a 6-7% non-response rate. • But it’s common for opinion polls and market research studies to have a 75-80% non-response rate.

What Else Can Go Wrong? • Response bias refers to anything in the survey design that influences the responses • In particular, the wording of a question can have a big impact on the responses.

Some classic statistical mistakes: The Literary Digest Poll • 1936 presidential election: Franklin Delano Roosevelt vs. Alf Landon • The Literary Digest had correctly called every presidential election since 1916 • Sample size: 2.4 million! • They predicted Roosevelt would lose, with only 43% of the vote • In fact it was a landslide for Roosevelt at 62%

Literary Digest poll • Context • Midst of the Great Depression • 9 million unemployed; real income down 1/3 • Landon’s program: “Cut spending” • Roosevelt’s program: “Balance people’s budgets before the government’s budget” • How the polling was done • Survey sent to 10 million people • And 2.4 million responded (that’s huge!)

A huge sample, but The Literary Digest poll was biased • The sampling frame was not representative of the electorate (selection bias) • Based on magazine subscription lists, drivers’ registrations, country club memberships, phone numbers (when telephones were a luxury) • Biased toward better off groups (who were more Republican) • Voluntary response bias • Main issue was the economy • The anti-Roosevelt forces were angry and had a higher response rate!
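A toy simulation shows why a huge biased sample can lose to a small random one. Only the 62% Roosevelt share comes from the slides. The electorate of 200,000 and the chances of ending up in the poll (20% for Roosevelt voters, 50% for Landon voters) are invented to mimic selection and voluntary response bias, and are not estimates of what actually happened in 1936.

```python
import random

rng = random.Random(6)

# Hypothetical electorate of 200,000; True = Roosevelt voter (62%).
voters = [True] * 124_000 + [False] * 76_000

# Biased poll: Landon voters are more likely to be on the lists AND to reply.
# The inclusion chances below are invented for illustration.
biased = [v for v in voters if rng.random() < (0.20 if v else 0.50)]
biased_estimate = sum(biased) / len(biased)

# Small simple random sample from the same electorate.
srs = rng.sample(voters, 1000)
srs_estimate = sum(srs) / len(srs)

print(f"biased poll: n = {len(biased):,}, Roosevelt share = {biased_estimate:.2f}")
print(f"SRS:         n = {len(srs):,}, Roosevelt share = {srs_estimate:.2f}")
```

The biased poll is dozens of times larger than the SRS but predicts the wrong winner, because a bigger sample reduces variability and does nothing about bias.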

The Year the Polls Elected Dewey • 1948 Election: Harry Truman versus Thomas Dewey • Every major poll (including Gallup) predicted Dewey would win by 5 percentage points

What went wrong? • Pollsters chose their samples using quota sampling. Each interviewer was assigned a fixed quota of subjects in certain categories (race, sex, age). • For instance, an interviewer in St. Louis was required to talk to 13 people: • 6 living in the suburbs, 7 in the central city • 7 men and 6 women; of the 7 men (similar for women): • 3 under 40 years old, 4 over 40; 1 black, 6 white. • Within each category, interviewers were free to choose whom to interview. • But this left room for human choice and inevitable bias. • Republicans were easier to reach. They had telephones, permanent addresses, “nicer” neighborhoods. • So interviewers ended up with too many Republicans. • Quota sampling was abandoned for random sampling.

Do you believe the poll? What questions should you ask? • Who carried out the survey? • What is the population? • How was the sample selected? • How large was the sample? • What was the response rate? • How were subjects contacted? • When was the survey conducted? • What are the exact questions asked?

To summarize . . . • We are often interested in a population and some parameter that describes the population. • We select a sample from that population and use a statistic from the sample to estimate the unknown parameter • To obtain a good estimate, the sample must be as representative of the population as possible. And randomization, on average, ensures a representative sample • Possible sources of error are sampling variability and bias. • To reduce sampling variability, take a bigger sample • To reduce bias, get a better sampling design • It’s the sample size, not the population size, that matters
