Learn about the importance of sampling in statistical data collection, including how to choose a representative sample and the impact of sampling frames on survey results.
13 Collecting Statistical Data
13.1 The Population
13.2 Sampling
13.3 Random Sampling
13.4 Sampling: Terminology and Key Concepts
13.5 The Capture-Recapture Method
13.6 Clinical Studies
A Survey
The practical alternative to a census is to collect data only from some members of the population and use that data to draw conclusions and make inferences about the entire population. Statisticians call this approach a survey (or a poll when the data collection is done by asking questions). The subgroup chosen to provide the data is called the sample, and the act of selecting a sample is called sampling.
Ideally, every member of the population should have an opportunity to be chosen as part of the sample, but this is possible only if we have a mechanism to identify each and every member of the population. In many situations this is impossible. Say we want to conduct a public opinion poll before an election. The population for the poll consists of all voters in the upcoming election, but how can we identify who is and is not going to vote ahead of time? We know who the registered voters are, but among this group there are still many nonvoters.
The first important step in a survey is to distinguish the population to which the survey applies (the target population) from the actual subset of the population from which the sample will be drawn, called the sampling frame. The ideal scenario is when the sampling frame is the same as the target population; that would mean that every member of the target population is a candidate for the sample. When this is impossible (or impractical), an appropriate sampling frame must be chosen.
Example 13.5 Sampling Frames Can Make a Difference
A CNN/USA Today/Gallup poll conducted right before the November 2, 2004, national election asked the following question: “If the election for Congress were being held today, which party’s candidate would you vote for in your congressional district, the Democratic Party’s candidate or the Republican Party’s candidate?”
When the question was asked of 1866 registered voters nationwide, the results of the poll were 49% for the Democratic Party candidate, 47% for the Republican Party candidate, and 4% undecided. When exactly the same question was asked of 1573 likely voters nationwide, the results of the poll were 50% for the Republican Party candidate, 47% for the Democratic Party candidate, and 3% undecided.
Clearly, one of the two polls had to be wrong, because in the first poll the Democrats beat out the Republicans, whereas in the second poll it was the other way around. The only significant difference between the two polls was the choice of the sampling frame: in the first poll the sampling frame used was all registered voters, and in the second poll the sampling frame used was all likely voters.
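To make the reversal concrete, the short Python sketch below tabulates the figures reported for the two polls and computes the Democratic lead in each, in percentage points. The dictionary layout and the function name `democratic_lead` are illustrative choices, not part of the original example.

```python
# Poll figures from Example 13.5 (percentages of respondents).
polls = {
    "registered voters": {"n": 1866, "Democratic": 49, "Republican": 47, "undecided": 4},
    "likely voters": {"n": 1573, "Democratic": 47, "Republican": 50, "undecided": 3},
}

def democratic_lead(result):
    """Democratic share minus Republican share, in percentage points."""
    return result["Democratic"] - result["Republican"]

# A positive lead means the Democratic candidate is ahead; negative, the Republican.
leads = {frame: democratic_lead(result) for frame, result in polls.items()}

for frame, lead in leads.items():
    leader = "Democratic" if lead > 0 else "Republican"
    print(f"{frame}: {leader} candidate leads by {abs(lead)} points")
```

The sign of the lead flips between the two sampling frames even though the question asked was identical.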
Although neither one faithfully represents the target population of actual voters, using likely voters instead of registered voters for the sampling frame gives much more reliable data. (The second poll predicted very closely the average results of the 2004 congressional races across the nation.) So, why don’t all pre-election polls use likely voters as a sampling frame instead of registered voters?
The answer is economics. Registered voters are relatively easy to identify: every county registrar can produce an accurate list of registered voters. Not every registered voter votes, though, and it is much harder to identify those who are “likely” to vote. Typically, one has to look at demographic factors (age, ethnicity, etc.) as well as past voting behavior to figure out who is likely to vote and who isn’t. Doing that takes a lot more effort, time, and money.
Sampling
The basic philosophy behind sampling is simple and well understood: if we have a sample that is “representative” of the entire population, then whatever we want to know about a population can be found out by getting the information from the sample. If we are to draw reliable data from a sample, we must (a) find a sample that is representative of the population, and (b) determine how big the sample should be. These two issues go hand in hand, and we will discuss them next.
Sometimes a very small sample can be used to get reliable information about a population, no matter how large the population is. This is the case when the population is highly homogeneous. The more heterogeneous a population gets, the more difficult it is to find a representative sample. The difficulties can be well illustrated by taking a look at the history of public opinion polls.
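The effect of homogeneity can be seen in a small simulation. The sketch below is only an illustration under assumed conditions: both populations are hypothetical (1 stands for a "yes" answer, 0 for a "no"), and the sample size, number of trials, and random seed are arbitrary choices.

```python
import random
import statistics

def sample_estimates(population, sample_size, trials, seed=0):
    """Draw repeated samples and return the sample mean from each trial."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]

# Hypothetical populations: everyone agrees vs. an even split of opinion.
homogeneous = [1] * 10_000
heterogeneous = [1] * 5_000 + [0] * 5_000

# Spread of the estimates across 200 tiny samples of 5 people each.
homog_spread = statistics.pstdev(sample_estimates(homogeneous, 5, 200))
heterog_spread = statistics.pstdev(sample_estimates(heterogeneous, 5, 200))

print("homogeneous spread:", homog_spread)
print("heterogeneous spread:", heterog_spread)
```

In the homogeneous population every sample gives the same estimate, so even five people are enough; in the split population the estimates from such small samples vary from trial to trial.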
CASE STUDY 2 THE 1936 LITERARY DIGEST POLL
The U.S. presidential election of 1936 pitted Alfred Landon, the Republican governor of Kansas, against the incumbent Democratic President, Franklin D. Roosevelt. At the time of the election, the nation had not yet emerged from the Great Depression, and economic issues such as unemployment and government spending were the dominant themes of the campaign.
The Literary Digest, one of the most respected magazines of the time, conducted a poll a couple of weeks before the election. The magazine had used polls to accurately predict the results of every presidential election since 1916, and its 1936 poll was the largest and most ambitious poll ever. The sampling frame for the Literary Digest poll consisted of an enormous list of names that included:
(1) every person listed in a telephone directory anywhere in the United States, (2) every person on a magazine subscription list, and (3) every person listed on the roster of a club or professional association. From this sampling frame a list of about 10 million names was created, and every name on this list was mailed a mock ballot and asked to mark it and return it to the magazine.
Based on the poll results, the Literary Digest predicted a landslide victory for Landon with 57% of the vote, against Roosevelt’s 43%. Amazingly, the election turned out to be a landslide victory for Roosevelt with 62% of the vote, against 38% for Landon. The difference between the poll’s prediction and the actual election results was a whopping 19 percentage points, the largest error ever in a major public opinion poll.
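The size of the miss follows directly from the figures above. This minimal Python sketch subtracts the actual vote share from the predicted share for each candidate; the variable names are illustrative.

```python
# Predicted and actual vote shares (percent) in the 1936 election.
predicted = {"Landon": 57, "Roosevelt": 43}
actual = {"Landon": 38, "Roosevelt": 62}

# Prediction error per candidate, in percentage points
# (positive = the poll overestimated that candidate).
errors = {name: predicted[name] - actual[name] for name in predicted}

for name, error in errors.items():
    print(f"{name}: predicted {predicted[name]}%, actual {actual[name]}%, error {error:+d} points")
```

Landon's share was overestimated by the same amount that Roosevelt's was underestimated, since the two shares sum to 100% in both the prediction and the result.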
For the same election, a young pollster named George Gallup was able to predict accurately a victory for Roosevelt using a sample of “only” 50,000 people. In fact, Gallup also publicly predicted, to within 1%, the incorrect results that the Literary Digest would get, using a sample of just 3000 people taken from the same sampling frame the magazine was using. What went wrong with the Literary Digest poll, and why was Gallup able to do so much better?
The first thing seriously wrong with the Literary Digest poll was the sampling frame, consisting of names taken from telephone directories, lists of magazine subscribers, rosters of club members, and so on. Telephones in 1936 were something of a luxury, and magazine subscriptions and club memberships even more so, at a time when 9 million people were unemployed.
When it came to economic status, the Literary Digest sample was far from being a representative cross section of the voters. This was a critical problem, because voters often vote on economic issues, and given the economic conditions of the time, this was especially true in 1936.
When the choice of the sample has a built-in tendency (whether intentional or not) to exclude a particular group or characteristic within the population, we say that a survey suffers from selection bias. It is obvious that selection bias must be avoided, but it is not always easy to detect it ahead of time. Even the most scrupulous attempts to eliminate selection bias can fall short.
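A rough way to see how selection bias distorts a result is to compare a weighted average over the true population with the same average over a skewed sampling frame. Every number in the sketch below is hypothetical and chosen only for illustration; none of it is historical data, and the group labels and function name are assumptions.

```python
# All numbers below are hypothetical, chosen only to illustrate selection bias.

def overall_support(group_shares, group_support):
    """Weighted average of each group's support for a candidate."""
    return sum(group_shares[g] * group_support[g] for g in group_shares)

# Assumed support for a candidate within each economic group.
support_by_group = {"higher income": 0.40, "lower income": 0.70}

# Assumed makeup of the true population vs. a frame that favors one group.
population_shares = {"higher income": 0.30, "lower income": 0.70}
frame_shares = {"higher income": 0.80, "lower income": 0.20}

true_support = overall_support(population_shares, support_by_group)
frame_support = overall_support(frame_shares, support_by_group)

print(f"population: {true_support:.0%}, sampling frame: {frame_support:.0%}")
```

Because the frame overrepresents one group, even a perfectly executed sample drawn from it would estimate the frame's level of support rather than the population's.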
The second serious problem with the Literary Digest poll was the issue of nonresponse bias. In a typical survey it is understood that not every individual is willing to respond to the survey request (and in a democracy we cannot force them to do so). Those individuals who do not respond to the survey request are called nonrespondents, and those who do are called respondents. The percentage of respondents out of the total sample is called the response rate.
For the Literary Digest poll, out of a sample of 10 million people who were mailed a mock ballot, only about 2.4 million mailed a ballot back, resulting in a 24% response rate. When the response rate to a survey is low, the survey is said to suffer from nonresponse bias.
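The response rate is simply the number of respondents divided by the size of the sample. As a minimal sketch using the approximate figures quoted above (the function name `response_rate` is an illustrative choice):

```python
def response_rate(respondents, sample_size):
    """Fraction of the sample that responded to the survey."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return respondents / sample_size

# Approximate Literary Digest figures: 10 million ballots mailed, 2.4 million returned.
rate = response_rate(2_400_000, 10_000_000)
print(f"response rate: {rate:.0%}")
```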
One of the significant problems with the Literary Digest poll was that the poll was conducted by mail. This approach is the most likely to magnify nonresponse bias, because people often consider a mailed questionnaire just another form of junk mail. Of course, given the size of their sample, the Literary Digest hardly had a choice. This illustrates another important point: Bigger is not necessarily better, and a big sample can be more of a liability than an asset.
The Literary Digest story has two morals: (1) You’ll do better with a well-chosen small sample than with a badly chosen large one, and (2) watch out for selection bias and nonresponse bias.
Convenience Sampling
One commonly used shortcut in sampling is known as convenience sampling. In convenience sampling the selection of which individuals are in the sample is dictated by what is easiest or cheapest for the data collector, never mind trying to get a representative sample. A classic example of convenience sampling is when interviewers set up at a fixed location such as a mall or outside a supermarket and ask passersby to be part of a public opinion poll.
A different type of convenience sampling occurs when the sample is based on self-selection: the sample consists of those individuals who volunteer to be in it. Self-selection is the reason why many Area Code 800 polls are not to be trusted. Convenience sampling is not always bad; at times there is no other choice or the alternatives are so expensive that they have to be ruled out.
We should keep in mind, however, that data collected through convenience sampling are naturally tainted and should always be scrutinized (that’s why we always want to get to the details of how the data were collected). More often than not, convenience sampling gives us data that are too unreliable to be of any scientific value. With data, as with so many other things, you get what you pay for.
Quota Sampling
Quota sampling is a systematic effort to force the sample to be representative of a given population through the use of quotas: the sample should have so many women, so many men, so many blacks, so many whites, so many people living in urban areas, so many people living in rural areas, and so on. The proportions in each category in the sample should be the same as those in the population.
If we can assume that every important characteristic of the population is taken into account when the quotas are set up, it is reasonable to expect that the sample will be representative of the population and produce reliable data.
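The proportionality requirement can be sketched as a small allocation routine. This is one possible way to turn population counts into quotas, not a description of how any pollster actually did it: the category labels, the population counts, and the sample size of 200 are all hypothetical, and leftover seats are handed out by largest remainder, which is an assumption of this sketch.

```python
def allocate_quotas(population_counts, sample_size):
    """Split sample_size across categories in proportion to population counts.

    Uses integer arithmetic; any seats lost to rounding down are given to the
    categories with the largest remainders so the quotas sum to sample_size.
    """
    total = sum(population_counts.values())
    quotas = {g: c * sample_size // total for g, c in population_counts.items()}
    remainders = {g: c * sample_size % total for g, c in population_counts.items()}
    leftover = sample_size - sum(quotas.values())
    for group in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[group] += 1
    return quotas

# Hypothetical population broken down by two characteristics.
population_counts = {
    "urban women": 4100,
    "urban men": 3900,
    "rural women": 1000,
    "rural men": 1000,
}

quotas = allocate_quotas(population_counts, 200)
for group, quota in quotas.items():
    print(f"{group}: interview {quota} people")
```

Each category's share of the sample matches its share of the population, but the routine says nothing about which individuals within a category are interviewed.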
CASE STUDY 3 THE 1948 PRESIDENTIAL ELECTION
George Gallup had introduced quota sampling as early as 1935 and had used it successfully to predict the winner of the 1936, 1940, and 1944 presidential elections. Quota sampling thus acquired the reputation of being a “scientifically reliable” sampling method, and by the 1948 presidential election all three major national polls (the Gallup poll, the Roper poll, and the Crossley poll) used quota sampling to make their predictions.
For the 1948 election between Thomas Dewey and Harry Truman, Gallup conducted a poll with a sample of approximately 3250 people. Each individual in the sample was interviewed in person by a professional interviewer to minimize nonresponse bias, and each interviewer was given a very detailed set of quotas to meet: for example, 7 white males under 40 living in a rural area, 5 black males over 40 living in a rural area, 6 white females under 40 living in an urban area, and so on.
By the time all the interviewers met their quotas, the entire sample was expected to accurately represent the entire population in every respect: gender, race, age, and so on. Based on his sample, Gallup predicted that Dewey, the Republican candidate, would win the election with 49.5% of the vote to Truman’s 44.5% (with third-party candidates Strom Thurmond and Henry Wallace accounting for the remaining 6%).
The Roper and Crossley polls also predicted an easy victory for Dewey. The actual results of the election turned out to be almost the exact reverse of Gallup’s prediction: Truman got 49.9% and Dewey 44.5% of the national vote. Truman’s victory was a great surprise to the nation as a whole. So convinced was the Chicago Daily Tribune of Dewey’s victory that it went to press on its early edition for November 3, 1948, with the headline “Dewey Defeats Truman.”
The picture of Truman holding aloft a copy of the Tribune and his famous retort “Ain’t the way I heard it” have become part of our national folklore.
To pollsters and statisticians, the erroneous predictions of the 1948 election had two lessons: (1) Poll until election day, and (2) quota sampling is intrinsically flawed. What’s wrong with quota sampling? After all, the basic idea behind it appears to be a good one: Force the sample to be a representative cross section of the population by having each important characteristic of the population proportionally represented in the sample.
The difficulty lies in deciding which characteristics count as important: where do we stop? No matter how careful we might be, we might miss some criterion that would affect the way people vote, and the sample could be deficient in this regard. An even more serious flaw in quota sampling is that, other than meeting the quotas, the interviewers are free to choose whom they interview. This opens the door to selection bias. Looking back over the history of quota sampling, we can see a clear tendency to overestimate the Republican vote.
In 1936, using quota sampling, Gallup predicted that the Republican candidate would get 44% of the vote, but the actual number was 38%. In 1940, the prediction was 48%, and the actual vote was 45%; in 1944, the prediction was 48%, and the actual vote was 46%. Gallup was able to predict the winner correctly in each of these elections, mostly because the spread between the candidates was large enough to cover the error. In 1948, Gallup (and all the other pollsters) simply ran out of luck.
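The pattern in these three elections can be checked with a few lines of Python. The sketch below uses only the predicted and actual Republican percentages quoted above; the variable names are illustrative.

```python
# Gallup's predicted vs. actual Republican vote share (percent), by election year.
gallup_republican = {
    1936: (44, 38),
    1940: (48, 45),
    1944: (48, 46),
}

# Overestimate in percentage points (positive = Republican vote overestimated).
overestimates = {
    year: predicted - actual
    for year, (predicted, actual) in gallup_republican.items()
}

for year, error in overestimates.items():
    print(f"{year}: Republican vote overestimated by {error} points")

# True only if the error ran in the same direction in every election.
always_overestimated = all(error > 0 for error in overestimates.values())
print("overestimated every time:", always_overestimated)
```

An error that always runs in the same direction points to a systematic bias in the method rather than to chance.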
It was time to ditch quota sampling. The failure of quota sampling as a method for getting representative samples has a simple moral: Even with the most carefully laid plans, human intervention in choosing the sample can result in selection bias.