13 Collecting Statistical Data



  1. 13 Collecting Statistical Data 13.1 The Population 13.2 Sampling 13.3 Random Sampling 13.4 Sampling: Terminology and Key Concepts 13.5 The Capture-Recapture Method 13.6 Clinical Studies

  2. The Population Every statistical statement refers, directly or indirectly, to some group of individuals or objects. In statistical terminology, this collection of individuals or objects is called the population. The first question we should ask ourselves when trying to make sense of a statistical statement is, “What is the population to which the statement applies?”

  3. The Population In an ideal world, the specific population to which a statistical statement applies is clearly identified within the story itself. In the real world, this rarely happens because the details are skipped (mostly to keep the story moving along but sometimes with an intent to confuse or deceive) or, alternatively, because two (or more) related populations are involved in the story.

  4. The N-Value Given a specific population, an obviously relevant question is, “How many individuals or objects are there in that population?” This number is called the N-value of the population. (It is common practice in statistics to use capital N to denote population sizes.) It is important to keep in mind the distinction between the N-value–a number specifying the size of the population–and the population itself.

  5. Example 13.2 The Return of the Bald Eagle: Part 2 Over a period of many years, the United States Fish and Wildlife Service was able to keep a remarkably accurate tally of the number of bald eagle breeding pairs in the contiguous 48 states. (Breeding pairs are used as a proxy for the health of the overall population.) A tremendous amount of effort has gone into collecting and verifying these N-values, which, for a wildlife population, are of remarkable accuracy.

  6. Example 13.3 N Is in the Eye of the Beholder Andy has a coin jar full of quarters. He is hoping that there is enough money in the jar to pay for a new baseball glove. Dad says to go count them, and if there isn’t enough, he will lend Andy the difference. Andy dumps the quarters out of the jar, makes a careful tally, and comes up with a count of 116 quarters.

  7. Example 13.3 N Is in the Eye of the Beholder What is the N-value here? The answer depends on how we define the population. Are we counting coins or money? To Dad, who will end up stuck with all the quarters, the total number of coins might be the most relevant issue. Thus, to Dad, N = 116. Andy, on the other hand, is concerned with how much money is in the jar. If he were to articulate his point of view in statistical language, he would say that N = 29 (dollars).
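
As a small illustration of how the same jar yields two different N-values, the arithmetic can be sketched in Python. The variable names are ours and are not part of the original example.

```python
# Andy's jar, counted two ways. All names here are illustrative only.
QUARTER_VALUE_CENTS = 25   # face value of one U.S. quarter
coin_count = 116           # Andy's tally from the example

n_coins = coin_count                                  # Dad's population: coins
n_dollars = coin_count * QUARTER_VALUE_CENTS // 100   # Andy's population: dollars

print(f"N (coins)   = {n_coins}")
print(f"N (dollars) = {n_dollars}")
```

The jar has not changed between the two lines; only the definition of the population has.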

  8. Data The word data is the plural of the Latin word datum, meaning “something given,” and in ordinary usage has a somewhat broader meaning than the one we will give it in this chapter. For our purposes we will use the word data to mean any type of information packaged in numerical form, and we will adhere to the standard convention that as a noun it can be used in both singular (“the data is…”) and plural (“the data are…”) forms.

  9. Census The process of collecting data by going through every member of the population is called a census. The idea behind a census is simple enough, but in practice a census requires a great deal of “cooperation” from the population. For larger, more dynamic populations (wildlife, humans, etc.), accurate tallies are inherently difficult if not impossible, and in these cases the best we can hope for is a good estimate of the N-value.

  10. Example 13.4 2000 Census Undercounts The most notoriously difficult N-value question around is, “What is the N-value of the national population of the United States?” This is a question the United States Census tries to answer every 10 years–with very little success. The 2000 U.S. Census was the largest single peacetime undertaking of the federal government–it employed over 850,000 people and cost about $6.5 billion–and yet it missed counting between 3 and 4 million people.

  11. Example 13.4 2000 Census Undercounts Given the critical importance of the U.S. Census and given the tremendous resources put behind the effort by the federal government, why is the head count so far off? How can the best intentions and tremendous resources of our government fail so miserably in an activity that on a smaller scale can be carried out by a child trying to buy a baseball glove?

  12. Taking a Census Nowadays, the notion that if we put enough money and effort into it, all individuals living in the United States can be counted like coins in a jar is unrealistic. In 1790, when the first U.S. Census was carried out, the population was smaller and relatively homogeneous, as people tended to stay in one place, and, by and large, they felt comfortable in their dealings with the government. Under these conditions it might have been possible for census takers to count heads accurately.

  13. Taking a Census Today’s conditions are completely different. People are constantly on the move. Many distrust the government. In large urban areas many people are homeless or don’t want to be counted. And then there is the apathy of many people who think of a census form as another piece of junk mail.

  14. Taking a Census If the Census undercount were consistent among all segments of the population, the undercount problem could be solved easily. Unfortunately, the modern U.S. Census is plagued by what is known as a differential undercount. Ethnic minorities, migrant workers, and the urban poor populations have significantly larger undercount rates than the population at large, and the undercount rates vary significantly within these groups.

  15. Taking a Census Using modern statistical techniques, it is possible to make adjustments to the raw Census figures that correct some of the inaccuracy caused by the differential undercount, but in 1999 the Supreme Court ruled in Department of Commerce et al. v. United States House of Representatives et al. that only the raw numbers, and not statistically adjusted numbers, can be used for the purposes of apportionment of Congressional seats among the states.

  16. 13 Collecting Statistical Data 13.1 The Population 13.2 Sampling 13.3 Random Sampling 13.4 Sampling: Terminology and Key Concepts 13.5 The Capture-Recapture Method 13.6 Clinical Studies

  17. A Survey The practical alternative to a census is to collect data only from some members of the population and use that data to draw conclusions and make inferences about the entire population. Statisticians call this approach a survey (or a poll when the data collection is done by asking questions). The subgroup chosen to provide the data is called the sample, and the act of selecting a sample is called sampling.

  18. A Survey Ideally, every member of the population should have an opportunity to be chosen as part of the sample, but this is possible only if we have a mechanism to identify each and every member of the population. In many situations this is impossible. Say we want to conduct a public opinion poll before an election. The population for the poll consists of all voters in the upcoming election, but how can we identify who is and is not going to vote ahead of time? We know who the registered voters are, but among this group there are still many nonvoters.

  19. A Survey The first important step in a survey is to distinguish between the population to which the survey applies (the target population) and the actual subset of the population from which the sample will be drawn, called the sampling frame. The ideal scenario is when the sampling frame is the same as the target population–that would mean that every member of the target population is a candidate for the sample. When this is impossible (or impractical), an appropriate sampling frame must be chosen.

  20. Sampling The basic philosophy behind sampling is simple and well understood–if we have a sample that is “representative” of the entire population, then whatever we want to know about a population can be found out by getting the information from the sample. If we are to draw reliable data from a sample, we must (a) find a sample that is representative of the population, and (b) determine how big the sample should be. These two issues go hand in hand, and we will discuss them next.

  21. Sampling Sometimes a very small sample can be used to get reliable information about a population, no matter how large the population is. This is the case when the population is highly homogeneous. The more heterogeneous a population gets, the more difficult it is to find a representative sample.

  22. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL When the choice of the sample has a built-in tendency (whether intentional or not) to exclude a particular group or characteristic within the population, we say that a survey suffers from selection bias. It is obvious that selection bias must be avoided, but it is not always easy to detect it ahead of time. Even the most scrupulous attempts to eliminate selection bias can fall short.

  23. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL In a typical survey it is understood that not every individual is willing to respond to the survey request (and in a democracy we cannot force them to do so). Those individuals who do not respond to the survey request are called nonrespondents, and those who do are called respondents. The percentage of respondents out of the total sample is called the response rate.

  24. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL For the Literary Digest poll, out of a sample of 10 million people who were mailed a mock ballot only about 2.4 million mailed a ballot back, resulting in a 24% response rate. When the response rate to a survey is low, the survey is said to suffer from nonresponse bias.
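
The response-rate figure can be checked with a short calculation. The function name below is our own; the two counts are the approximate ones quoted for the Literary Digest poll.

```python
def response_rate(respondents: int, sample_size: int) -> float:
    """Percentage of the people in the sample who responded."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 100 * respondents / sample_size

# About 2.4 million ballots returned out of 10 million mailed.
digest_rate = response_rate(2_400_000, 10_000_000)
print(f"Literary Digest response rate: {digest_rate:.0f}%")
```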

  25. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL One of the significant problems with the Literary Digest poll was that the poll was conducted by mail. This approach is the most likely to magnify nonresponse bias, because people often consider a mailed questionnaire just another form of junk mail. Of course, given the size of their sample, the Literary Digest hardly had a choice. This illustrates another important point: Bigger is not better, and a big sample can be more of a liability than an asset.

  26. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL The Literary Digest story has two morals: (1) You’ll do better with a well-chosen small sample than with a badly chosen large one, and (2) watch out for selection bias and nonresponse bias.

  27. Convenience Sampling One commonly used shortcut in sampling is known as convenience sampling. In convenience sampling the selection of which individuals are in the sample is dictated by what is easiest or cheapest for the data collector, with no attempt to get a representative sample. A classic example of convenience sampling is when interviewers set up at a fixed location such as a mall or outside a supermarket and ask passersby to be part of a public opinion poll.

  28. Convenience Sampling A different type of convenience sampling occurs when the sample is based on self-selection–the sample consists of those individuals who volunteer to be in it. Self-selection is the reason why many Area Code 800 polls are not to be trusted. Convenience sampling is not always bad–at times there is no other choice or the alternatives are so expensive that they have to be ruled out.

  29. Convenience Sampling We should keep in mind, however, that data collected through convenience sampling are naturally tainted and should always be scrutinized (that’s why we always want to get to the details of how the data were collected). More often than not, convenience sampling gives us data that are too unreliable to be of any scientific value. With data, as with so many other things, you get what you pay for.

  30. Quota Sampling Quota sampling is a systematic effort to force the sample to be representative of a given population through the use of quotas–the sample should have so many women, so many men, so many blacks, so many whites, so many people living in urban areas, so many people living in rural areas, and so on. The proportions in each category in the sample should be the same as those in the population.

  31. Quota Sampling If we can assume that every important characteristic of the population is taken into account when the quotas are set up, it is reasonable to expect that the sample will be representative of the population and produce reliable data.
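
A minimal sketch of how quotas follow from population proportions is given below. The category names, the shares, and the sample size are hypothetical and chosen only for illustration; they do not come from any actual poll.

```python
def quota_counts(sample_size, population_shares):
    """Turn population shares (fractions that sum to 1) into per-category quotas."""
    return {
        category: round(sample_size * share)
        for category, share in population_shares.items()
    }

# Hypothetical shares; a real poll would take these from census figures.
shares = {"urban": 0.8, "rural": 0.2}
quotas = quota_counts(1500, shares)
print(quotas)
```

Because each quota is rounded separately, the quotas can miss the intended sample size by a unit or two when there are many categories. The sketch only sets the quotas; it says nothing about which individuals are picked to fill them.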

  32. CASE STUDY 3 THE 1948 PRESIDENTIAL ELECTION What’s wrong with quota sampling? After all, the basic idea behind it appears to be a good one: Force the sample to be a representative cross section of the population by having each important characteristic of the population proportionally represented in the sample.

  33. CASE STUDY 3 THE 1948 PRESIDENTIAL ELECTION Where do we stop? No matter how careful we might be, we might miss some criterion that would affect the way people vote, and the sample could be deficient in this regard. An even more serious flaw in quota sampling is that, other than meeting the quotas, the interviewers are free to choose whom they interview. This opens the door to selection bias. Looking back over the history of quota sampling, we can see a clear tendency to overestimate the Republican vote.

  34. 13 Collecting Statistical Data 13.1 The Population 13.2 Sampling 13.3 Random Sampling 13.4 Sampling: Terminology and Key Concepts 13.5 The Capture-Recapture Method 13.6 Clinical Studies

  35. Random Sampling The best alternative to human selection is to let the laws of chance determine the selection of a sample. Sampling methods that use randomness as part of their design are known as random sampling methods, and any sample obtained through random sampling is called a random sample (or a probability sample).

  36. Simple Random Sampling The most basic form of random sampling is called simple random sampling. It is based on the same principle as a lottery: any set of numbers of a given size has the same chance of being chosen as any other set of numbers of that size. In theory, simple random sampling is easy to implement. We put the name of each individual in the population in “a hat,” mix the names well, and then draw as many names as we need for our sample. Of course “a hat” is just a metaphor.

  37. Simple Random Sampling These days, the “hat” is a computer database containing a list of members of the population. A computer program then randomly selects the names. This is a fine idea for small, compact populations, but a hopeless one when it comes to national surveys and public opinion polls. For most public opinion polls–especially those done on a regular basis–the time and money needed to do this are simply not available.
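
The “names in a hat” idea can be sketched with Python's standard `random` module. The population list below is hypothetical and stands in for the database of names; `random.Random.sample` draws without replacement, so every subset of the requested size is equally likely.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members; every subset of that size is equally likely."""
    rng = random.Random(seed)   # a seed makes the draw reproducible
    return rng.sample(population, sample_size)

# Hypothetical list of 200 names standing in for the "hat".
population = [f"person_{i:03d}" for i in range(200)]
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```

The sketch presupposes a complete list of the population, which is exactly what is unavailable or too costly for national polls.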

  38. Stratified Sampling The alternative to simple random sampling used nowadays for national surveys and public opinion polls is a sampling method known as stratified sampling. The basic idea of stratified sampling is to break the sampling frame into categories called strata, and then (unlike quota sampling) randomly choose a sample from these strata. The chosen strata are then further divided into categories, called substrata, and a random sample is taken from these substrata.

  39. Stratified Sampling The selected substrata are further subdivided, a random sample is taken from them, and so on. The process goes on for a predetermined number of steps (usually four or five).
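
The staged procedure described above can be sketched as follows, assuming the sampling frame is available as nested dictionaries (strata containing substrata containing lists of individuals). The frame, the function name `staged_sample`, and the number of picks per stage are all hypothetical; a real poll would use more stages and real geographic and demographic categories.

```python
import random

def staged_sample(frame, picks_per_stage, rng):
    """Randomly pick categories level by level, then individuals at the last level.

    frame: nested dicts (strata -> substrata -> ...) whose innermost values
           are lists of individuals.
    picks_per_stage: how many categories to keep at each level, followed by
           how many individuals to draw from each surviving list.
    """
    *category_picks, individuals_per_group = picks_per_stage
    groups = [frame]
    for k in category_picks:
        next_groups = []
        for group in groups:
            # Random choice among this group's categories (no human selection).
            chosen = rng.sample(sorted(group), min(k, len(group)))
            next_groups.extend(group[name] for name in chosen)
        groups = next_groups
    sample = []
    for individuals in groups:
        k = min(individuals_per_group, len(individuals))
        sample.extend(rng.sample(individuals, k))
    return sample

# Hypothetical frame: 5 strata, each with 4 substrata of 10 individuals.
frame = {
    f"stratum_{i}": {
        f"substratum_{i}_{j}": [f"person_{i}_{j}_{k}" for k in range(10)]
        for j in range(4)
    }
    for i in range(5)
}
sample = staged_sample(frame, [2, 2, 3], random.Random(7))
print(len(sample))  # 2 strata x 2 substrata x 3 individuals = 12
```

Only the chosen strata and substrata ever need to be listed in full, which is where the savings over simple random sampling come from.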

  40. CASE STUDY 4 NATIONAL PUBLIC OPINION POLLS In national public opinion polls the strata and substrata are defined by a combination of geographic and demographic criteria. For example, the nation is first divided into “size of community” strata (big cities, medium cities, small cities, villages, rural areas, etc.). The strata are then subdivided by geographical region (New England, Middle Atlantic, East Central, etc.). This is the first layer of substrata.

  41. CASE STUDY 4 NATIONAL PUBLIC OPINION POLLS The efficiency of stratified sampling compared with simple random sampling in terms of cost and time is clear. The members of the sample are clustered in well-defined and easily manageable areas, significantly reducing the cost of conducting interviews as well as the response time needed to collect the data. For a large, heterogeneous nation like the United States, stratified sampling has generally proved to be a reliable way to collect national data.

  42. CASE STUDY 4 NATIONAL PUBLIC OPINION POLLS What about the size of the sample? Surprisingly, it does not have to be very large. Typically, a Gallup poll is based on samples consisting of approximately 1500 individuals, and roughly the same size sample can be used to poll the population of a small city as the population of the United States. The size of the sample does not have to be proportional to the size of the population.
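
The claim that sample size need not grow with population size can be explored with a small simulation. This is only a sketch under stated assumptions: it uses simple random sampling rather than the stratified design of a real poll, the two population sizes are arbitrary, and the 40% “yes” share is a hypothetical parameter.

```python
import random

def estimate_share(population_size, true_share, sample_size, rng):
    """Estimate the share of 'yes' individuals from one simple random sample.

    Individuals are numbered 0..population_size-1; the first
    true_share * population_size of them form the 'yes' group.
    """
    yes_cutoff = int(population_size * true_share)
    drawn = rng.sample(range(population_size), sample_size)
    return sum(1 for i in drawn if i < yes_cutoff) / sample_size

rng = random.Random(2024)
TRUE_SHARE = 0.40  # hypothetical parameter, chosen for illustration
estimates = {
    size: estimate_share(size, TRUE_SHARE, 1500, rng)
    for size in (50_000, 5_000_000)
}
for size, estimate in estimates.items():
    print(f"N = {size:>9,}: estimated share = {estimate:.3f}")
```

With the same 1500-person sample, the estimate for the population of five million should land about as close to 0.40 as the estimate for the population of fifty thousand.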

  43. 13 Collecting Statistical Data 13.1 The Population 13.2 Sampling 13.3 Random Sampling 13.4 Sampling: Terminology and Key Concepts 13.5 The Capture-Recapture Method 13.6 Clinical Studies

  44. Survey As we now know, except for a census, the common way to collect statistical information about a population is by means of a survey. When the survey consists of asking people their opinion on some issue, we refer to it as a public opinion poll. In a survey, we use a subset of the population, called a sample, as the source of our information, and from this sample, we try to generalize and draw conclusions about the entire population.

  45. Statistic versus Parameter Statisticians use the term statistic to describe any kind of numerical information drawn from a sample. A statistic is always an estimate for some unknown measure, called a parameter, of the population. A parameter is the numerical information we would like to have. Calculating a parameter is difficult and often impossible, since the only way to get the exact value for a parameter is to use a census. If we use a sample, then we can get only an estimate for the parameter, and this estimate is called a statistic.

  46. Sampling Error We will use the term sampling error to describe the difference between a parameter and a statistic used to estimate that parameter. In other words, the sampling error measures how much the data from a survey differs from the data that would have been obtained if a census had been used. Sampling error can be attributed to two factors: chance error and sampling bias.
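
To make the three terms concrete, here is a minimal sketch using a hypothetical population of 1,000 numerical values, small enough that the parameter can actually be computed. In a real survey the parameter is unknown, so the sampling error cannot be computed this way.

```python
import random
from statistics import mean

# Hypothetical population of 1,000 numerical values.
population = list(range(1, 1001))
parameter = mean(population)            # needs every member, i.e. a census

rng = random.Random(13)
sample = rng.sample(population, 50)     # a survey looks at only 50 members
statistic = mean(sample)                # estimate of the parameter

sampling_error = statistic - parameter  # difference between statistic and parameter
print(f"parameter = {parameter}, statistic = {statistic}, error = {sampling_error:+.2f}")
```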

  47. Chance Error Chance error is the result of the basic fact that a sample, being just a sample, can only give us approximate information about the population. In fact, different samples are likely to produce different statistics for the same population, even when the samples are chosen in exactly the same way–a phenomenon known as sampling variability. While sampling variability, and thus chance error, are unavoidable, with careful selection of the sample and the right choice of sample size they can be kept to a minimum.
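
Sampling variability can be seen directly by drawing several samples in exactly the same way and comparing the resulting statistics. The population, the sample sizes, and the number of repetitions below are hypothetical choices made for illustration.

```python
import random
from statistics import mean, pstdev

population = list(range(1, 1001))   # hypothetical population, mean 500.5
rng = random.Random(5)

# Five samples of 50, all chosen in exactly the same way.
sample_means = [mean(rng.sample(population, 50)) for _ in range(5)]
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: mean = {m}")

def spread_of_means(sample_size, repetitions=200):
    """Standard deviation of the sample mean over many repeated samples."""
    means = [mean(rng.sample(population, sample_size)) for _ in range(repetitions)]
    return pstdev(means)

small_spread = spread_of_means(10)
large_spread = spread_of_means(400)
print(f"spread with n = 10:  {small_spread:.1f}")
print(f"spread with n = 400: {large_spread:.1f}")
```

The five means differ from one another even though nothing about the method changed, and the spread of the sample mean shrinks as the sample size grows.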

  48. Sampling Bias Sampling bias is the result of choosing a bad sample and is a much more serious problem than chance error. Even with the best intentions, getting a sample that is representative of the entire population can be very difficult and can be affected by many subtle factors. Sampling bias is the result. As opposed to chance error, sampling bias can be eliminated by using proper methods of sample selection.

  49. Sampling Proportion Last, we shall make a few comments about the size of the sample, typically denoted by the letter n (to contrast with N, the size of the population). The ratio n/N is called the sampling proportion. A sampling proportion of x% tells us that the size of the sample is intended to be x% of the population. Some sampling methods are conducive to choosing the sample so that a given sampling proportion is obtained.
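
The ratio n/N is simple to compute in either direction. The helper names and the figures below (a 1,500-person sample from a city of 300,000) are hypothetical.

```python
def sampling_proportion(n: int, N: int) -> float:
    """Sampling proportion n/N expressed as a percentage."""
    if N <= 0 or not 0 <= n <= N:
        raise ValueError("need N > 0 and 0 <= n <= N")
    return 100 * n / N

def sample_size_for(N: int, percent: float) -> int:
    """Sample size n that gives (approximately) the requested sampling proportion."""
    return round(N * percent / 100)

# Hypothetical figures: a 1,500-person sample from a city of 300,000.
print(sampling_proportion(1500, 300_000))   # 0.5
print(sample_size_for(300_000, 0.5))        # 1500
```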
