The Scientific Study of Politics (POL 51)

Professor B. Jones, University of California, Davis

Today • Sampling Plans • Survey Research

Populations • Key Concepts • Population • Defined by the research • “All U.S. citizens age 18 or older.” • All democratic countries • Counties in the United States • Characteristics of a Population • Bounded and definable • If you can’t define the population, you probably don’t have a well formed research question!

Populations vs. Samples • Populations are often unattainable • TOO BIG (U.S. population) • Very Costly to Obtain • May not be necessary • The beauty of statistical theory • Samples • Simply Defined: a subset of the population chosen in some manner • How you choose is the important question!

Moving Parts of a Sample • Units of Analysis • J is the population • i is a member of J • Then i is a “sample element” • Sampling Frames • The actual source of the data • Literary Digest Poll (1936) • “Dewey Defeats Truman” (1948) • Exit Polls

More Moving Parts • Sampling Unit • Could be same as sample element (Unit of Analysis) • But it could be collections of elements (cluster, stratified sampling) • Sampling Plan • Random? Nonrandom?

Kinds of Samples • Simple Random Sample • Major Characteristic: Every sample element has an equi-probable chance of selection. • If done properly, maximizes the likelihood of a representative sample. • What if your assumption of randomness goes badly? • Nonrandom samples (often) produce nonrepresentative surveys.

Why Randomness is Goodness • Nonprobability Sampling • Probability of “getting into” the sample is unknown • All bets are off; inference most likely impossible • Highly unreliable! • Simple Random Sampling • Every sample element has the same probability of being selected: Pr(selection)=1/N • In practice, not always easy to guarantee or achieve • An Example of a Bad Assumption

Some Data

More Data

Getting Probability Samples Wrong

Draft Lottery • The lottery was not a simple random sample. • Avg. Lottery Number Jan.-June: 206 • Avg. Lottery Number July-Dec.: 161 • Avg. Deaths Jan.-June: 159 • Avg. Deaths July-Dec.: 111 • Differences highly significant. • The absence of randomness had profound consequences. • Randomness should have ensured an equal chance of being drafted, invariant to birth date. It didn’t. • By analogy, suppose college admissions were based on this kind of lottery… • Those of you born later in the year would be less likely to be admitted. • Would you consider that fair?

How to Achieve Randomness • Random number generation • Modern computers are really good at this. • Assign sample elements a number • Generate a random numbers table • Use a decision rule upon which to select sample. • The Key: sampled units are randomly drawn. • Why Important? Randomness helps ensure REPRESENTATIVENESS! • Absent this, all bets are off: • Convenience Polls • Push Polls • Person-on-the-Street Interviews
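
The steps on this slide (number the elements, generate random numbers, apply a decision rule) can be sketched in R as follows. This is only an illustration: the 20-element population, the seed, and the rule "keep the 5 smallest random numbers" are assumptions made for the example, not part of the lecture.

```r
# Hypothetical population of 20 numbered elements (IDs are illustrative).
set.seed(51)
ids <- 1:20

# Step 1: attach a random number to every element.
draws <- runif(length(ids))

# Step 2: decision rule: keep the 5 elements with the smallest random numbers.
n <- 5
chosen <- ids[order(draws)[1:n]]

# Shortcut with the same selection probabilities: sample() without replacement.
shortcut <- sample(ids, size = n, replace = FALSE)

chosen
shortcut
```

Both approaches give every element the same chance of selection; the two vectors will generally contain different elements because they use different random draws.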

A Population and Some “Samples” • A “Population” • Striations represent “attitudes” • Some “Samples”

Other Kinds of Sampling Strategies • Stratified Samples: a probability sample in which elements sharing some characteristic are grouped and then sample elements are randomly chosen from each group. • Benefit? Can ensure more representative sample with smaller sample sizes. • Why might this be the case?

Sampling comes to life in…R! • Suppose we have a population of about 100,000 • And in that population, we have 4 groups (together 99,000 elements, so shares in the simulated population are slightly above the nominal percentages) • Group 1: 13,000 (13 percent) • Group 2: 12,000 (12 percent) • Group 3: 4,000 ( 4 percent) • Group 4: 70,000 (70 percent) • Racial/Ethnic Characteristics in the US: US Census • White (69.13 percent) • Black (12.06 percent) • Hispanic (12.55 percent) • Asian (3.6 percent) • Some R Code

R
#Creating a population consisting of 4 groups (the group sizes sum to 99,000)
library(catspec)
set.seed(535126235)
population<- rep(1:4,c(13000, 12000, 4000, 70000))
#Tabulating the population (ctab requires package catspec)
ctab(table(population))

#Output (percents are not whole numbers because the groups sum to 99,000, not 100,000)
> ctab(table(population))
              Count Total %
population
1          13000.00   13.13
2          12000.00   12.12
3           4000.00    4.04
4          70000.00   70.71
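
For readers without the catspec package, a base-R tabulation gives the same counts and percentages. This is a sketch that assumes only the `population` vector defined on the slide; the names `N`, `counts`, and `pct` are introduced here for illustration. It also shows why the percentages are not whole numbers: the four groups sum to 99,000.

```r
# Same population as on the slide: four groups coded 1-4.
population <- rep(1:4, c(13000, 12000, 4000, 70000))

N <- length(population)                     # total number of elements
counts <- table(population)                 # group counts
pct <- round(100 * prop.table(counts), 2)   # group shares in percent

N
cbind(Count = counts, Percent = pct)
```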

Sampling • What do we expect from random sampling? • That each sample approximately reproduces the population proportions. • Let’s consider SIMPLE RANDOM SAMPLES. • Also, let’s consider small samples (size 100) • …which is a 0.1 percent sample.

R: 3 samples of n=100
#Three Simple Random Samples without Replacement; n=100 which is a 0.1 percent sample
#The set.seed command ensures I can exactly replicate the simulations
set.seed(15233)
srs1<-sample(population, size=100, replace=FALSE)
ctab(table(srs1))
set.seed(5255563)
srs2<-sample(population, size=100, replace=FALSE)
ctab(table(srs2))
set.seed(5255)
srs3<-sample(population, size=100, replace=FALSE)
ctab(table(srs3))

R: Sample Results
> set.seed(15233)
> srs1<-sample(population, size=100, replace=FALSE)
> ctab(table(srs1))
     Count Total %
srs1
1       19      19
2       13      13
3        5       5
4       63      63
> set.seed(5255563)
> srs2<-sample(population, size=100, replace=FALSE)
> ctab(table(srs2))
     Count Total %
srs2
1       16      16
2        8       8
3        4       4
4       72      72
> set.seed(5255)
> srs3<-sample(population, size=100, replace=FALSE)
> ctab(table(srs3))
     Count Total %
srs3
1       12      12
2        9       9
3        1       1
4       78      78

Implications? • Small samples? • Variability in proportion of groups. • Why does this occur? • Let’s understand stratification. • What does it do? • You’re sampling within strata. • Suppose we know the population proportions?
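
One way to see why small samples vary is to repeat the n=100 draw many times and look at the spread of a single group's share. The sketch below is an illustration, not part of the original slides: the seed, the 2,000 replications, and the helper `share_group3` are assumptions. It compares the simulated spread for Group 3 with the usual binomial approximation sqrt(p(1-p)/n).

```r
set.seed(2024)
population <- rep(1:4, c(13000, 12000, 4000, 70000))

# Share of Group 3 in one simple random sample of size n.
share_group3 <- function(n) mean(sample(population, size = n) == 3)

# Repeat the n = 100 draw 2,000 times to see how much the share moves around.
shares <- replicate(2000, share_group3(100))

p <- mean(population == 3)             # true share, about 0.04
se_theory <- sqrt(p * (1 - p) / 100)   # binomial approximation, about 0.02

round(c(true = p, mean_of_samples = mean(shares),
        sd_of_samples = sd(shares), se_theory = se_theory), 4)
```

With a true share of about 4 percent and a standard error of about 2 percentage points, samples showing 1 percent or 5 percent for Group 3 are unsurprising.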

R: Identifying Strata and then Sampling from them.
#Stratified Sampling
#Creating the Groupings
#Note: every element within a stratum is coded 1, so each table below has a single row
strata1<- rep(1,c(13000))
strata2<- rep(1,c(12000))
strata3<- rep(1,c(4000))
strata4<- rep(1,c(70000))
#Sampling by strata
#Selecting observations proportional to known population values: Proportionate Sampling
set.seed(52524425)
srs4<-sample(strata1, size=13, replace=FALSE)
ctab(table(srs4))
set.seed(4244225)
srs5<-sample(strata2, size=12, replace=FALSE)
ctab(table(srs5))
set.seed(33325)
srs6<-sample(strata3, size=4, replace=FALSE)
ctab(table(srs6))
set.seed(1114225)
srs7<-sample(strata4, size=70, replace=FALSE)
ctab(table(srs7))

R: Results? Proportional Sampling w/small samples.
> srs4<-sample(strata1, size=13, replace=FALSE)
> ctab(table(srs4))
     Count Total %
srs4
1       13     100
>
> set.seed(4244225)
> srs5<-sample(strata2, size=12, replace=FALSE)
> ctab(table(srs5))
     Count Total %
srs5
1       12     100
>
> set.seed(33325)
> srs6<-sample(strata3, size=4, replace=FALSE)
> ctab(table(srs6))
     Count Total %
srs6
1        4     100
>
> set.seed(1114225)
> srs7<-sample(strata4, size=70, replace=FALSE)
> ctab(table(srs7))
     Count Total %
srs7
1       70     100

Proportionate Sampling • What do we see? • If we know the proportions of the relevant stratification variable(s)… • Then sample from the groups. • SMALL SAMPLES can reproduce certain characteristics of the population. • But of course, it is probabilistic.
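
Proportionate sampling can also be sketched on a single labeled population, so that the combined sample keeps its group codes. This is a minimal illustration under stated assumptions: the seed and the names `strata`, `n_h`, `picked`, and `strat_sample` are introduced here, and the allocation (13, 12, 4, 70) is the one used on the slides.

```r
set.seed(99)
population <- rep(1:4, c(13000, 12000, 4000, 70000))

# Positions of the elements belonging to each stratum.
strata <- split(seq_along(population), population)

# Allocation used on the slides: 13, 12, 4 and 70 elements.
n_h <- c(13, 12, 4, 70)

# Simple random sample, without replacement, inside every stratum.
picked <- unlist(Map(function(idx, n) sample(idx, size = n), strata, n_h))
strat_sample <- population[picked]

table(strat_sample)
```

Because the number drawn from each stratum is fixed in advance, the group counts in the combined sample match the allocation exactly; only which members are drawn is random.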

Disproportionate Sampling • Why? • “Oversampling” may be of interest when research centers on small pockets in the population. • Race is often an issue in this context.

R: Disproportionate Sampling
> #Sampling by strata
> #Selecting observations disproportional to known population values: Disproportionate Sampling
> #"Oversampling by Race"
> set.seed(5555425)
> srs8<-sample(strata1, size=24, replace=FALSE)
> ctab(table(srs8))
     Count Total %
srs8
1       24     100
>
> set.seed(4222225)
> srs9<-sample(strata2, size=22, replace=FALSE)
> ctab(table(srs9))
     Count Total %
srs9
1       22     100
>
> set.seed(103325)
> srs10<-sample(strata3, size=14, replace=FALSE)
> ctab(table(srs10))
      Count Total %
srs10
1        14     100
>
> set.seed(11534)
> srs11<-sample(strata4, size=70, replace=FALSE)
> ctab(table(srs11))
      Count Total %
srs11
1        70     100
>

Disproportionate Samples • What did I ask R to do? • I “oversampled” for some groups. • Again, understand why we, as researchers, might want to do this.
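
Oversampling distorts the group shares in the raw sample, so analysts commonly attach design weights when they want population-level estimates. Weighting is not covered on these slides; the sketch below is an added illustration that assumes the stratum sizes and the oversampled allocation (24, 22, 14, 70) shown above, with `N_h`, `n_h`, `grp`, and `w` as names chosen for the example.

```r
# Stratum sizes from the slides and the oversampled allocation (24, 22, 14, 70).
N_h <- c(13000, 12000, 4000, 70000)
n_h <- c(24, 22, 14, 70)

# Group labels of the 130 sampled elements.
grp <- rep(1:4, n_h)

# Design weight: how many population members each sampled element stands for.
w <- (N_h / n_h)[grp]

unweighted <- round(100 * prop.table(table(grp)), 2)
weighted   <- round(100 * tapply(w, grp, sum) / sum(w), 2)

rbind(unweighted, weighted)
```

The unweighted row overstates Groups 1 to 3 and understates Group 4; the weighted row returns the population shares.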

Side-trip: Sample Sizes • Who is happy with a 0.1 percent SRS? • On the other hand… • What do we get from a stratified sample? • Suppose we increase n in a SRS? • It’s R time!

R: SRS with a 1 percent sample
> #Sample Size=1000
>
> set.seed(1775233)
> srs1<-sample(population, size=1000, replace=FALSE)
> ctab(table(srs1))
     Count Total %
srs1
1    129.0    12.9
2     97.0     9.7
3     46.0     4.6
4    728.0    72.8
>
> set.seed(5200563)
> srs2<-sample(population, size=1000, replace=FALSE)
> ctab(table(srs2))
     Count Total %
srs2
1    117.0    11.7
2    127.0    12.7
3     41.0     4.1
4    715.0    71.5
>
> set.seed(52909)
> srs3<-sample(population, size=1000, replace=FALSE)
> ctab(table(srs3))
     Count Total %
srs3
1    147.0    14.7
2    126.0    12.6
3     39.0     3.9
4    688.0    68.8
>

Implications? • Sample Size MATTERS • What do we see? • Note, again, what stratification “buys” us. • The issues with stratification? • Another R example (code posted on website)
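
The effect of sample size can be put in numbers with the standard error of a sample proportion, sqrt(p(1-p)/n). The short sketch below is an added illustration: it uses Group 3's share from the slides, ignores the finite-population correction, and the sample sizes in `n` are chosen for the example.

```r
p <- 4000 / 99000              # Group 3 share in the slide's population
n <- c(100, 1000, 10000)       # illustrative sample sizes
se <- sqrt(p * (1 - p) / n)    # ignores the finite-population correction

round(100 * se, 2)             # standard error in percentage points
```

The standard error falls from roughly 2 percentage points at n=100 to roughly 0.6 at n=1000: ten times the sample buys only about a threefold gain in precision, which is one of the trade-offs of simply increasing n.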

R • We again have 4 kinds of sample elements
> set.seed(52352)
> urn<-sample(c(1,2,3,4),size=1000, replace=TRUE)
>
> #My Population
> ctab(table(urn))
    Count Total %
urn
1   239.0    23.9
2   253.0    25.3
3   268.0    26.8
4   240.0    24.0

R version of a person-on-the-street interview
> #Convenience Sample: What shows up
>
> con<-matrixurn[1:10]; con
 [1] 1 1 1 3 4 2 4 3 4 3
>
> ctab(table(con))
    Count Total %
con
1       3      30
2       1      10
3       3      30
4       3      30

R and Samples, redux • What do we find? • Very unreliable sample: we oversample some groups, undersample others. • Useless data more than likely. • What do you imagine happens when we increase the sample sizes?

R and SRS with samples of size n
#Sample: Sizes 10, 50, 75, 100, 200, 250, 900, 1000
set.seed(562)
s1<-sample(urn, 10, replace=FALSE)
ctab(table(s1))
set.seed(58862)
s1a<-sample(urn, 50, replace=FALSE)
ctab(table(s1a))
set.seed(562657)
s1b<-sample(urn, 75, replace=FALSE)
ctab(table(s1b))
set.seed(58862)
s2<-sample(urn, 100, replace=FALSE)
ctab(table(s2))
set.seed(58862)
s3<-sample(urn, 200, replace=FALSE)
ctab(table(s3))
set.seed(10562)
s4<-sample(urn, 250, replace=FALSE)
ctab(table(s4))
set.seed(22562)
s5<-sample(urn, 900, replace=FALSE)
ctab(table(s5))
set.seed(56882)
s6<-sample(urn, 1000, replace=FALSE)
ctab(table(s6))

Sampling and Sample Size
> #Sample: Sizes 10, 50, 75, 100, 200, 250, 900, 1000
>
> set.seed(562)
> s1<-sample(urn, 10, replace=FALSE)
> ctab(table(s1))
   Count Total %
s1
1      2      20
2      4      40
3      2      20
4      2      20
>
> set.seed(58862)
> s1a<-sample(urn, 50, replace=FALSE)
> ctab(table(s1a))
    Count Total %
s1a
1      13      26
2      13      26
3      13      26
4      11      22
>

Sample Sizes
> set.seed(562657)
> s1b<-sample(urn, 75, replace=FALSE)
> ctab(table(s1b))
    Count Total %
s1b
1   22.00   29.33
2   18.00   24.00
3   22.00   29.33
4   13.00   17.33
>
> set.seed(58862)
> s2<-sample(urn, 100, replace=FALSE)
> ctab(table(s2))
   Count Total %
s2
1     27      27
2     24      24
3     22      22
4     27      27
>

Sample Size
> set.seed(58862)
> s3<-sample(urn, 200, replace=FALSE)
> ctab(table(s3))
   Count Total %
s3
1     54      27
2     48      24
3     48      24
4     50      25
>
> set.seed(10562)
> s4<-sample(urn, 250, replace=FALSE)
> ctab(table(s4))
   Count Total %
s4
1   62.0    24.8
2   67.0    26.8
3   56.0    22.4
4   65.0    26.0
>

Sample Size
> set.seed(22562)
> s5<-sample(urn, 900, replace=FALSE)
> ctab(table(s5))
    Count Total %
s5
1  220.00   24.44
2  231.00   25.67
3  234.00   26.00
4  215.00   23.89
>
> set.seed(56882)
> s6<-sample(urn, 1000, replace=FALSE)
> ctab(table(s6))
   Count Total %
s6
1  239.0    23.9
2  253.0    25.3
3  268.0    26.8
4  240.0    24.0
>

R: What did we learn? • Sample size seems to have some impact here. • But there are trade-offs.
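
The sample-size exercise can be condensed into one loop that reports, for each n, the largest gap between the sample shares and the population shares. This is a sketch using base R only; `pop_share`, `sizes`, and `max_gap` are names introduced here. Because R's default sampling algorithm changed in version 3.6.0, the numbers will probably not match the slide output even with the same seed.

```r
set.seed(52352)
urn <- sample(c(1, 2, 3, 4), size = 1000, replace = TRUE)
pop_share <- prop.table(table(factor(urn, levels = 1:4)))

sizes <- c(10, 50, 75, 100, 200, 250, 900, 1000)

# Largest gap (in percentage points) between sample and population shares.
max_gap <- sapply(sizes, function(n) {
  s <- sample(urn, n, replace = FALSE)
  samp_share <- prop.table(table(factor(s, levels = 1:4)))
  100 * max(abs(samp_share - pop_share))
})

data.frame(n = sizes, max_gap = round(max_gap, 1))
```

The last row is always zero: a sample of 1,000 drawn without replacement from a population of 1,000 is the whole population.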

Important Moving Parts • Randomness (covered!) • Sampling Frame • Random sampling from a bad sampling frame produces bad samples. • Sample Size • What is your intuition about sample sizes? • Must they always be large? • Not necessarily so…although…
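
The point about sampling frames can be illustrated with a small simulation. Everything in it is hypothetical and chosen only for the example: 40 percent of the population appears in the frame, and people in the frame support a candidate at a lower rate (35 percent) than everyone else (60 percent).

```r
set.seed(1936)
N <- 100000

# Hypothetical population: 40 percent are listed in the frame (e.g., a directory).
in_frame <- rbinom(N, 1, 0.40) == 1

# Hypothetical attitude: listed people support the candidate less often.
support <- rbinom(N, 1, ifelse(in_frame, 0.35, 0.60))

true_share  <- mean(support)
frame_share <- mean(sample(support[in_frame], 2000))  # random draw, wrong frame
full_share  <- mean(sample(support, 2000))            # random draw, whole population

round(c(truth = true_share, bad_frame = frame_share, good_frame = full_share), 3)
```

Both samples are drawn at random and both are large, yet only the one drawn from the full population lands near the truth. Randomness cannot repair a frame that leaves out part of the population.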

Bad Sampling • Person-on-the-Street Interviews • What do these imply? • Small samples and inherently nonrandom • Likely poor inference. • Other examples? • Not all non-random samples are necessarily bad • Purposive Samples
