The Scientific Study of Politics (POL 51)

The Scientific Study of Politics (POL 51) Professor B. Jones University of California, Davis

Today • Sampling Plans • Survey Research

More fun with simulations samplesize<-10000 population<-rnorm(samplesize, 5, 2) truth<-mean(population) sdtruth<-sd(population) truth Sdtruth Here’s what I know in the “population”: > truth [1] 5.002265 > sdtruth [1] 2.003601

What do my samples look like? ten<-sample(population, 10, replace=F) m1<-mean(ten); m1 sd1<-sd(ten) hist(ten) fifty<-sample(population, 50, replace=F) m2<-mean(fifty); m2 sd2<-sd(fifty) hist(fifty) hundred<-sample(population, 100, replace=F) m3<-mean(hundred); m3 sd3<-sd(hundred) hist(hundred) . . .

Sampling Sizes • In general, we’ve seen larger sample sizes yield more accurate conclusions. • Though the differences between very large and just “merely” large samples may in fact be negligible. • Requires us to turn to the concept of repeated sampling and sample variability.

Polls and Repeated Sampling • As individual researchers, you usually have one “shot” at it. • Statistical theory (classical) relies on the concept of long-run probability • Repeated trials • …law of large numbers • …central limit theorem • Maybe concepts you have heard of before? …or not.

Side-trip to the 2008 Presidential Election • Pollster.com allows us to think about “repeated” sampling. • This site basis its analysis on all available polls • Why might this be a good thing? • There is sampling variability in individual samples. • Let’s look at polls that were leading up to the 2008 Election

What are the “dots” • The blue dots are Obama percentage (estimates) • The red dots are McCain • (Each blue dot has a corresponding red) • Note variability in samples: sampling frames, methodologies differ. • Combine them, and you get a better picture. • Look at solid red and blue states.

Polls • Note how the polls seem to be “clustering” as the election gets closer. • Why? • Undecideds deciding? • More certainty? • Let’s look at close states.

Understanding variability • We kind of see “repeated sampling” • The basic idea: • The “truth” will be revealed if you just sample enough • But any one sample may be off in one direction or another. • Back to sampling • Let’s simulate repeated sampling in R

More Simulation • The Population • N=1,000,000 • Mean of the Population is 0.4992135 • R Code: #"The Population" X<-runif(1000000,.01,.99) meanX <- mean(X); meanX

Let’s Sample n=500, 1000, 5000. • First Sample: Mean=.4692207 • Second Sample: Mean=.5004778 • Third Sample: Mean=.5027007 #Some Samples: First, sample 1, n=500, evaluate: set.seed(52151) nsamp <- 1 res <- numeric(nsamp) for (i in 1:nsamp) res[i] <- mean(sample(X, 500, replace = FALSE)) mean(res) #Some Samples: Second, sample 2, n=1000, evaluate: set.seed(110789008) nsamp <- 1 res <- numeric(nsamp) for (i in 1:nsamp) res[i] <- mean(sample(X, 1000, replace = FALSE)) mean(res) #Some Samples: Third, sample 3, n=5000, evaluate: set.seed(16978) nsamp <- 1 res <- numeric(nsamp) for (i in 1:nsamp) res[i] <- mean(sample(X, 5000, replace = FALSE)) mean(res)

Repeated Sampling • Suppose we were to take 10 samples of size 500? [1,] 0.4922826 [2,] 0.5114829 [3,] 0.5006157 [4,] 0.5180107 [5,] 0.5083638 [6,] 0.5054319 [7,] 0.4992882 [8,] 0.4612303 [9,] 0.4897318 [10,] 0.5016498 Mean: 0.4988088 S.D.: 0.01568156

Lessons? • Sampling variability is a real issue. • Range in estimates went from .46 to .52 • Way under and way over estimate the mean in certain trials. • However, on average, “we’re close.” • More simulations.

Repeated Sampling • Experiment 1: 1000 samples, n=500 • Mean: 0.4994611 • S.D.: 0.01209907 set.seed(7869324) nsamp <- 1000 res <- numeric(nsamp) for (i in 1:nsamp) res[i] <- mean(sample(X, 500, replace = FALSE)) mean(res); sd(res) hist(res, br=10, xlim=range(.5)) abline(v =meanX)

N=500, 1000 Samples

Repeated Sampling • Experiment 2: 1000 samples, n=1000 • Mean: 0.4988333 • S.D: 0.008994245 set.seed(7454) nsamp <- 1000 res <- numeric(nsamp) for (i in 1:nsamp) res[i] <- mean(sample(X, 1000, replace = FALSE)) mean(res); sd(res) hist(res, br=10, xlim=range(.5) ) abline(v =meanX)

N=1000, 1000 Samples

Repeated Sampling • Experiment 3: 1000 samples, n=5000 • Mean: 0.499128 • S.D.: 0.004016436 set.seed(13433) nsamp <- 1000 res <- numeric(nsamp) for (i in 1:nsamp) res[i] <- mean(sample(X, 5000, replace = FALSE)) mean(res); sd(res) hist(res, br=10, xlim=range(.5)) abline(v =meanX)

N=5000, 1000 Samples

What’s going on?

Sampling Variability • If we “fix” the number of samples, what happened? • As n increases, variability decreases. • “On average, our sample estimate is “close” to the true value… • AND, the variation across samples is decreasing.

Theory • Population Parameter • θis the unknown parm. • What does this equality tell us? • How does it relate to samples?

Sample Proportions • In our examples, we wanted to estimate a proportion. • We knew it’s true value (we usually do not!) • We therefore must sample. • The same concept as before applies:

Probability • “Over repeated samples, the expected value of the proportion will equal the true population proportion.” • This is a good thing. • Sample estimates can do a good job of approximating the population value. • This permits generalizability. • Good sampling technique will produce “unbiased estimates.”

Repeated Sampling Redux • Suppose we were to take 10 samples of size 500? [1,] 0.4922826 [2,] 0.5114829 [3,] 0.5006157 [4,] 0.5180107 [5,] 0.5083638 [6,] 0.5054319 [7,] 0.4992882 [8,] 0.4612303 [9,] 0.4897318 [10,] 0.5016498 Mean: 0.4988088 S.D.: 0.01568156 Mean of the Population is 0.4992135 E(P)=.4988; Population “P”=.4992 E(P)≈P Note, any single sample might be “off”; however, the idea is that there is no systematic tendency to be off one direction or the other.

Sampling Distribution • What we’ve just gone through are simulations of SAMPLING DISTRIBUTONS • Defined: the distribution of a statistic that you obtain from repeated samples of size n from some population.

The Concept of Variance • How far might you be off in a particular sample? • Why, by the way, might you like to know this? • You usually only have ONE sample!! • Is there a way we can determine this degree of variability?

Standard Error of a Proportion • Variance: “Average “squared” deviations • Standard Error: square root of the variance.

Standard Error in Action • Suppose the true population parameter is P. • P=.50 • In repeated samples, you would expect the average sample statistic to approach .50 • Recall prior simulation • What is the “sampling error”? • Using formula from previous slide: • [.5(1-.5)/100]1/2=.05

Interpretation? • If the true population proportion is .50 and we took repeated (random) samples of size 100, the expected value of P would be .50 but the standard deviation would be .05. • .05 is our standard error of the sampling distribution. This is what ought to happen in repeated sampling. • More to it…that comes later.

Put it to the test. > #"The Population" > X<-runif(1000000,.01,.99) > meanX <- mean(X); meanX [1] 0.500889 > sdX<-sd(X); sdX [1] 0.2832314 > > #Sample 100, 1000 times > > set.seed(7324) > nsamp <- 1000 > res <- numeric(nsamp) > for (i in 1:nsamp) res[i] <- mean(sample(X, 100, replace = FALSE)) > mean(res); sd(res) [1] 0.5007463 [1] 0.02781522

Result • What conclusions would I draw from my simulation? • “Best guess” of P is .50. • The average deviation across samples is about .03. • My guess + my error allows me to compute a CONFIDENCE INTERVAL • Estimate +/- Error=C.I.

Confidence Interval • What I’ve really done in my simulation is computed a “68 percent confidence interval.” • .50 plus or minus .03 • 68 percent of all samples give a value for P between (about) .47 and .53 • Classical interpretation: In repeated samples of size 100, the expected value of P will lie in the range .47 to .53, 68 percent of the time. • Why “68 percent”? • 68-95-99.7 Rule and the Normal Distribution

One Sample • You have one sample. • What makes the C.I. big versus small? • The Standard Error • As n goes up, s.e. goes down. • Therefore, C.I. must get smaller.

Illustration

Inference The goal of statistical inference is to make supportable conclusions about the unknown characteristics, or parameters, of a population based on the known characteristics of a sample measured through sample statistics. Any difference between the value of a population parameter and a sample statistic is bias and can be attributed to sampling error.

Inference On average, a sample statistic will equal the value of the population parameter. Any single sample statistic, however, may not equal the value of the population parameter. Consider the sampling distribution: When the means from an infinite number of samples drawn from a population are plotted on a frequency distribution, the mean of the distribution of means will equal the population parameter.

Inference By calculating the standard error of the estimator (or sample statistic), which indicates the amount of numerical variation in the sample estimate, we can estimate confidence. More variation means less confidence in the estimate. Less variation means more confidence.

Implications? • If we want to cut our s.e. in half, we must quadruple the sample size. • N exponentially related to s.e. • S.E. for N=100 is .05 • S.E. for N=400 is .025 • .05/.025=2 • S.E. for N=1000 is .0158 • S.E. for N=4000 is .0079 • .0158/.0079=2 • There are trade-offs between precision and design.

The Scientific Study of Politics (POL 51)