Understanding Survey Sampling and Statistical Demography

STK 4600: Statistical methods for social sciences. Survey sampling and statistical demography. Surveys for households and individuals, by Jan F. Bjørnstad

Survey sampling: 4 major topics • Traditional design-based statistical inference • 7 weeks • Likelihood considerations • 1 week • Model-based statistical inference • 3 weeks • Missing data - nonresponse • 2 weeks

Statistical demography • Mortality • Life expectancy • Population projections • 2 weeks

Course goals • Give students knowledge about: • planning surveys in social sciences • major sampling designs • basic concepts and the most important estimation methods in traditional applied survey sampling • Likelihood principle and its consequences for survey sampling • Use of modeling in sampling • Treatment of nonresponse • A basic knowledge of demography

But first: Basic concepts in sampling Population (target population): The universe of all units of interest for a certain study • Denoted, with N being the size of the population: U = {1, 2, ..., N}. All units can be identified and labeled • Ex: Political poll – All adults eligible to vote • Ex: Employment/unemployment in Norway – All persons in Norway, age 15 or more • Ex: Consumer expenditure: Unit = household Sample: A subset of the population, to be observed. The sample should be ”representative” of the population

Sampling design: • The sample is a probability sample if all units in the sample have been chosen with certain probabilities, and such that each unit in the population has a positive probability of being chosen to the sample • We shall only be concerned with probability sampling • Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen to the sample. • The probability distribution for SRS on all subsets of U is an example of a sampling design: The probability plan for selecting a sample s from the population:
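For SRS the sampling design can be written out explicitly. The following is a reconstruction from the verbal description above (every subset of size n is equally likely); the notation |s| for the size of s is an assumption, not necessarily the slide's own.

```latex
p(s) =
\begin{cases}
  \dfrac{1}{\binom{N}{n}} & \text{if } |s| = n, \\[2ex]
  0 & \text{otherwise,}
\end{cases}
\qquad
P(i \in s) = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}.
```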

Basic statistical problem: Estimation • A typical survey has many variables of interest • Aim of a sample is to obtain information regarding totals or averages of these variables for the whole population • Example: Unemployment in Norway – Want to estimate the total number t of individuals unemployed. For each person i (at least 15 years old) in Norway: y_i = 1 if person i is unemployed, and y_i = 0 otherwise, so that t is the sum of the y_i over the population

In general, variable of interest: y, with y_i equal to the value of y for unit i in the population, and the total is denoted $t = \sum_{i=1}^{N} y_i$ • The typical problem is to estimate t or t/N • Sometimes it is also of interest to estimate ratios of totals: Example – estimating the rate of unemployment: Unemployment rate:
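A sketch of the unemployment rate as a ratio of two totals, assuming the standard definition (unemployed persons divided by persons in the labor force). The indicator x_i and the symbols t_y, t_x and R are introduced here for illustration and may differ from the original slide.

```latex
y_i = \begin{cases} 1 & \text{if person } i \text{ is unemployed} \\ 0 & \text{otherwise} \end{cases}
\qquad
x_i = \begin{cases} 1 & \text{if person } i \text{ is in the labor force} \\ 0 & \text{otherwise} \end{cases}

t_y = \sum_{i=1}^{N} y_i, \qquad
t_x = \sum_{i=1}^{N} x_i, \qquad
R = \frac{t_y}{t_x}.
```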

Sources of error in sample surveys • Target population U vs frame population UF: Access to the population is through a list of units – a register UF. U and UF may not be the same. Three possible errors in UF: • Undercoverage: Some units in U are not in UF • Overcoverage: Some units in UF are not in U • Duplicate listings: A unit in U is listed more than once in UF • UF is sometimes called the sampling frame

Nonresponse - missing data • Some persons cannot be contacted • Some refuse to participate in the survey • Some may be ill and incapable of responding • In postal surveys: Can be as much as 70% nonresponse • In telephone surveys: 50% nonresponse is not uncommon • Possible consequences: • Bias in the sample, not representative of the population • Estimation becomes more inaccurate • Remedies: • imputation, weighting

Measurement error – the correct value of y_i is not measured • In interviewer surveys: • Incorrect marking • Interviewer effect: people may say what they think the interviewer wants to hear – underreporting of alcohol use, tobacco use • Misunderstanding of the question, or not remembering correctly

Sampling «error» • The error (uncertainty, tolerance) caused by observing a sample instead of the whole population • To assess this error- margin of error: measure sample to sample variation • Design approach deals with calculating sampling errors for different sampling designs • One such measure: 95% confidence interval: If we draw repeated samples, then 95% of the calculated confidence intervals for a total t will actually include t

The first 3 errors: nonsampling errors • Can be much larger than the sampling error • In this course: • Sampling error • nonresponse bias • Shall assume that the frame population is identical to the target population • No measurement error

Summary of basic concepts • Population, target population • unit • sample • sampling design • estimation • estimator • measure of bias • measure of variance • confidence interval

Survey errors: • register/frame population • measurement error • nonresponse • sampling error

Example – Psychiatric Morbidity Survey 1993 from Great Britain • Aim: Provide information about prevalence of psychiatric problems among adults in GB as well as their associated social disabilities and use of services • Target population: Adults aged 16-64 living in private households • Sample: Through several stages: 18,000 addresses were chosen and 1 adult in each household was chosen • 200 interviewers, each visiting 90 households

Result of the sampling process
• Sample of addresses: 18,000
    Vacant premises: 927
    Institutions/business premises: 573
    Demolished: 499
    Second home/holiday flat: 236
• Private household addresses: 15,765
    Extra households found: 669
• Total private households: 16,434
    Households with no one 16-64: 3,704
• Eligible households: 12,730
    Nonresponse: 2,622
• Sample: 10,108 households with responding adults aged 16-64

Why sampling? • Reduces costs for an acceptable level of accuracy (money, manpower, processing time...) • May free up resources to reduce nonsampling error and collect more information from each person in the sample • Ex: 400 interviewers at $5 per interview give lower sampling error; 200 interviewers at $10 per interview give lower nonsampling error • Much quicker results

When is a sample representative? • Balance on gender and age: • proportion of women in sample ≈ proportion in population • proportions of age groups in sample ≈ proportions in population • An ideal representative sample: • A miniature version of the population: • implying that every unit in the sample represents the characteristics of a known number of units in the population • Appropriate probability sampling ensures a representative sample ”on the average”

Alternative approaches for statistical inference based on survey sampling • Design-based: • No modeling, only stochastic element is the sample s with known distribution • Model-based: The values y_i are assumed to be values of random variables Y_i: • Two stochastic elements: Y = (Y_1, …, Y_N) and s • Assumes a parametric distribution for Y • Example: suppose we have an auxiliary variable x. Could be: age, gender, education. A typical model is a regression of Y_i on x_i.

Statistical principles of inference imply that the model-based approach is the most sound and valid approach • Start with learning the design-based approach since it is the most applied approach to survey sampling used by national statistical institutes and most research institutes for social sciences. • Is the easy way out: Do not need to model. All statisticians working with survey sampling in practice need to know this approach

Design-based statistical inference • Can also be viewed as a distribution-free nonparametric approach • The only stochastic element: Sample s, distribution p(s) for all subsets s of the population U={1, ..., N} • No explicit statistical modeling is done for the variable y. All yi’s are considered fixed but unknown • Focus on sampling error • Sets the sample survey theory apart from usual statistical analysis • The traditional approach, started by Neyman in 1934

Estimation theory: simple random sample • SRS of size n: each sample s of size n has the same probability of being selected • Can be performed in principle by drawing one unit at a time at random without replacement • Estimation of the population mean of a variable y: a natural estimator is the sample mean • Desirable properties:
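The formulas for this slide did not survive, so the following is a reconstruction from the surrounding text rather than the original layout. The symbols \bar{y}_s for the sample mean and \mu for the population mean are assumptions. Design-unbiasedness is the standard first property; any further properties the slide listed are not recoverable.

```latex
p(s) = \frac{1}{\binom{N}{n}} \quad \text{for every } s \text{ with } |s| = n,
\qquad
\mu = \frac{1}{N} \sum_{i=1}^{N} y_i,
\qquad
\bar{y}_s = \frac{1}{n} \sum_{i \in s} y_i,

E(\bar{y}_s) = \sum_{s} p(s)\, \bar{y}_s = \mu
\quad \text{(design-unbiased).}
```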

The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE): Some results for SRS:
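The standard variance result for the sample mean under SRS is sketched below. The notation (\sigma^2 with divisor N - 1, and f = n/N for the sampling fraction) is an assumption, chosen to agree with the factor 1 - f used in the text and with the standard error computed in the R example for n = 100.

```latex
\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu)^2,
\qquad
f = \frac{n}{N},
\qquad
\mathrm{Var}(\bar{y}_s) = \frac{\sigma^2}{n}\,(1 - f).
```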

The finite population correction 1 - f, with f = n/N, is usually unimportant in social surveys: • n = 10,000 and N = 5,000,000: 1 - f = 0.998 • n = 1000 and N = 400,000: 1 - f = 0.9975 • n = 1000 and N = 5,000,000: 1 - f = 0.9998 • The effect of changing n is much more important than the effect of changing n/N

The estimated variance • Usually we report the standard error of the estimate: • Confidence intervals for μ are based on the Central Limit Theorem:
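A reconstruction of the missing formulas under the usual SRS conventions. The symbols s^2 (sample variance) and \hat{V} are assumptions; the expression for the standard error matches the R computation `sqrt(var(y[s])*(N-n)/(N*n))` used later in the text.

```latex
s^2 = \frac{1}{n-1} \sum_{i \in s} (y_i - \bar{y}_s)^2,
\qquad
\hat{V}(\bar{y}_s) = \frac{s^2}{n}\,(1 - f),
\qquad
SE(\bar{y}_s) = \sqrt{\hat{V}(\bar{y}_s)},

\text{approximate 95\% CI for } \mu: \quad \bar{y}_s \pm 1.96 \cdot SE(\bar{y}_s).
```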

Example – Student performance in California schools • Academic Performance Index (API) for all California schools • Based on standardized testing of students • Data from all schools with at least 100 students • Unit in population = school (Elementary/Middle/High) • Full population consists of N = 6194 observations • Concentrate on the variable: y = api00 = API in 2000 • Mean(y) = 664.7 with min(y) = 346 and max(y) = 969 • Data set in R: apipop and y = apipop$api00

Histogram of y population with fitted normal density

Histogram for sample mean and fitted normal density. y = api scores from 2000. Sample size n = 10, based on 10000 simulations. R-code:
> # y = apipop$api00, as defined above
> b = 10000                      # number of simulated samples
> N = 6194                       # population size
> n = 10                         # sample size
> ybar = numeric(b)
> for (k in 1:b){
+   s = sample(1:N, n)           # SRS without replacement
+   ybar[k] = mean(y[s])
+ }
> hist(ybar, seq(min(ybar)-5, max(ybar)+5, 5), prob=TRUE)
> x = seq(mean(ybar)-4*sqrt(var(ybar)), mean(ybar)+4*sqrt(var(ybar)), 0.05)
> z = dnorm(x, mean(ybar), sqrt(var(ybar)))   # fitted normal density
> lines(x, z)

Histogram and fitted normal density, api scores. Sample size n = 10, based on 10000 simulations

y = api00 for 6194 California schools. 10000 simulations of SRS. Confidence level of the approximate 95% CI
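The kind of simulation behind this slide can be sketched as follows. This is a minimal illustration, not the original code: it uses a synthetic, roughly normal population as a stand-in for `apipop$api00` (an assumption made so the block runs without extra packages), and the helper name `coverage` is hypothetical. It estimates how often the approximate 95% interval actually contains the population mean.

```r
# Synthetic stand-in for the API population (assumption: NOT the real apipop data)
set.seed(4600)
N <- 6194
y <- round(rnorm(N, mean = 665, sd = 128))
mu <- mean(y)                                   # true population mean

# Estimated coverage of the interval ybar +/- 1.96*SE over b simple random samples
coverage <- function(n, b = 2000) {
  hits <- logical(b)
  for (k in 1:b) {
    s <- sample(1:N, n)                         # SRS without replacement
    ybar <- mean(y[s])
    se <- sqrt(var(y[s]) / n * (1 - n / N))     # SE with finite population correction
    hits[k] <- abs(ybar - mu) <= 1.96 * se      # TRUE if the interval covers mu
  }
  mean(hits)
}

cov10 <- coverage(10)
cov100 <- coverage(100)
print(c(n10 = cov10, n100 = cov100))
```

With the real data one would replace the synthetic `y` by `apipop$api00`. Small samples typically give coverage somewhat below the nominal 95%, since 1.96 ignores the extra uncertainty in the estimated standard error.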

For one sample of size n = 100, R-code:
> s = sample(1:6194, 100)
> ybar = mean(y[s])
> se = sqrt(var(y[s])*(6194-100)/(6194*100))
> ybar
[1] 654.47
> var(y[s])
[1] 16179.28
> se
[1] 12.61668

The absolute value of the sampling error is not informative when not related to the value of the estimate. For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3. The coefficient of variation for the estimate: • A measure of the relative variability of an estimate • It does not depend on the unit of measurement • More stable over repeated surveys, can be used for planning, for example determining sample size • More meaningful when estimating proportions
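The coefficient of variation is conventionally defined as the standard error relative to the estimate. The notation \hat{\theta} for a generic estimate is an assumption; the numerical illustration reuses the n = 100 sample reported above.

```latex
CV(\hat{\theta}) = \frac{SE(\hat{\theta})}{\hat{\theta}},
\qquad
\text{e.g. } CV(\bar{y}_s) = \frac{12.62}{654.47} \approx 0.019 \;\; (\text{about } 1.9\%).
```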

Estimation of a population proportion p with a certain characteristic A • p = (number of units in the population with A)/N • Let y_i = 1 if unit i has characteristic A, 0 otherwise. Then p is the population mean of the y_i’s • Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as

So the unbiased estimate of the variance of the estimator:
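A reconstruction of the two missing formulas, derived from the SRS variance estimator for a mean applied to 0/1 values. The symbol \hat{p} is an assumption. The middle step uses y_i^2 = y_i for indicator variables.

```latex
\hat{p} = \bar{y}_s = \frac{X}{n},
\qquad
s^2 = \frac{1}{n-1} \sum_{i \in s} (y_i - \hat{p})^2
    = \frac{n}{n-1}\, \hat{p}\,(1 - \hat{p}),

\hat{V}(\hat{p}) = \frac{s^2}{n}\,(1 - f)
                 = \frac{\hat{p}\,(1 - \hat{p})}{n-1}\,\Bigl(1 - \frac{n}{N}\Bigr).
```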

Examples A political poll: Suppose we have a random sample of 1000 eligible voters in Norway with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by: Confidence interval requires normal approximation. Can use the guideline from binomial distribution, when N-n is large:

In this example: n = 1000 and N = 4,000,000 • Ex: Psychiatric Morbidity Survey 1993 from Great Britain: p = proportion with psychiatric problems, n = 9792 (partial nonresponse on this question: 316), N ≈ 40,000,000
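For the political poll (280 of 1000 sampled voters, N = 4,000,000), the estimate, standard error and approximate 95% confidence interval can be computed as in the sketch below. It assumes the usual SRS variance estimator for a proportion, p(1-p)/(n-1) times the finite population correction; the variable names are illustrative.

```r
n <- 1000        # sample size
N <- 4000000     # population size (eligible voters)
x <- 280         # sampled voters saying they will vote Labor

p_hat <- x / n                                             # estimated proportion
se <- sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - n / N))    # estimated standard error
ci <- p_hat + c(-1, 1) * 1.96 * se                         # approximate 95% CI

print(round(c(p_hat = p_hat, se = se, lower = ci[1], upper = ci[2]), 4))
```

The estimate is 0.28 with a standard error of about 0.014, so the interval runs from roughly 0.25 to 0.31. The finite population correction is 0.99975 here and has no practical effect.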

General probability sampling • Sampling design: p(s) - known probability of selection for each subset s of the population U • Actually: The sampling design is the probability distribution p(.) over all subsets of U • Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0 • The inclusion probability:
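The inclusion probability of a unit is the total probability of all samples that contain it. The symbol \pi_i is the conventional notation and is assumed here.

```latex
\pi_i = P(i \in s) = \sum_{s \,:\, i \in s} p(s), \qquad i = 1, \dots, N,
\qquad
\text{and for SRS of size } n: \quad \pi_i = \frac{n}{N}.
```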

Illustration U = {1,2,3,4} Sample of size 2; 6 possible samples Sampling design: p({1,2}) = ½, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8 The inclusion probabilities:
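The inclusion probabilities for this design follow by summing p(s) over the samples containing each unit. A small sketch, with illustrative variable names (`samples`, `p`, `incl`):

```r
# The four samples with positive probability and their design probabilities
samples <- list(c(1, 2), c(2, 3), c(3, 4), c(1, 4))
p <- c(1/2, 1/4, 1/8, 1/8)

# pi_i = sum of p(s) over all samples s that contain unit i
incl <- sapply(1:4, function(i) {
  contains_i <- sapply(samples, function(s) i %in% s)
  sum(p[contains_i])
})
names(incl) <- paste0("pi_", 1:4)

print(incl)
print(sum(incl))   # equals the fixed sample size
```

This gives 5/8, 3/4, 3/8 and 1/4 for units 1 to 4. They sum to 2, the sample size, as they must for any design with fixed sample size.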

Some results

Estimation theory probability sampling in general Problem: Estimate a population quantity for the variable y For the sake of illustration: The population total
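One standard design-unbiased estimator of the total, valid when every inclusion probability is positive, is the Horvitz-Thompson estimator: the sum over the sample of y_i divided by its inclusion probability. Whether this is the estimator shown on the original slide is an assumption. The sketch below checks unbiasedness by enumerating a small design on U = {1,2,3,4}; the y-values are hypothetical.

```r
# Design on U = {1,2,3,4}: four samples of size 2 with positive probability
samples <- list(c(1, 2), c(2, 3), c(3, 4), c(1, 4))
p <- c(1/2, 1/4, 1/8, 1/8)
y <- c(10, 20, 30, 40)            # hypothetical population values
t_true <- sum(y)                  # population total

# Inclusion probabilities pi_i
incl <- sapply(1:4, function(i) sum(p[sapply(samples, function(s) i %in% s)]))

# Horvitz-Thompson estimate for each possible sample
ht <- sapply(samples, function(s) sum(y[s] / incl[s]))

# Design expectation: average of the estimates weighted by p(s)
expect_ht <- sum(p * ht)
print(rbind(estimate = ht, prob = p))
print(c(expectation = expect_ht, total = t_true))
```

The individual estimates differ from sample to sample, but their p(s)-weighted average equals the true total, which is what design-unbiasedness means.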

CV is a useful measure of uncertainty, especially when standard error increases as the estimate increases Because, typically we have that

Some peculiarities in the estimation theory Example: N=3, n=2, simple random sample

For this set of values of the yi’s:

Let y be the population vector of the y-values. This example shows that is not uniformly best (minimum variance for all y) among linear design-unbiased estimators. The example shows that the ”usual” basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models. In fact, we have the following much stronger result: Theorem: Let p(.) be any sampling design. Assume each y_i can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t

Proof: This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible

Determining sample size • The sample size has a decisive effect on the cost of the survey • How large n should be depends on the purpose for doing the survey • In a poll for determining voting preference, n = 1000 is typically enough • In the quarterly labor force survey in Norway, n = 24000 Mainly three factors to consider: • Desired accuracy of the estimates for many variables. Focus on one or two variables of primary interest • Homogeneity of the population. Needs smaller samples if little variation in the population • Estimation for subgroups, domains, of the population

It is often factor 3 that puts the highest demand on the survey • If we want to estimate totals for domains of the population we should take a stratified sample • A sample from each domain • A stratified random sample: From each domain a simple random sample

Assume the problem is to estimate a population proportion p for a certain stratum, and we use the sample proportion from the stratum to estimate p Let n be the sample size of this stratum, and assume that n/N is negligible Desired accuracy for this stratum: 95% CI for p should be The accuracy requirement:
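The target interval itself is missing from the text. The sample sizes quoted for this example (384, and 1536 p0(1 - p0)) are consistent with a half-width of 0.05, so the sketch below assumes that the requirement is an interval of the form \hat{p} \pm 0.05. With n/N negligible, the finite population correction is dropped.

```latex
1.96 \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}} \le 0.05
\;\;\Longleftrightarrow\;\;
n \ge \frac{1.96^2}{0.05^2}\, \hat{p}\,(1 - \hat{p}) \approx 1536\, \hat{p}\,(1 - \hat{p}),

\text{and since } \hat{p}\,(1 - \hat{p}) \le \tfrac{1}{4}: \quad
n \approx \frac{1536}{4} = 384 \text{ is sufficient for any } \hat{p}.
```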

The estimate is unknown in the planning phase • Use the conservative size 384 or a planning value p0 with n = 1536 p0(1 - p0) • For example, with p0 = 0.2: n = 246 • In general with accuracy requirement d, 95% CI
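For a general half-width d, the same argument gives n = 1.96^2 p0(1 - p0)/d^2 when n/N is negligible. The helper below is a hypothetical illustration (the name `n_required` and its arguments are not from the source). It returns the unrounded value; the figures 384 and 246 in the text correspond to d = 0.05 with rounding to the nearest integer, whereas rounding up would be the cautious choice.

```r
# Required sample size for a 95% CI of half-width d around a proportion,
# ignoring the finite population correction (n/N assumed negligible)
n_required <- function(d, p0 = 0.5, z = 1.96) {
  z^2 * p0 * (1 - p0) / d^2
}

n_conservative <- n_required(0.05)         # worst case p0 = 0.5
n_planning <- n_required(0.05, p0 = 0.2)   # planning value p0 = 0.2

print(round(c(conservative = n_conservative, planning = n_planning)))
```

Halving d quadruples the required sample size, which is why the accuracy demanded for small domains tends to drive the total cost of a survey.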
