Applied Sampling [ Notes based on Graham Kalton’s Sage Publication and Prof. Jim Lepkowski’s Lecture Notes ]

Partha Lahiri, Joint Program in Survey Methodology, University of Maryland, College Park

1. Course Overview • Design perspective • Historical perspective • Population perspective

A. Design Perspective • Experiments: control (C) or use of randomization (R) for disturbing variables D • Quasi-experimental observational studies • Survey samples

B. Historical Perspective • Late 19th century: • Complete enumeration (census) • Monography (purposive selection) • Kiaer (1895) proposed the representative method • Eventually known as the sample survey method

Mathematics and Sampling • Bowley (1906) proposed equal chance selection through randomization • Neyman (1934) eventually provided a complete theory for inference • Two general strategies evolved: • Purposive selection (the representative method) • Probability sampling (chance or randomization theory method)

Elements of Sample Surveys • Randomization inference • Representativeness • Finite populations • Large samples • Chance selection: equal/epsem • Stratification to improve precision and administrative control • Clustering

C. Population Perspective • Target or ideal population • Survey population • Sampling frame
[Figure: nested boxes labeled Target, Survey, and Frame; a dashed portion of the frame extends outside the target box]

Survey Sampling Design Typology

Population • N elements labeled i = 1, 2, ... , N • Characteristic denoted Y1, Y2, Y3, ... , YN

Sample • Sample n elements from N, labeled i = 1, 2, ..., n • Values denoted as y1, y2, ..., yn

Sampling Distributions: Population Elements
[Figure: distribution of population elements; horizontal axis: Income ($1,000's), 5 to 65]

Sampling Distributions: Sample Means
[Figure: distribution of sample means; horizontal axis: Income ($1,000's), 5 to 65]
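The contrast between the distribution of population elements and the distribution of sample means can be sketched with a small simulation. The income population below is synthetic (a hypothetical right-skewed set of values, not the data behind the slides), and the function name `sample_means` is illustrative only.

```python
import random
import statistics

def sample_means(population, n, draws, seed=0):
    """Means of `draws` independent SRS (without replacement) samples of size n."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(draws)]

# Hypothetical population: 1,000 incomes in $1,000's, right-skewed.
pop_rng = random.Random(1)
population = [5 + pop_rng.expovariate(1 / 15) for _ in range(1000)]

means = sample_means(population, n=25, draws=2000)

# The sample means centre on the population mean and vary much less
# than the individual population elements do.
print(statistics.mean(population), statistics.mean(means))
print(statistics.pstdev(population), statistics.pstdev(means))
```

Increasing `n` in this sketch narrows the spread of the sample means further, which is the behaviour the two plots are meant to contrast.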

2. Simple Random Sampling • Implementation • Inference • Sample size determination

A. Implementation • For a population of N, select a sample of n • Random selection using mechanical device • Employ a table of random numbers, using labels i to identify selected units • Not haphazard • Base procedure • SRS – without replacement • SRS seldom used for selection, even from a simple frame • All samples of n distinct elements equally likely
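A minimal sketch of the selection step, assuming the frame is simply the labels 1, ..., N and using Python's pseudo-random generator in place of a mechanical device or a table of random numbers. The function name is hypothetical.

```python
import random

def srs_without_replacement(N, n, seed=None):
    """Return the sorted labels (1..N) of an SRS of n distinct elements."""
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    rng = random.Random(seed)
    # random.sample draws without replacement, so every set of n
    # distinct labels is equally likely to be selected.
    return sorted(rng.sample(range(1, N + 1), n))

labels = srs_without_replacement(N=500, n=20, seed=42)
print(labels)
```

Fixing `seed` makes the selection reproducible, which helps when a sample has to be documented or audited.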

Sample Estimates

Sampling Variance

Population Total • Inflate the sample total to the population:

Proportions • For Yi = 1 or 0:
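The estimators named on these slides can be sketched as follows, assuming the standard SRS without-replacement forms: sample mean ybar, element variance s2 with an n - 1 divisor, var(ybar) = (1 - f) s2 / n with f = n/N, estimated total N * ybar, and for a 0/1 variable var(p) = (1 - f) p (1 - p) / (n - 1). The function names and the small data sets are hypothetical.

```python
from statistics import mean, variance

def srs_estimates(y, N):
    """SRS (without replacement) estimates of the mean and total."""
    n = len(y)
    f = n / N                      # sampling fraction
    ybar = mean(y)                 # sample mean
    s2 = variance(y)               # element variance, n - 1 divisor
    var_mean = (1 - f) * s2 / n    # sampling variance of the mean
    return {
        "mean": ybar,
        "s2": s2,
        "var_mean": var_mean,
        "total": N * ybar,                 # inflate the mean to the population
        "var_total": N ** 2 * var_mean,
    }

def srs_proportion(y01, N):
    """Proportion and its estimated variance for a 0/1 characteristic."""
    n = len(y01)
    f = n / N
    p = sum(y01) / n
    var_p = (1 - f) * p * (1 - p) / (n - 1)
    return p, var_p

est = srs_estimates([12, 15, 9, 20, 14], N=100)          # hypothetical data
p, var_p = srs_proportion([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], N=200)
print(est, p, var_p)
```

For 0/1 data the two functions agree, because s2 = n p (1 - p) / (n - 1) in that case.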

B. Inference • From a single sample of size n, estimate variability of sample mean across all possible samples of size n: • Standard Error Estimate: • (1 - α) × 100% confidence interval:
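A minimal sketch of the interval, assuming a large sample so that the normal quantile can be used (the slides may intend a t quantile instead). The estimate and standard error passed in are hypothetical numbers.

```python
from statistics import NormalDist

def confidence_interval(estimate, se, alpha=0.05):
    """(1 - alpha) x 100% interval: estimate +/- z * se, normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se

lower, upper = confidence_interval(14.0, 2.0)   # hypothetical mean and SE
print(lower, upper)
```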

C. Sample Size • Sampling error depends very little on the sampling fraction f = n/N (unless f is greater than 0.05) • Same precision for n = 1,000 in College Park or the People's Republic of China • Sample size determination in SRS • Specify desired level of precision • Obtain value for element variance • Solve for n

Computation • Desired precision: • Solving for n: • Ignoring the fpc: • Use a guess value for the element variance
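The computation can be sketched as follows, assuming the desired precision is stated as a target standard error v for the mean and that S2 is the guessed element variance. Setting v² = (1 - n/N) S2 / n and solving gives n = n0 / (1 + n0/N) with n0 = S2 / v²; ignoring the fpc gives n = n0. The function name and the inputs are hypothetical.

```python
import math

def srs_sample_size(S2, se_target, N=None):
    """Sample size for a target standard error of the mean under SRS.

    With N=None the fpc is ignored and n0 = S2 / se_target**2 is returned.
    """
    n0 = S2 / se_target ** 2
    if N is None:
        return n0
    return n0 / (1 + n0 / N)

# Hypothetical inputs: a proportion near 0.5 (S2 about 0.25), target SE of 0.02.
n_no_fpc = srs_sample_size(0.25, 0.02)
n_with_fpc = srs_sample_size(0.25, 0.02, N=5000)
print(n_no_fpc, n_with_fpc, math.ceil(n_with_fpc))
```

In practice the result is rounded up to the next whole element; the function returns the unrounded value so that the effect of the fpc is visible.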

Illustration: Relative Precision • The desired level of precision can be specified by relative precision: • For example, suppose for P = 0.10 • Obtain sample size as before.
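A minimal sketch of the relative-precision calculation for a proportion, assuming relative precision means the coefficient of variation CV(p) = SE(p) / P and ignoring the fpc, so that n = (1 - P) / (P × CV²). The target CV of 0.10 used below is a hypothetical value chosen for illustration, not a figure taken from the slides.

```python
def sample_size_for_cv(P, cv_target):
    """SRS sample size (fpc ignored) giving CV(p) = cv_target for proportion P."""
    # SE(p)^2 = P(1 - P)/n and CV = SE/P, so n = (1 - P) / (P * CV^2).
    return (1 - P) / (P * cv_target ** 2)

n_needed = sample_size_for_cv(P=0.10, cv_target=0.10)   # hypothetical CV target
print(n_needed)
```

Because the target is relative to P, rarer characteristics require larger samples for the same CV.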

3. Cluster Sampling • Clusters and survey costs • Variance of the mean • Design effect • Subsampling clusters • roh and sample size • Homogeneity and roh • Portability of roh • Subsample size

A. Clusters and survey cost • Populations often distributed geographically • Cannot afford to create an element frame • Cannot afford to visit n units drawn randomly from the entire area • Cluster selections are used to reduce costs • Select clusters and list elements only for selected clusters • Clusters are usually naturally occurring units • Seldom of equal size

B. Variance of the mean

Clustered population

Notation

Element sample

Cluster sample

Computing formulas

C. Design effect

Features of roh • -1/(B - 1) ≤ roh ≤ 1 (although generally roh > 0) • If roh = 0, deff = 1 and the cluster sample is the equivalent of an SRS of size n = aB • If roh = 1, deff = B and the cluster sample is equivalent to an SRS of a elements

Source and nature of roh • For human populations in naturally occurring clusters, factors include • Environment (exposure to infectious disease) • Self-selection (poor households in same block) • Interaction (shared attitudes among neighbors) • The size of roh depends on • Characteristic Y (disease status, age) • Nature of clusters (naturally formed) • Size of cluster (blocks, census tracts, counties)

Estimation of roh • Proper estimation, especially for multistage stratified samples, is cumbersome • It is useful for design to estimate roh in a straightforward fashion • Synthetic estimate
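A minimal sketch of a synthetic estimate, assuming it is obtained by inverting the design-effect relation deff = 1 + (b - 1) roh, where b is the (average) number of sample elements per cluster. The function name is hypothetical.

```python
def synthetic_roh(deff, b):
    """Back out roh from an estimated design effect and cluster sample size b."""
    if b <= 1:
        raise ValueError("b must exceed 1 for roh to be identified")
    return (deff - 1) / (b - 1)

roh_hat = synthetic_roh(deff=1.57, b=20)
print(roh_hat)
```

Here deff would itself be estimated as the ratio of the cluster-sample variance to the variance of an SRS of the same size.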

Sample size considerations • Implications of roh > 0 for sample size: • Compute an SRS sample size • For B and roh (“guesstimate?”), compute the design effect and inflate the SRS sample size by it • Design effect and confidence intervals: • With (a - 1) d.f. • or
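The sample size steps can be sketched as follows, assuming complete clusters of equal size B, the relation deff = 1 + (B - 1) roh, and a cluster sample size equal to the SRS size multiplied by deff. The inputs are hypothetical guesses of the kind the slide calls a "guesstimate".

```python
import math

def cluster_sample_size(n_srs, B, roh):
    """Inflate an SRS sample size by the design effect for clusters of size B."""
    deff = 1 + (B - 1) * roh
    n_cluster = n_srs * deff
    a = math.ceil(n_cluster / B)      # number of clusters to select
    return deff, n_cluster, a

deff, n_cluster, a = cluster_sample_size(n_srs=400, B=10, roh=0.05)
print(deff, n_cluster, a)
```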

D. Subsampling clusters • Select b < B elements from each of a selected clusters • epsem: f = (a/A)(b/B), where a/A is the first-stage sampling rate, b/B is the second-stage sampling rate, and f is the overall sampling rate • SRS of clusters and SRS of elements within each cluster • This two-stage cluster sampling design is epsem • An unbiased estimator of the population mean: the simple mean of the n = ab sampled elements

Variability of the sample mean
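A minimal sketch of the two-stage selection, assuming A clusters of equal size B, an SRS of a clusters, and an SRS of b elements within each selected cluster. The population values and function names are hypothetical.

```python
import random
from statistics import mean

def two_stage_sample(clusters, a, b, seed=None):
    """SRS of a clusters, then SRS of b elements within each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(clusters)), a)          # first stage
    return {c: rng.sample(clusters[c], b) for c in chosen}  # second stage

def overall_sampling_rate(a, A, b, B):
    """Overall rate f = (a/A)(b/B); the same for every element, hence epsem."""
    return (a / A) * (b / B)

# Hypothetical population: A = 40 clusters, each with B = 12 elements.
clusters = [[c + 0.1 * j for j in range(12)] for c in range(40)]
sample = two_stage_sample(clusters, a=8, b=3, seed=7)

f = overall_sampling_rate(a=8, A=40, b=3, B=12)
ybar = mean(y for values in sample.values() for y in values)
print(f, ybar)
```

Because every element has the same selection probability f, the unweighted mean of the sampled elements estimates the population mean.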

Two-stage cluster sample

Estimation of variance

Approximation • If a/A is negligible, then the variance estimate reduces to its first-stage (between-cluster) term • This approximation avoids the need to compute the within-cluster component • There is an alternative conceptual explanation for using a reduced or simplified variance estimate ...

E. roh and sample size • Effective sample size: size of SRS that would give the same precision as the cluster sample • neff = n/deff • If deff = 2.0 and n = 1500, then neff = 1500/2 = 750 • Varying subsample size for fixed n changes the number of clusters, deff, and the effective sample size

Effective sample size • Cluster sample with n = 2000 and roh = 0.03 • For b = 10, deff = 1 + (10 - 1)(0.03) = 1.27 • neff = 2000/1.27 ≈ 1575 • For b = 20, deff = 1 + (20 - 1)(0.03) = 1.57 • neff = 2000/1.57 ≈ 1274 • For b = 50, deff = 1 + (50 - 1)(0.03) = 2.47 • neff = 2000/2.47 ≈ 810
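The effective sample size calculation can be sketched directly from the relations deff = 1 + (b - 1) roh and neff = n / deff given on these slides. The function names are illustrative.

```python
def design_effect(b, roh):
    """deff = 1 + (b - 1) * roh for subsamples of b elements per cluster."""
    return 1 + (b - 1) * roh

def effective_sample_size(n, b, roh):
    """Size of the SRS with the same precision as the cluster sample."""
    return n / design_effect(b, roh)

# Fixed n = 2000 and roh = 0.03, varying the subsample size b.
results = {b: (design_effect(b, 0.03), effective_sample_size(2000, b, 0.03))
           for b in (10, 20, 50)}
for b, (deff, neff) in results.items():
    print(b, round(deff, 2), round(neff))
```

For fixed n, a larger b means fewer clusters, a larger deff, and a smaller effective sample size.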

F. Homogeneity and roh • Sample 10 school classrooms from 1,000 • Each classroom has exactly 24 children • Alternative 1: sample characteristic is dichotomy • Here the intra-class correlation roh = 0.088 • Moderate amount of homogeneity within clusters • Actual sample size n = 240 • Effective sample size: deff = 1 + (24 - 1)(0.088) = 3.024, so neff = 240/3.024 ≈ 79

Nearly Perfect homogeneity • Homogeneity within, heterogeneity among:

Perfect heterogeneity • Heterogeneity within, homogeneity among: • = undefined (total population)

G. Portability of roh • Estimation • Design

Illustration • Crime victimization survey in a large public housing project • Sample apartments • Responsible adult interviewed about victimizations occurring to any member of the household • A = 400 floors, each with exactly B = 15 apartments • a = 10 floors selected by SRS • b = 5 apartments selected by SRS from each selected floor

Sample results
Sample floor    HH’s “touched” by crime
     1                    4
     2                    4
     3                    5
     4                    5
     5                    3
     6                    1
     7                    0
     8                    1
     9                    2
    10                    1

Estimation
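A minimal sketch of the estimation for the housing-project illustration, under stated assumptions: the proportion is the simple sample proportion, its variance uses the simplified between-floor estimate with the first-stage rate a/A = 10/400 treated as negligible, the SRS comparison uses p(1 - p)/(n - 1), and roh is the synthetic estimate (deff - 1)/(b - 1). These choices are assumptions made for illustration and may differ from the formulas on the original slide.

```python
from statistics import variance

A, B = 400, 15                               # floors, apartments per floor
a, b = 10, 5                                 # sampled floors, apartments per sampled floor
touched = [4, 4, 5, 5, 3, 1, 0, 1, 2, 1]     # HH's "touched" by crime, by sample floor

n = a * b
p = sum(touched) / n                         # overall sample proportion

# Between-floor variance of the floor-level proportions (a - 1 divisor).
floor_props = [t / b for t in touched]
s2_a = variance(floor_props)

var_cluster = s2_a / a                       # simplified estimate, a/A treated as negligible
var_srs = p * (1 - p) / (n - 1)              # SRS of the same size, fpc ignored

deff = var_cluster / var_srs
roh = (deff - 1) / (b - 1)                   # synthetic estimate of roh
neff = n / deff

print(p, var_cluster ** 0.5, deff, roh, neff)
```

Under these assumptions the design effect is well above 1, reflecting how strongly victimization clusters by floor in the sample data.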
