Applied Sampling [Notes based on Graham Kalton’s Sage Publication and Prof. Jim Lepkowski’s Lecture Notes] Partha Lahiri, Joint Program in Survey Methodology, University of Maryland, College Park
1. Course Overview • Design perspective • Historical perspective • Population perspective
A. Design Perspective • Experiments: control (C) or use of randomization (R) for disturbing variables D • Quasi-experimental observational studies • Survey samples
B. Historical Perspective • Late 19th century: • Complete enumeration (census) • Monography (purposive selection) • Kiaer (1895) proposed the representative method • Eventually known as the sample survey method
Mathematics and Sampling • Bowley (1906) proposed equal chance selection through randomization • Neyman (1934) eventually provided a complete theory for inference • Two general strategies evolved: • Purposive selection (the representative method) • Probability sampling (chance or randomization theory method)
Elements of Sample Surveys • Randomization inference • Representativeness • Finite populations • Large samples • Chance selection: equal/epsem • Stratification to improve precision and administrative control • Clustering
C. Population Perspective • Target or ideal population • Survey population • Sampling frame
[Figure: nested boxes labeled Target, Survey, and Frame]
Population • N elements labeled i = 1, 2, ... , N • Characteristic denoted Y1, Y2, Y3, ... , YN
Sample • Sample n elements from N, i = 1, 2, ... , n • Values denoted as y1, y2, ... , yn
Sampling Distributions: Population Elements [Figure: distribution of population elements; horizontal axis: Income ($1,000's), 5 to 65]
Sampling Distributions: Sample Means [Figure: distribution of sample means; horizontal axis: Income ($1,000's), 5 to 65]
2. Simple Random Sampling • Implementation • Inference • Sample size determination
A. Implementation • For a population of N, select a sample of n • Random selection using mechanical device • Employ a table of random numbers, using labels i to identify selected units • Not haphazard • Base procedure • SRS – without replacement • SRS seldom used for selection, even from a simple frame • All samples of n distinct elements equally likely
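The selection rule above (every set of n distinct labels equally likely) can be sketched in a few lines. This is a minimal illustration, not part of the original notes: the function name `srs_without_replacement` and the seed are assumptions, and Python's pseudo-random generator stands in for the table of random numbers.

```python
import random


def srs_without_replacement(N, n, seed=None):
    """Return n distinct labels drawn from 1..N.

    random.sample draws without replacement, so every subset of
    n distinct labels has the same chance of selection.
    """
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    rng = random.Random(seed)  # seeded only so the draw can be reproduced
    return sorted(rng.sample(range(1, N + 1), n))


# Hypothetical frame of N = 1000 labeled units, sample of n = 10.
sample_labels = srs_without_replacement(N=1000, n=10, seed=12345)
print(sample_labels)
```

Fixing the seed makes the selection auditable; a haphazard pick by hand has no such record.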
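Population Total • Inflate the sample total to the population: $\hat{Y} = \frac{N}{n} \sum_{i=1}^{n} y_i = N \bar{y}$
Proportions • For Yi = 1 or 0: $p = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$, with element variance $s^2 = \frac{n}{n-1}\, p(1-p)$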
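B. Inference • From a single sample of size n, estimate variability of sample mean across all possible samples of size n: $v(\bar{y}) = (1 - f)\, s^2 / n$, where $f = n/N$ and $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ • Standard Error Estimate: $se(\bar{y}) = \sqrt{v(\bar{y})}$ • (1 - α) × 100% confidence interval: $\bar{y} \pm t_{1-\alpha/2}\, se(\bar{y})$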
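A short sketch of the SRS estimation steps, assuming the usual without-replacement estimators (sample mean, total N times the mean, variance (1 - f) s²/n). The function name `srs_estimates`, the data values, and the use of the normal multiplier 1.96 in place of a t value are all assumptions made for illustration.

```python
import math
from statistics import mean, variance


def srs_estimates(y, N, z=1.96):
    """Mean, total, standard error, and approximate 95% CI under SRS."""
    n = len(y)
    f = n / N                         # sampling fraction
    ybar = mean(y)                    # sample mean
    s2 = variance(y)                  # element variance, divisor n - 1
    se = math.sqrt((1 - f) * s2 / n)  # standard error with the fpc
    return {
        "mean": ybar,
        "total": N * ybar,            # sample mean inflated to the population
        "se_mean": se,
        "ci": (ybar - z * se, ybar + z * se),
    }


# Hypothetical data: n = 5 values sampled from a population of N = 100.
est = srs_estimates([12.0, 15.0, 9.0, 20.0, 14.0], N=100)
print(est)
```

For a 0/1 variable the same function returns the sample proportion as the mean, since `variance` of 0/1 values equals n/(n - 1) × p(1 - p).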
C. Sample Size • Sampling error depends very little on the sampling fraction f = n/N (unless f is greater than 0.05) • Same precision for n = 1,000 in College Park or the People's Republic of China • Sample size determination in SRS • Specify desired level of precision • Obtain value for element variance • Solve for n
Computation • Desired precision: $V = [se(\bar{y})]^2 = (1 - n/N)\, S^2 / n$ • Solving for n: $n = \frac{S^2}{V + S^2/N}$ • Ignoring the fpc: $n' = S^2 / V$ • Use a guess value for $S^2$
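The computation can be sketched as follows, assuming precision is specified as a target variance V of the mean and that a guessed element variance S² is available. The function name `srs_sample_size` and the numeric inputs are illustrative assumptions.

```python
import math


def srs_sample_size(S2, V, N=None):
    """Sample size giving variance V for the mean under SRS.

    Without N the fpc is ignored: n' = S2 / V.
    With N the fpc is applied: n = n' / (1 + n' / N),
    which is algebraically the same as S2 / (V + S2 / N).
    """
    n0 = S2 / V
    n = n0 if N is None else n0 / (1 + n0 / N)
    # Small tolerance so floating-point noise does not add a unit.
    return math.ceil(n - 1e-9)


# Hypothetical target: se = 0.03 for a proportion guessed near 0.5,
# so S2 is taken as roughly 0.5 * 0.5 = 0.25.
n_no_fpc = srs_sample_size(S2=0.25, V=0.03 ** 2)
n_with_fpc = srs_sample_size(S2=0.25, V=0.03 ** 2, N=5000)
print(n_no_fpc, n_with_fpc)
```

The gap between the two results shrinks as N grows, which is why the fpc is often ignored for large populations.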
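Illustration: Relative Precision • The desired level of precision can be specified by relative precision, the standard error divided by the quantity being estimated: $CV(p) = se(p)/P$ • For example, suppose a target relative precision is specified for P = 0.10 • Obtain sample size as before.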
3. Cluster Sampling • Clusters and survey costs • Variance of the mean • Design effect • Subsampling clusters • roh and sample size • Homogeneity and roh • Portability of roh • Subsample size
A. Clusters and survey cost • Populations often distributed geographically • Cannot afford to create an element frame • Cannot afford to visit n units drawn randomly from the entire area • Cluster selections are used to reduce costs • Select clusters and list elements only for selected clusters • Clusters are naturally occurring units: • Seldom of equal size
Features of roh • $-\frac{1}{B-1} \le roh \le 1$ (although generally roh > 0) • If roh = 0, Deff = 1: the cluster sample is the equivalent of SRS of size n = aB • If roh = 1, Deff = B and • The cluster sample is equivalent to an SRS of a elements
Source and nature of roh • For human populations in naturally occurring clusters, factors include • Environment (exposure to infectious disease) • Self-selection (poor households in same block) • Interaction (shared attitudes among neighbors) • The size of roh depends on • Characteristic Y (disease status, age) • Nature of clusters (naturally formed) • Size of cluster (blocks, census tracts, counties)
Estimation of roh • Proper estimation, especially for multistage stratified samples, is cumbersome • It is useful for design to estimate roh in a straightforward fashion • Synthetic estimate: $roh = \frac{deff - 1}{B - 1}$
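Sample size considerations • Implications of roh > 0 for sample size: • Compute an SRS sample size $n_{srs}$ • For B and roh (“guesstimate?”), compute $deff = 1 + (B - 1)\, roh$ and $n = n_{srs} \times deff$ • Design effect and confidence intervals: $\bar{y} \pm t_{1-\alpha/2}\, se(\bar{y})$ • With (a - 1) d.f. • or $\bar{y} \pm t_{1-\alpha/2} \sqrt{deff}\; se_{srs}(\bar{y})$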
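A minimal sketch of the sample size adjustment, assuming the design effect takes the form deff = 1 + (B - 1) roh for complete clusters of size B. The function name `cluster_sample_size` and the input values are hypothetical.

```python
import math


def cluster_sample_size(n_srs, B, roh):
    """Inflate an SRS sample size by the design effect.

    Returns (deff, number of elements n, number of clusters a).
    """
    deff = 1 + (B - 1) * roh
    # Tolerance guards against floating-point noise before rounding up.
    n = math.ceil(n_srs * deff - 1e-9)
    a = math.ceil(n / B)  # whole clusters of size B needed
    return deff, n, a


# Hypothetical inputs: SRS size 400, clusters of 20, guessed roh = 0.05.
deff, n, a = cluster_sample_size(n_srs=400, B=20, roh=0.05)
print(round(deff, 2), n, a)
```

Because whole clusters are taken, the achieved size a × B can exceed n slightly.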
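D. Subsampling clusters • Select b < B elements from each of a clusters • epsem: $a/A$ = first-stage sampling rate, $b/B$ = second-stage sampling rate, $f = \frac{a}{A} \cdot \frac{b}{B} = \frac{n}{N}$ = overall sampling rate • SRS of clusters and SRS of elements within cluster • This two-stage cluster sampling design is epsem • An unbiased estimator of population mean: $\bar{y} = \frac{1}{ab} \sum_{\alpha=1}^{a} \sum_{\beta=1}^{b} y_{\alpha\beta}$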
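The two-stage selection can be sketched as below, assuming equal-sized clusters, SRS of a clusters out of A, and SRS of b elements out of B within each selected cluster. The function name `two_stage_epsem`, the labels, and the numeric values are illustrative assumptions.

```python
import random


def two_stage_epsem(A, B, a, b, seed=None):
    """Select a of A clusters, then b of B elements in each.

    Returns ({cluster label: [element labels]}, overall sampling rate).
    Every element has the same selection probability (a/A) * (b/B).
    """
    rng = random.Random(seed)
    clusters = sorted(rng.sample(range(1, A + 1), a))       # first stage
    selection = {
        c: sorted(rng.sample(range(1, B + 1), b))           # second stage
        for c in clusters
    }
    f = (a / A) * (b / B)
    return selection, f


# Illustrative values: 400 clusters of 15 elements, 10 clusters, 5 per cluster.
selection, f = two_stage_epsem(A=400, B=15, a=10, b=5, seed=2024)
print(selection)
print(f)
```

With equal cluster sizes the overall rate equals n/N = ab/(AB), so the unweighted sample mean needs no weighting.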
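Approximation • If a/A is negligible, then $v(\bar{y}) = \left(1 - \frac{a}{A}\right) \frac{s_a^2}{a} + \frac{a}{A} \left(1 - \frac{b}{B}\right) \frac{s_b^2}{ab}$ reduces to $v(\bar{y}) \approx \frac{s_a^2}{a}$, where $s_a^2$ is the variance among the a sample cluster means • This approximation avoids the need to compute the within-cluster variance $s_b^2$ • There is an alternative conceptual explanation for using a reduced or simplified variance estimate ...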
E. roh and sample size • Effective sample size: size of SRS that would give the same precision as the cluster sample • neff = n/deff • If deff = 2.0 and n = 1500, then neff = 1500/2 = 750 • Varying subsample size for fixed n changes the number of clusters, deff, and the effective sample size
Effective sample size • Cluster sample with n = 2000 and roh = 0.03 • For b = 10, deff = 1 + (10 - 1)(0.03) = 1.27 • neff = 2000/1.27 ≈ 1575 • For b = 20, deff = 1 + (20 - 1)(0.03) = 1.57 • neff = 2000/1.57 ≈ 1274 • For b = 50, deff = 1 + (50 - 1)(0.03) = 2.47 • neff = 2000/2.47 ≈ 810
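The arithmetic on this slide can be reproduced with a short sketch using deff = 1 + (b - 1) roh and neff = n/deff. The function name `effective_sample_size` is an assumption; n, roh, and the subsample sizes are the ones given above.

```python
def effective_sample_size(n, b, roh):
    """Return (deff, neff, number of clusters) for subsample size b."""
    deff = 1 + (b - 1) * roh
    neff = n / deff
    a = n / b  # clusters needed to reach n elements with b per cluster
    return deff, neff, a


n, roh = 2000, 0.03
results = {b: effective_sample_size(n, b, roh) for b in (10, 20, 50)}
for b, (deff, neff, a) in results.items():
    print(f"b={b:2d}  clusters={a:5.0f}  deff={deff:.2f}  neff={neff:.0f}")
```

Larger subsamples mean fewer clusters to visit but a smaller effective sample size, which is the cost and precision trade-off behind the choice of b.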
F. Homogeneity and roh • Sample 10 school classrooms from 1,000 • Each classroom has exactly 24 children • Alternative 1: sample characteristic is a dichotomy • Here the intra-class correlation roh = 0.088 • Moderate amount of homogeneity within clusters • Actual sample size n = 240 • Effective sample size: neff = 240/[1 + (24 - 1)(0.088)] = 240/3.024 ≈ 79
Nearly Perfect homogeneity • Homogeneity within, heterogeneity among:
Perfect heterogeneity • Heterogeneity within, homogeneity among: • = undefined (total population)
G. Portability of roh • Estimation • Design
Illustration • Crime victimization survey in a large public housing project • Sample apartments • Responsible adult interviewed about victimizations occurring to any member of the household • A = 400 floors with exactly B = 15 apartments • a = 10 floors selected by SRS • b = 5 apartments selected by SRS
Sample results
Sample floor    HH’s “touched” by crime
1               4
2               4
3               5
4               5
5               3
6               1
7               0
8               1
9               2
10              1
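One way to work through these sample results is sketched below. It assumes the simplified variance estimate based on the variation among floor proportions (ignoring the finite population corrections), an SRS comparison variance of p(1 - p)/(n - 1), and the synthetic estimate roh = (deff - 1)/(b - 1). The variable names are illustrative, and this is not necessarily the computation used in the original lecture.

```python
from statistics import variance

# Households "touched" by crime among the b = 5 sampled apartments per floor.
counts = [4, 4, 5, 5, 3, 1, 0, 1, 2, 1]
a, b = len(counts), 5
n = a * b

cluster_props = [c / b for c in counts]  # proportion on each sampled floor
p = sum(counts) / n                      # overall sample proportion

s2_a = variance(cluster_props)           # variance among floor proportions
v_cluster = s2_a / a                     # simplified variance of p
v_srs = p * (1 - p) / (n - 1)            # variance if the sample were an SRS

deff = v_cluster / v_srs                 # estimated design effect
roh = (deff - 1) / (b - 1)               # synthetic roh

print(f"p={p:.2f}  se={v_cluster ** 0.5:.3f}  deff={deff:.2f}  roh={roh:.2f}")
```

Under these assumptions p = 0.52, deff is about 2.65, and roh is about 0.41, a high degree of homogeneity within floors. An estimate of this kind is what would be carried over when designing a similar survey with a different subsample size.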