SSRC Eurasia Quantitative Methods Webinar: Complex Sampling Designs. Professor Jane Zavisca, University of Arizona. December 10, 2012. janez@u.arizona.edu
Objectives • Be able to identify complex sampling designs and evaluate sample quality • Know how to adjust for sample design in statistical analysis • To understand sampling designs, focus on the process of selection, not on the resulting characteristics of the sample.
What is a probability sample? • Elements are selected randomly from a sampling frame (list of all elements in population) • Every element in a sampling frame has a known and non-zero chance of selection
Why draw a probability sample? • To represent a population with a sample • Required to use inferential statistics, which are based on probability theory.
Error in the process of sampling • Target population → (coverage error) → Sampling frame → (sampling error: webinar focus) → Sample → (nonresponse error) → Respondents → (adjustment error) → Post-survey adjustments → Survey statistic • Source: Groves et al. 2009, Fig. 2.5
Sampling error • Arises in the step from sampling frame to sample. • Note: "sample" in this context means the entities selected into our sample (some may ultimately not respond). • Effects of sampling error • Random error: affects standard error estimates • Systematic error: affects point estimates (e.g. mean, regression coefficient)
Simple random sample (SRS) • Every entity in the sampling frame has an equal, independent, known, and non-zero probability of selection. • SRS is the baseline assumption of most statistics textbooks • But SRS is rarely used in practice
Complex sample designs deviate from SRS • Select elements with unequal probabilities of selection (corrected with weights) • Sample clusters of elements simultaneously • Stratify the frame into different groups that are sampled separately, proportionate to size
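A sketch of the usual SRS selection probability, with n as the sample size and N as the number of entities in the sampling frame (both symbols are labels assumed here, not taken from the slide):

```latex
p_i = \frac{n}{N} \quad \text{for every entity } i,
\qquad \text{e.g. } n = 100,\; N = 1000 \;\Rightarrow\; p_i = 0.10
```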
Most large surveys use “multi-stage samples” • Example of RLMS • Stage 1: Sample regions • Stage 2: Sample cities/rural areas • Stage 3: Sample census tracts • Stage 4: Sample addresses • Stage 5: Sample households • Probability of selection is equal to the product of the probabilities of selection at each stage • Such a design always introduces clustering (and sometimes stratification). Depending on how sampling is done at each stage, it may also introduce unequal selection probabilities.
Simplified example • Wish to sample 100 people from a population of 10 villages. The population of each village is as follows: • Villages 1-4: 50 residents • Villages 5-8: 100 residents • Villages 9-10: 200 residents • The total population across all villages = 1000. • A simple random sample would disregard village and draw 100 respondents from the list of 1000 people (sampling fraction = 10%). • However, we can only afford to visit 4 villages. We randomly sample 4 villages, and then sample 25 people within each village to achieve our sample size of 100. • Two issues: • Unequal sampling probabilities (this can be fixed using probability proportionate to size) • Underestimate of variance (people in the same village are likely to be similar to each other; i.e. variance within a village is less than variance within the population)
SRS at each stage → unequal probabilities across stages • Each village’s probability of being selected is .4 (4/10) • Each individual’s probability of being selected = village probability × within-village probability • Village 1: .4*(25/50)=.2 • Village 2: .4*(25/50)=.2 • Village 3: .4*(25/50)=.2 • Village 4: .4*(25/50)=.2 • Village 5: .4*(25/100)=.1 • Village 6: .4*(25/100)=.1 • Village 7: .4*(25/100)=.1 • Village 8: .4*(25/100)=.1 • Village 9: .4*(25/200)=.05 • Village 10: .4*(25/200)=.05 • With this sampling design, weights should be applied, calculated as the inverse of the selection probability.
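The arithmetic on this slide can be reproduced with a short script. This is a minimal sketch, not part of the webinar materials; the function and variable names are hypothetical, and the only inputs are the village sizes and sample sizes from the simplified example.

```python
# Village populations from the simplified example (villages 1-10).
village_pop = [50] * 4 + [100] * 4 + [200] * 2

N_VILLAGES_SAMPLED = 4   # villages drawn by SRS at stage 1
PEOPLE_PER_VILLAGE = 25  # people drawn by SRS inside each sampled village


def two_stage_srs(pop_sizes, n_villages, n_people):
    """Return (selection probability, weight) per village under two-stage SRS."""
    p_village = n_villages / len(pop_sizes)       # 4/10 = .4 for every village
    rows = []
    for size in pop_sizes:
        p_person = p_village * (n_people / size)  # product of the stage probabilities
        rows.append((p_person, 1 / p_person))     # weight = inverse of the probability
    return rows


design = two_stage_srs(village_pop, N_VILLAGES_SAMPLED, PEOPLE_PER_VILLAGE)
for village, (p, w) in enumerate(design, start=1):
    print(f"Village {village:2d}: p = {p:.2f}, weight = {w:.0f}")
```

Residents of the smallest villages end up with weight 5 and residents of the largest with weight 20, which is the imbalance the weights are meant to undo.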
Better solution: probability proportionate to size → self-weighting sample • Alter the PSU (village) probability of being sampled to make individual probabilities equal. • Each village’s probability of being selected is proportional to its population: 4*(village population/1000), i.e. .2 for villages 1-4, .4 for villages 5-8, and .8 for villages 9-10 • Each individual’s probability of being selected • Village 1: 4*(50/1000)*(25/50)=.1 • Village 2: 4*(50/1000)*(25/50)=.1 • Village 3: 4*(50/1000)*(25/50)=.1 • Village 4: 4*(50/1000)*(25/50)=.1 • Village 5: 4*(100/1000)*(25/100)=.1 • Village 6: 4*(100/1000)*(25/100)=.1 • Village 7: 4*(100/1000)*(25/100)=.1 • Village 8: 4*(100/1000)*(25/100)=.1 • Village 9: 4*(200/1000)*(25/200)=.1 • Village 10: 4*(200/1000)*(25/200)=.1 • This solves the weighting issue, but clustering remains.
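A minimal sketch of the probability-proportionate-to-size calculation for the same ten villages. It assumes a village's inclusion probability is (number of villages drawn) × (village population / total population), a common simplification; the function and variable names are hypothetical.

```python
# Village populations from the simplified example (villages 1-10).
village_pop = [50] * 4 + [100] * 4 + [200] * 2


def pps_probabilities(pop_sizes, n_psu, n_per_psu):
    """Return (village probability, individual probability) per village under PPS."""
    total = sum(pop_sizes)
    out = []
    for size in pop_sizes:
        p_village = n_psu * size / total  # larger villages are more likely to be drawn
        p_within = n_per_psu / size       # fixed take of 25 people per sampled village
        out.append((p_village, p_village * p_within))
    return out


pps = pps_probabilities(village_pop, n_psu=4, n_per_psu=25)
for village, (p_v, p_i) in enumerate(pps, start=1):
    print(f"Village {village:2d}: village p = {p_v:.1f}, individual p = {p_i:.2f}")
```

The village size cancels out of the product, so every individual's probability equals 4 × 25 / 1000 = .1 and no weights are needed.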
3 things you need to know to adjust statistical analyses • Were unequal probabilities applied? If so, find the variable identifying sampling weights • Are the sampling units clustered? If so, find the variable identifying primary sampling units • Was the sampling stratified? If so, find the variable identifying strata. • Tips: • Read survey documentation very carefully to identify the sampling design and corresponding variables. • In some datasets it is possible to represent multiple populations (e.g. households vs individuals, 2010 vs 2011). In that case be sure to use the appropriate versions of these variables.
Core Stata syntax • Step 1: Tell Stata about the sample design. • svyset psu [pweight=weight], strata(strata) • psu, weight, and strata are placeholders for variables whose actual names will vary depending on your dataset. • weight is the variable containing probability weights (pweight is the Stata keyword that marks them as probability weights) • psu is the variable identifying "primary sampling units" (the initial stage of clustering) • strata is the variable indicating the sampling strata • Some complex survey designs involve only some of these elements; none are required. • Other esoteric sampling features (e.g. replicate weights, finite population correction) are less common and won’t be discussed today.
Unequal probability by design • To represent a subpopulation: oversample a relatively small group to ensure ability to make inferences about that subpopulation. • To correct for sampling procedures that lead to unequal probabilities. • E.g. use addresses to sample individuals (individuals who live in small households will be overrepresented). • To exclude individuals who should not be included for inferences to particular subpopulations (e.g. RLMS cross-sections). • Unweighted estimates will be wrong for any variable that is correlated with the determinants of unequal weights.
Example: Soviet Interview Study: unequal probabilities by design • Formal target population: all Soviet emigres to the United States who arrived in 1979-1982 • Substantive (“referent”) target population: sector of Soviet society the survey respondents could represent (adult European urban population) • Sampling frame: “fairly complete” list of 33,618 emigres who arrived from 1979-82 and were between 21 and 70 at time of arrival. • Oversample: all known non-Jews were selected into the sample.
Example: Kaluga Consumption Study • No sampling frame of individuals available • Use list of addresses instead • Randomly select one adult from within each sampled address • Select 1000 adults from a list of 100,000 addresses • SRS of addresses: p_h = 1000/100,000 = .01 • p_i|h = 1/s, where s = number of adults in household • An adult who lives alone has a (.01)*(1/1) = .01 chance of selection • An adult who lives with one other adult has a (.01)*(1/2) = .005 chance of selection • Unequal probabilities of selection → people who live in small households will be overrepresented.
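To see how one-adult-per-address selection distorts an unweighted estimate, here is a small sketch. The address sampling fraction (.01) comes from the slide; the ten respondents are made-up values for illustration only, and the function names are hypothetical.

```python
P_ADDRESS = 1000 / 100_000  # SRS of addresses, as on the slide (.01)


def person_probability(adults_in_household):
    """Chance that a given adult is selected: address chance times 1/s."""
    return P_ADDRESS * (1 / adults_in_household)


def weighted_mean(values, weights):
    """Weighted mean: sum of value * weight divided by the sum of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical respondents: number of adults in each respondent's household.
num_adults = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

# Sampling weight = inverse of the selection probability (100 * s here).
weights = [1 / person_probability(s) for s in num_adults]

unweighted = sum(num_adults) / len(num_adults)
weighted = weighted_mean(num_adults, weights)
print(f"unweighted mean adults per household: {unweighted:.2f}")
print(f"weighted mean adults per household:   {weighted:.2f}")
```

In this made-up sample the unweighted mean is 2.0 and the weighted mean is 2.5: adults from small households are overrepresented, so the unweighted figure understates the household size of the typical adult.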
Why did I choose this design? • There were no adequate lists of adults in Kaluga and their contact information • There was also no adequate list of addresses • Did not want to use random walk method – too difficult to control quality • Needed to enumerate addresses (walk streets and write down) – but could not afford to walk entire city. • Used multistage design to select clusters (electoral districts), and enumerated only those clusters • Increased quality control at higher levels – but still no advance knowledge of household size before address selection. Requires ex post facto weights.
Open kworking.dta • use kworking.dta • Get the unweighted mean of the number of adults per household • mean num_ad • In which direction should we expect the mean to be skewed in the unweighted analysis? • What about the mean number of adults who are married? What direction of skew is expected?
Apply sampling weights. Note “weight” is the name of the weight variable in my dataset. • Identify the sampling design for Stata • svyset [pw=weight] • Apply sampling design (weights) to mean estimation (the svy: prefix picks up the weights from svyset, so none are given on the command itself) • svy: mean num_ad • Note both mean and standard error change • Examine distribution of sampling weights • tab weight • sum weight
Regression and weights • Try running a model predicting frequency of eating fruit (fruitf02), with and without the weights correction • Some possible variables to include • Age (age) • Age squared (agesq) • Income (adjhhinc) • highed
How sample design affects standard error • Complex sample designs deviate from simple random sample • Clustering (usually increases standard error) • Stratification (usually decreases standard error) • Think of effect on standard error in terms of “effective sample size” • We know that larger N decreases standard error (increases our confidence in our estimates) • Effective sample size = the N in a simple random sample that would produce same degree of standard error as the complex sampling strategy produces.
Effects of clustering on standard error • In a multi-stage sample, units within the same PSU (primary sampling unit) tend to be more similar than units in different PSUs, so formulas that assume SRS underestimate the variance. • Clustering increases the standard error over what would have been observed with a simple random sample of the same size (the “design effect”)
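The link between the design effect and effective sample size can be sketched numerically. The snippet below uses Kish's approximation for equal-sized clusters, deff = 1 + (m - 1) × rho, which is a standard textbook formula rather than something stated on the slide; the sample size, cluster size, and intraclass correlation are hypothetical illustration values.

```python
import math


def design_effect(cluster_size, icc):
    """Kish's approximation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, deff):
    """Size of the SRS that would give the same standard error."""
    return n / deff


n = 1000    # hypothetical total sample size
m = 40      # hypothetical number of respondents per PSU
rho = 0.05  # hypothetical intraclass correlation (similarity within a PSU)

deff = design_effect(m, rho)
n_eff = effective_sample_size(n, deff)
se_inflation = math.sqrt(deff)  # factor by which the standard error grows

print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.0f}")
print(f"standard error inflation: {se_inflation:.2f}")
```

Even a modest within-PSU correlation cuts the effective sample size sharply when clusters are large, which is why ignoring clustering overstates confidence in the estimates.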
Kaluga survey • Electoral districts as PSUs • Randomly sampled 25 PSUs (using probability proportionate to size) • Enumerated each sampled district to create a complete list of addresses • Sampled 40 addresses within each sampled PSU • Sampled one adult within each household • Sample is self-weighting up to this step, but sampling within households → unequal probabilities and a need for weights.
Add PSU to svyset • Completely unadjusted: • mean dorm • Adjusted for weights only • svyset [pw=weight] • svy: mean dorm • Account for PSU (Stata commands are case-sensitive and must be lowercase) • svyset site [pw=weight] • svy: mean dorm • What changes?
Stratified sample • Proportionately representative of each stratum • Stratify population by appropriate criteria • Randomly select within each category • Example: • Stratify regional PSUs into rural versus urban areas • Select PSUs from each stratum (rural vs urban) in proportion to its share of the population.
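Proportionate allocation across strata amounts to giving each stratum the same share of the sample that it has in the frame. A minimal sketch, using made-up counts of urban and rural PSUs and a hypothetical function name:

```python
def proportionate_allocation(stratum_sizes, n_total):
    """Allocate n_total sample units to strata in proportion to stratum size.

    Rounding can make the allocations miss n_total by a unit or two in
    general; the hypothetical numbers below divide evenly.
    """
    total = sum(stratum_sizes.values())
    return {name: round(n_total * size / total)
            for name, size in stratum_sizes.items()}


# Hypothetical frame: number of PSUs in each stratum.
strata = {"urban": 70, "rural": 30}

allocation = proportionate_allocation(strata, n_total=20)
print(allocation)
```

With 70% of PSUs urban, 14 of the 20 sampled PSUs are drawn from the urban stratum and 6 from the rural one, each by random selection within its stratum.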
Postsurvey adjustment for nonresponse • Weight existing respondents to compensate for nonrespondents who were missed. • E.g. give more weight to urban respondents if response rates in urban areas were lower than in rural areas. • Adjusting for unit nonresponse • Possible if you know relevant characteristics of the sampled persons who were missed. • Poststratification weighting • Makes sure the sample distribution of the poststratification variables matches that of the target population. • Possible if you know the population distribution of the variable used for weighting.
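Poststratification weights can be sketched as the ratio of each group's population share to its share among respondents. The respondent counts and population shares below are made up for illustration, and the function name is hypothetical.

```python
def poststrat_weights(sample_counts, population_shares):
    """Weight per group = population share / share among respondents."""
    n = sum(sample_counts.values())
    return {group: population_shares[group] / (count / n)
            for group, count in sample_counts.items()}


# Hypothetical respondents: urban areas responded at a lower rate.
sample_counts = {"urban": 300, "rural": 200}
# Hypothetical known population distribution (e.g. from a census).
population_shares = {"urban": 0.75, "rural": 0.25}

weights = poststrat_weights(sample_counts, population_shares)

# Check: the weighted urban share should now match the population share.
weighted_total = sum(sample_counts[g] * weights[g] for g in sample_counts)
weighted_urban_share = sample_counts["urban"] * weights["urban"] / weighted_total

print(weights)
print(f"weighted urban share: {weighted_urban_share:.2f}")
```

Urban respondents receive a weight above 1 and rural respondents a weight below 1, so the weighted sample reproduces the assumed 75/25 split.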