280 likes | 403 Views
Weighting and imputation. PHC 6716 July 13, 2011 Chris McCarty. Weighting. Weighting is the process of adjusting the contribution of each observation in a survey sample based on independent knowledge about appropriate distributions
E N D
Weighting and imputation PHC 6716 July 13, 2011 Chris McCarty
Weighting • Weighting is the process of adjusting the contribution of each observation in a survey sample based on independent knowledge about appropriate distributions • Before weighting the implied weight of each observation is 1.0 • After weighting, some observations will have weights >1.0 and some <1.0, and some at 1.0 • No observations should have a weight of 0 • Two general types of weighting: • Design weights -- Adjusting for differences due to intentional disproportionate sampling (e.g. over-sampling African Americans or certain regions) • Post-stratification weights -- Adjusting for differences in population or households when release of sample is intended to be representative (e.g. adjustments for non-response of young people)
Common sources for calculating weights • U.S. Census • Current Population Survey • American Community Survey • For Florida County, Age, Race, Ethnicity the BEBR Population Program
How frequency procedures use weights • All statistical packages have options on procedures to incorporate weights • For frequency procedures the weights are multiplied by the unweighted frequencies, then percentages are calculated on the result
Original and adjusted frequency of Employment variable *This is the sum of the weights for the category
Notes on weighting • Typically you don’t want weights to make enormous differences • Keep in mind that with weighting you are saying you have information extraneous to the survey process that informs you of the proper distribution • You could conceivably up-weight results from a small sample strata • Weights are typically used for accurate estimates of prevalence • Models where you test relationships do not need weights if you include the variables you would use to weight
Weighting with more than one variable • Combined weight with multiplication • Create individual weights for each variable then multiply weights to get a single weight (Wage*Wgender) • Not a good solution with a lot of variables • Combine weights iteratively • Calculate weight for a variable using frequency table • Use that weight in frequency of second variable to create weight • Use that weight in frequency of third variable to create weight • And so on
Consumer Confidence • Survey of approximately 500 Florida households each month • RDD Landline Survey • Five questions (components) averaged into an Overall Index • Until now only post-stratification weighting by proportion of households by county
Potential weighting variablesCounty • Typically we get under-representation from large south Florida counties (Miami-Dade) and over-representation from northern counties (Alachua) • Household proportions are estimated between census years by BEBR • Weights June 2011.xls
Potential weighting variablesAge • RDD tends to lead to over-sampling of seniors with landlines • Cell phones emerged as a problem around 2005 • No reliable age group data until 2010 Census • Elderly tend to be less confident than younger respondents due to fixed incomes • Weights June 2011.xls
Potential weighting variablesHispanic Ethnicity • Cell phones tend to be used disproportionately by Hispanics • 2010 Census provided reliable data about proportion of Hispanic Floridians • Hispanics tend to have lower confidence than non-Hispanics • Weights June 2011.xls
Potential weighting variablesGender • Cell phones are disproportionately used by young males • Monthly CCI uses youngest male/oldest female respondent selection • Weights June 2011.xls
Example 2- FHIS • The state of Florida wanted to estimate rates of the uninsured • They stratified the state into 17 regions and wanted to be able to make estimates for the state and each region with a tolerable margin of error • On the state level they wanted to be able to say something about Blacks, Hispanics and those under 200 percent of the poverty level • Data on these demographics for each Florida telephone exchange were obtained prior to sampling • Strata were created from exchanges • This made it possible to create weights based on known households in each exchange • This required design weights to adjust for disproportionate sampling
Example 3 – Medicaid survey • The state wanted to evaluate Medicaid Reform being conducted in Duval and Broward counties • They wanted to administer a modified CAHPS instrument to Adults and Children separately • They wanted to stratify by plan as well, sampling a minimum number of observations per plan • In the end they wanted to compare plans, counties and adults and children • These weights required knowledge about total enrollment for each one of these characteristics (plan, age, county)
Imputation • Like weighting, imputation involves adjusting the analysis after data collection • Unlike weighting, imputation is the deliberate creation of data that were not actually collected • The main reason for imputation is to retain observations in a statistical analysis that would otherwise be left out • Your ability to discover significant results may be compromised by too many missing values • In some case there may be systematic bias associated with missing data so that not imputing presents an unrepresentative result
Imputation and regressions • Imputation is particularly common when data are analyzed with regression analysis • A regression model explains the variability in a dependent variable using one or more independent variables • Observations can only be included in the regression if they have values for all variables in the model • Models with a lot of variables increase the probability that an observation will have at least one missing value for them
Example ModelIncome = β1(Age) + β2(Education)+ β3(Employed)
Imputation algorithms • Two general categories • Random imputation assigns values randomly, often based on a desired statistical distribution • Deterministic imputation typically assigns values based on existing knowledge • Existing knowledge could be in the data • Single imputation fills missing data with one value, such as the mean of all non-missing values for a continuous variable • Multiple imputation fills in missing data with a set of plausible values • Hot deck imputation fills in missing values with those of an observation that matches on key variables