Post-collection processing of data (continued)

Survey Research and Design, Spring 2006, Class #13 (Week 15)

Today’s objectives
• To answer questions you have
• To understand the design effect and how to handle it
• To understand the concept of weighting and learn how to calculate weights
• To begin group presentations
Survey Research and Design (Umbach)

Post-collection processing of survey data
• Several different steps
  • Coding
  • Data entry
  • Editing
  • Handling item-missing data
  • Weighting
  • Sampling variance estimation
• Weighting is a very common post-collection process

Probability Sampling
• Simple random
• Stratified random
  • Proportional
  • Nonproportional
• Systematic
• Cluster

Accounting for complex sample design when conducting analyses
• Standard errors or design effect resulting from clustering
• Weighting
What are the implications for analysis and/or external validity?

Design effect
• Term used to describe the effect of a complex sample design
• Measure of the departure of the complex design from a simple random sample
• Ignoring it causes misestimation (usually underestimation) of standard errors
• Magnitude of the cluster effect can be assessed using the intraclass correlation coefficient (ICC)

Corrective strategies for the design effect (DEFF)
• Use of software packages (e.g., AM, SUDAAN, WesVar, SPSS, SAS, Stata)
  • Most precise
  • Specify strata and cluster (primary sampling unit, PSU)
• Adjust estimated standard errors by known DEFF
• Alter alpha criteria
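The "adjust estimated standard errors by known DEFF" strategy can be sketched in a few lines. This is only an illustration, not something taken from the slides: it assumes the common Kish approximation DEFF = 1 + (m - 1) * ICC for clusters of roughly equal size m, and the cluster size, ICC, and standard error below are hypothetical values.

```python
import math


def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Kish approximation: DEFF = 1 + (m - 1) * ICC (assumes similar cluster sizes)."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def adjust_standard_error(srs_se: float, deff: float) -> float:
    """Inflate a standard error computed under SRS assumptions by sqrt(DEFF)."""
    return srs_se * math.sqrt(deff)


# Hypothetical inputs, chosen only for illustration.
deff = design_effect(avg_cluster_size=20, icc=0.05)
adjusted_se = adjust_standard_error(srs_se=0.10, deff=deff)

print(f"DEFF = {deff:.2f}")
print(f"SE assuming SRS = 0.100, SE adjusted for clustering = {adjusted_se:.3f}")
```

Even a small ICC inflates the standard error noticeably once clusters are large, which is why dedicated survey software that takes strata and PSU information is listed as the most precise option.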

Weighting
• Why weight?
  • For a variety of reasons, distributions in our sample may differ markedly from the population; e.g., more females than males
  • If females differ from males on our survey statistic, estimates will be biased
  • Hopefully weighting will help to reduce this bias
  • However, weighting also increases variances, so weighted estimates will be less precise
• For your purposes, two weights to consider
  • Selection weight – takes into account oversampling of subgroups
  • Nonresponse weight – takes into account differential nonresponse across subgroups

Weighting
• In order to weight, we need some information on all members of the sample
• What information?
  • For selection weights, use whatever criteria were used for the oversampling
  • For nonresponse weights, we want variables that are good predictors of nonresponse
    • If you have a lot of variables, you can use data mining programs to find key predictor variables
    • Otherwise, rely on previous survey research studies: females, older individuals, Whites, and those with high GPA or high SES are more likely to respond
• Think about possible predictors when requesting your sampling frame

Selection weights
• Suppose we design a survey project so that our sample will be 1,000 IU undergraduate students.
• The race/ethnicity breakdown for IU undergrads (N=20,732) is
  • African American – 2.9%
  • American Indian/Alaskan Native – 0.3%
  • Asian/Pacific Islander – 3.3%
  • Hispanic – 2.3%
  • White – 88.1%
  • International – 3.1%
• With 1,000 students, we would expect to have only 31 int’l students
• But if we want to also analyze int’l students as a subgroup, we would need a larger sample.
• So we could oversample int’l students, so that we end up with 400 in our sample.

Selection weights
• So now our sample is
  • 600 U.S. students and 400 int’l students
  • Int’l students are now 40% of our sample instead of 3.1%.
• Weights are estimated as the reciprocal of the selection probability ps (sample size/population size): weight = 1/ps = population size/sample size
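As a rough sketch, the reciprocal-of-selection-probability rule can be applied to this example as follows. The group population counts are an assumption: the slides give only N=20,732 and the 3.1% international share, so the counts below are rounded from those figures (about 643 int’l and 20,089 U.S. students). The variable and function names are illustrative.

```python
POPULATION_TOTAL = 20_732
INTL_SHARE = 0.031  # 3.1% international, from the race/ethnicity breakdown

# Assumption: group counts are rounded from the published percentage.
pop_intl = round(POPULATION_TOTAL * INTL_SHARE)
pop_us = POPULATION_TOTAL - pop_intl

population = {"US": pop_us, "Intl": pop_intl}
sample = {"US": 600, "Intl": 400}


def selection_weight(sample_n: int, population_n: int) -> float:
    """Weight = 1 / p_s, where p_s = sample size / population size."""
    p_s = sample_n / population_n
    return 1.0 / p_s


selection_weights = {
    group: selection_weight(sample[group], population[group]) for group in sample
}

for group, weight in selection_weights.items():
    p_s = sample[group] / population[group]
    print(f"{group}: p_s = {p_s:.4f}, weight = {weight:.2f}")
```

Under these assumed counts, each sampled U.S. student stands in for roughly 33 students in the population, while each sampled int’l student stands in for fewer than two.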

Selection weights
• How do these weights make a difference? Suppose mean satisfaction for int’l students is 3.5, but for U.S. students it is only 2.5
• With a sample of 969 U.S. students and 31 int’l students, mean satisfaction for the sample would be
  • (969*2.5 + 31*3.5)/(969 + 31) = 2.53
• With a sample of 600 U.S. students and 400 int’l students, mean satisfaction for the sample would be
  • (600*2.5 + 400*3.5)/(600 + 400) = 2.90
• Let’s recalculate the second mean using the selection weights (33.48 for U.S. students, 1.61 for int’l students):
  • (600*2.5*33.48 + 400*3.5*1.61)/(600*33.48 + 400*1.61) = 2.53
• In essence, the 600 U.S. respondents “count for more” when we use the selection weights
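The three means on this slide can be reproduced with a small weighted-mean helper. This is a minimal sketch using the group sizes, group means, and selection weights given above; the function name and the tuple layout are illustrative choices.

```python
def weighted_mean(groups):
    """Weighted mean over subgroups.

    groups: iterable of (n, group_mean, weight) tuples, one per subgroup.
    """
    numerator = sum(n * group_mean * weight for n, group_mean, weight in groups)
    denominator = sum(n * weight for n, _, weight in groups)
    return numerator / denominator


# (n, mean satisfaction, weight); a weight of 1.0 means "unweighted".
proportional_sample = [(969, 2.5, 1.0), (31, 3.5, 1.0)]
oversample_unweighted = [(600, 2.5, 1.0), (400, 3.5, 1.0)]
oversample_weighted = [(600, 2.5, 33.48), (400, 3.5, 1.61)]

mean_proportional = weighted_mean(proportional_sample)
mean_unweighted = weighted_mean(oversample_unweighted)
mean_weighted = weighted_mean(oversample_weighted)

print(f"Proportional sample:     {mean_proportional:.2f}")
print(f"Oversample, unweighted:  {mean_unweighted:.2f}")
print(f"Oversample, weighted:    {mean_weighted:.2f}")
```

The weighted oversample recovers the same mean as the proportional sample, which is the point of the selection weights.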

Nonresponse weights
• Several different ways to calculate nonresponse weights (see Kalton article)
• Most common is cell weighting
  • Sample is divided into subgroups based on data external to the survey
  • Weights are calculated based on the probability of response
• Suppose we administer a survey to a simple random sample (SRS) of 1,600 students, with an overall response rate of 50%.
• Because we have data on gender for the entire sample, we can calculate response rates for males and females:
  • Males: 800 in sample, 344 responded, response rate = 43%
  • Females: 800 in sample, 456 responded, response rate = 57%

Nonresponse weights
• Nonresponse weights are the reciprocal of the probability of response:
  • Males: 1/.43 = 2.326
  • Females: 1/.57 = 1.754
• Remember to double-check your weights by multiplying them by the number of respondents in each cell; the result should be close to the cell’s original sample size (800 here)
  • Males: 2.326*344 = 800.14
  • Females: 1.754*456 = 799.82
• If you have both selection weights and nonresponse weights, multiply them together to get one weight.
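A minimal sketch of cell weighting for this example, using the sample and respondent counts from the slides. The dictionary layout and function names are illustrative, and the selection weight in the last step is a hypothetical value included only to show how the two weights combine.

```python
cells = {
    "Males": {"sampled": 800, "responded": 344},
    "Females": {"sampled": 800, "responded": 456},
}


def nonresponse_weight(sampled: int, responded: int) -> float:
    """Reciprocal of the cell's response rate (responded / sampled)."""
    response_rate = responded / sampled
    return 1.0 / response_rate


nonresponse_weights = {
    cell: nonresponse_weight(counts["sampled"], counts["responded"])
    for cell, counts in cells.items()
}

for cell, weight in nonresponse_weights.items():
    # Double-check: weight * respondents should recover the cell's sample size.
    recovered = weight * cells[cell]["responded"]
    print(f"{cell}: weight = {weight:.3f}, weight * respondents = {recovered:.1f}")


def combined_weight(selection_w: float, nonresponse_w: float) -> float:
    """A single final weight is the product of the two component weights."""
    return selection_w * nonresponse_w


# Hypothetical selection weight of 1.5, for illustration only.
final_weight_males = combined_weight(1.5, nonresponse_weights["Males"])
```

Because the code uses the unrounded response rates, the check recovers 800 in each cell rather than the 800.14 and 799.82 produced by the rounded weights on the slide.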

A warning about weights
• You should have noticed that weighting inflates your apparent sample size
  • In the previous example we had a final sample n of 800 respondents; using the nonresponse weights for males and females increases the weighted sample size to 1,600
  • Some software programs do not take this into account, and use n=1,600 for statistical tests instead of n=800.
• Your weighted number of cases should equal your unweighted number of cases
• To correct, you need to normalize your weights
  • Find the mean weight across the respondents in your sample
  • Divide all weights by this mean
  • Weights should now sum to your unweighted n
• These are also called relative weights

A warning about weights
• In the nonresponse example, the mean of the weights is 2.0 (1,600 weighted cases / 800 respondents)
• So the normalized weights are
  • Males: 2.326/2.0 = 1.163
  • Females: 1.754/2.0 = 0.877
• Now if we apply the weights to our cell n’s
  • Males: 1.163*344 = 400.1
  • Females: 0.877*456 = 399.9
  • These sum to 800, our unweighted sample size
• The mean of these new weights should equal 1 (a way to double-check)
• You should check each procedure to see if you should normalize!!
  • With SPSS, you generally need to normalize
  • With SAS, it depends; also, some procedures will normalize for you
  • See Heck and Thomas article
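The normalization steps (find the mean weight, divide every weight by it, check the sum) can be sketched as below. The example builds one weight per respondent from the cell counts in the nonresponse example, using unrounded weights; the function name is an illustrative choice.

```python
def normalize_weights(weights):
    """Divide each weight by the mean weight, so the weights sum to the unweighted n."""
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


# One weight per respondent: 344 males and 456 females,
# each cell originally sampled at n = 800.
raw_weights = [800 / 344] * 344 + [800 / 456] * 456

mean_raw = sum(raw_weights) / len(raw_weights)
relative_weights = normalize_weights(raw_weights)

print(f"Mean raw weight:          {mean_raw:.3f}")
print(f"Sum of raw weights:       {sum(raw_weights):.1f}")
print(f"Normalized male weight:   {relative_weights[0]:.3f}")
print(f"Normalized female weight: {relative_weights[-1]:.3f}")
print(f"Sum of relative weights:  {sum(relative_weights):.1f}")
```

After normalization the weights sum to the 800 actual respondents and have a mean of 1, so software that treats the sum of weights as the sample size will use the correct n.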

For next class…
• Group projects due
• Remaining group presentations
• Course evaluations
