

  1. Myoung Ho Lee STATISTICAL METHODS FOR REDUCING BIAS IN WEB SURVEYS 13th September 2012

  2. Contents • Introduction • Web surveys • Methodology - Propensity Score Adjustment - Calibration (Rim weighting) • Case Study • Discussion and Conclusion

  3. Trends in Data Collection Paper and Pencil => Telephone => Computer => Internet (Web) • Internet penetration Introduction

  4. Pros and Cons of Web surveys • Pros - Low cost and speed - No interviewer effect - Visual, flexible and interactive - Respondents' convenience • Cons - Quality of sample estimates • Web surveys may be the solution. But there are problems! Introduction

  5. Previous Studies • Harris Interactive (2000 ~ ) • Lee (2004), Lee and Valliant (2009) • Huh and Cho (2009) • Bethlehem (2010), etc. • Lee and Valliant (2009): good performance in simulation • But most other results have been less encouraging - Malhotra and Krosnick (2007), Huh and Cho (2009) Introduction

  6. Volunteer Panel Web Survey Protocol (Lee, 2004): Under-coverage, Self-selection, Non-response • Challenge: Fix anticipated biases in web survey estimates that result from under-coverage, self-selection and non-response Web surveys

  7. Proposed Adjustment Procedure for Volunteer Panel Web surveys (Lee, 2004) Methodology

  8. Propensity Score Adjustment (PSA) • Original idea: comparison of two groups, treatment and control, in observational studies (Rosenbaum and Rubin, 1983) - by weighting using all auxiliary variables that are thought to account for the differences • In the context of web surveys, this technique aims to correct for differences between offline people and online people - by modeling the inclination of people to participate in the volunteer panel web survey Methodology

  9. “Webographic”: variables that overlap between the web and reference surveys - To capture the difference between online and offline populations (Schonlau et al., 2007) - For example, “Do you feel alone?”, “In the last month have you read a book?”, etc. (Harris Interactive) Methodology

  10. Propensity score: e(x_i) = P(z_i = 1 | x_i), the probability of belonging to the web sample given the covariates. It is assumed that the z_i are independent given a set of covariates (x_i) • ‘Strong ignorability assumption’: the response variable is conditionally independent of treatment assignment given the propensity score. Methodology

  11. Logistic regression model: log[ e(x_i) / (1 − e(x_i)) ] = β0 + β′x_i • Variable Selection • Include variables related not only to treatment assignment but also to the response, in order to satisfy the ‘strong ignorability assumption’ (Rosenbaum and Rubin, 1984; Brookhart et al., 2006) Methodology

  12. Variable Selection • In practice, stepwise selection has often been used to develop good predictive models for treatment assignment • Most previous web studies: use of all available covariates (5-30) • Huh and Cho (2009): chose 9 or 7 out of 123 covariates based on “subjective” views Methodology

  13. Variable Selection • Stepwise logistic regression using SIC (Schwarz Information Criterion) - suited to a large number of covariates with little theoretical guidance • LASSO (PROC GLMSELECT in SAS) - a good alternative to stepwise variable selection • Boosted tree (“gbm” in R) - determines a set of split conditions (see the sketch below) Methodology
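A minimal sketch of two of the selection routes above, using scikit-learn stand-ins rather than the tools named on the slide (SAS PROC GLMSELECT for LASSO, R's "gbm" for boosted trees); the data shapes, seed, and tuning values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 123))       # 123 overlapping covariates (illustrative)
z = rng.integers(0, 2, size=500)      # 1 = web sample, 0 = reference sample

# LASSO route: an L1 penalty shrinks weak predictors of sample
# membership to exactly zero, leaving a sparse covariate set.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, z)
kept = np.flatnonzero(lasso.coef_[0])
print("covariates kept by the L1 penalty:", kept)

# Boosted-tree route: the fitted trees' split conditions yield a
# variable-importance ranking from which a subset can be taken.
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3).fit(X, z)
top = np.argsort(gbm.feature_importances_)[::-1][:18]
print("top covariates by boosted-tree importance:", top)
```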

  14. Applying methods for PSA • Inverse propensity scores as weights - weights: w_i = 1 / ê(x_i) - then multiply them by the sampling weights (see the sketch below) • Subclassification (Stratification) - grouping homogeneous people into strata Methodology
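A minimal sketch of the inverse-weighting step, assuming ê(x_i) comes from a fitted propensity model such as the ones sketched earlier; all numbers are illustrative.

```python
import numpy as np

e_hat = np.array([0.8, 0.5, 0.2])   # estimated propensities of being online
d = np.array([1.0, 2.0, 1.5])       # base sampling weights

psa_w = d * (1.0 / e_hat)           # w_i = d_i / e_hat(x_i)
print(psa_w)                        # -> [1.25 4.   7.5 ]
```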

  15. Subclassification (Stratification) • Combine both the reference and web data into one sample • Estimate each unit's propensity score from the combined sample • Partition the units into C subclasses according to the ordered propensity scores, where each subclass has about the same number of units • Compute an adjustment factor and apply it to all units in the c-th subclass • Multiply the factor by the sampling weights to get the PSA weights (see the sketch below) Methodology
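A sketch of the five steps above, assuming C = 5 and a proportion-based adjustment factor f_c = (share of reference units in subclass c) / (share of web units in subclass c); the slide does not give the exact formula, so this form is a common choice stated here as an assumption.

```python
import numpy as np

def psa_subclass_weights(e_web, e_ref, d_web, C=5):
    """PSA-adjusted weights for web respondents via subclassification."""
    combined = np.concatenate([e_web, e_ref])
    # Quantile cut points so each subclass holds about the same number of units.
    cuts = np.quantile(combined, np.linspace(0, 1, C + 1)[1:-1])
    c_web = np.digitize(e_web, cuts)       # subclass index 0..C-1 per web unit
    c_ref = np.digitize(e_ref, cuts)
    w = np.asarray(d_web, dtype=float).copy()
    for c in range(C):
        share_ref = (c_ref == c).mean()    # share of reference units in class c
        share_web = (c_web == c).mean()    # share of web units in class c
        w[c_web == c] *= share_ref / share_web   # assumed adjustment factor f_c
    return w
```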

  16. Calibration (Rim weighting) • Matching sample and population characteristics only with respect to the marginal distributions of selected covariates • Little and Wu (1991) - iterative algorithm that alternately adjusts the weights according to each covariate's marginal distribution until convergence (see the sketch below) Methodology
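A rough sketch of the iterative algorithm described above (raking / iterative proportional fitting): each pass rescales the weights so one covariate's weighted category totals match the population margins, cycling through the covariates until the weights stop changing. Variable names and the convergence rule are illustrative assumptions.

```python
import numpy as np

def rake(w, covariates, margins, max_iter=50, tol=1e-8):
    """covariates: one integer-coded array per raking variable.
    margins: matching arrays of target population shares (each sums to 1)."""
    w = np.asarray(w, dtype=float).copy()
    for _ in range(max_iter):
        w_prev = w.copy()
        for g, target in zip(covariates, margins):
            total = w.sum()
            for k, share in enumerate(target):
                mask = g == k
                # Scale this category so its weighted share hits the target.
                w[mask] *= (share * total) / w[mask].sum()
        if np.max(np.abs(w - w_prev)) < tol:   # converged
            break
    return w
```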

  17. Case Study • Reference survey: “2009 Social Survey” by Statistics Korea - Culture & Leisure, Income & Consumption, etc. - All persons aged 15+ in 17,000 households - Sample size: 37,049 - Face-to-face mode - Post-stratification estimation - Assumed to be “True” Case Study

  18. Web survey • Recruiting volunteers from web sites (6,854 households) • Systematic sampling with unequal selection probabilities (inverse of rim weights using region, age, gender; see the sketch below) • Sample size: 1,500 households and 2,903 respondents • Overlapping covariates: 123 Case Study
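For concreteness, a sketch of systematic sampling with unequal selection probabilities, the scheme described above; the size measure (inverse rim weights) and the panel values below are illustrative assumptions, not the study's data.

```python
import numpy as np

def systematic_unequal(size_measure, n, seed=0):
    """Draw n units with probability proportional to size_measure."""
    rng = np.random.default_rng(seed)
    cum = np.cumsum(size_measure) / np.sum(size_measure) * n
    picks = rng.uniform(0, 1) + np.arange(n)   # one random start, skips of 1
    return np.searchsorted(cum, picks)         # indices of selected units

inv_rim_weights = np.random.default_rng(1).uniform(0.5, 2.0, size=6854)
households = systematic_unequal(inv_rim_weights, n=1500)
```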

  19. M1 = Stepwise(22), M2 = Stepwise(17), M3 = LASSO(12), M4 = Boosted tree(18) - the number of selected covariates in parentheses Case Study – Model Selection

  20. Assessment methods • 16 combinations: (Models 1, 2, 3 and 4) × (Inverse weighting / Subclassification) × (No Calibration / Rim weighting) • 12 response variables • Percentage of bias reduction Case Study
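The slide does not define the last measure; a common form, assumed here, is

percentage of bias reduction = 100 × (|bias before adjustment| − |bias after adjustment|) / |bias before adjustment|,

where bias is the deviation of a web-survey estimate from the reference-survey (“true”) value: 100% means the adjustment removed the bias entirely, and negative values mean it increased the bias.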

  21. Percentage of bias reduction [Chart: results for Models M1-M4 under Inverse weighting and Subclassification, for PSA alone and PSA with Calibration; the numeric values were not captured in the transcript]

  22. Why doesn't PSA work well alone? [Figure: propensity scores for each survey in 5 strata in Model 1] Discussion

  23. What are the possible solutions to fix poor PSA? • Setting a maximum value for the weights (see the sketch below) • Different subclassification algorithm - a formula for the variance of weights that depends on both the number of cases from each group within a stratum and the variability of propensity scores within the stratum • Matching PSA - match the limited number of treated-group members to a larger number of control-group members Discussion
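A minimal sketch of the first fix above: cap extreme PSA weights at a chosen maximum, then rescale so the weight total is preserved. The cap value is an illustrative assumption.

```python
import numpy as np

def cap_weights(w, cap=10.0):
    trimmed = np.minimum(w, cap)                # enforce the maximum weight
    return trimmed * (w.sum() / trimmed.sum())  # keep the weight total fixed
```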

  24. Violation of some assumptions - ‘Strong ignorability assumption’ - Missing at random (MAR) - Mode effects • Variable selection (What are webographic variables?) - Models significantly affect the performance of PSA - Perhaps expert knowledge, rather than a purely statistical approach, is needed - Further studies are needed Discussion

  25. Web surveys have attractive advantages • However, bias arises from self-selection, under-coverage and non-response • According to my case study results, => it seems difficult to apply PSA to the “real world” just yet • Further research on webographic variables and different PSA methods is needed Conclusion
