IV. Selecting Variables

jenny

Presentation Transcript


  1. IV. Selecting Variables

  2. How do we go about selecting variables for regression models? • In fact, we’ve already spent considerable time on this topic (questions of causality within a multivariate framework). • Most fundamentally, we should include variables only because, within a sound conceptual framework: • We want to find out how they affect the dependent variable; or • We want to control for their effects on the dependent variable.

  3. So, we include independent variables only within sound conceptual frameworks that lead us to hypothesize that the variables: • Have causal effects on the dependent variable. • Are correlated with other independent variables in the model. (see Allison, pages 49-52)

  4. Let’s keep in mind that properly conducted, randomized experimental design automatically imposes controls. • That is, it automatically ensures that there’s no correlation between the treatment variable and the characteristics of the subjects. (see Allison, page 50)

  5. Today we’ll introduce some variable-selection procedures that most of us do not recommend using (see, e.g., Allison, pages 92-93; Mendenhall & Sincich, chapter 6). • More important, we’ll then examine a non-automated, conceptually guided & systematic approach to selecting variables—the way we should do things.

  6. Automated Procedures • What if we have lots of potential explanatory variables but we have no clear reasons to guide us in selecting them for a model? • An automated approach to the problem is stepwise regression: • . sw regress y x1 x2 x3…xk, options • Load the example dataset: • . use stepwise, clear

  7. . corr x* • . collin x* [a set of collinearity statistics] • Forward stepwise selection: • . sw regress y x1 x2 x3 x4 x5 x6, pe(.99) • Set ‘parameter entry’ to .99 so that all the variables will enter & their p-value order can be observed.

  8. . sw regress y x1 x2 x3 x4 x5 x6, pe(.25) • Set ‘parameter entry’ to .25 so that only variables with p-value<.25 enter the model. • Logic of forward stepwise selection: • Use stepwise to fit a model of y on the constant only. • Stepwise considers adding x1, then x2, then…x6. • Stepwise finds the x-variable that is most significant statistically. • In our example, if that variable’s p-value is <.25, stepwise adds it to the model, then repeats the search among the remaining variables. • pe: ‘eligible for addition’
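The forward logic on this slide can be sketched in Python. This is a minimal illustration of the control flow only, not Stata's implementation: `forward_select` and its `p_if_added` callback are hypothetical names, and the toy p-values stand in for what a real regression fit would report.

```python
from typing import Callable, List, Sequence


def forward_select(candidates: Sequence[str],
                   p_if_added: Callable[[List[str], str], float],
                   pe: float = 0.25) -> List[str]:
    """Greedy forward selection: keep adding the most significant eligible term."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining:
        # p-value each remaining variable would have if added to the current model
        pvals = {x: p_if_added(selected, x) for x in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= pe:  # nothing left is eligible for addition
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical stand-in for a regression fit: fixed p-values that ignore the
# current model (in real data they change every time a variable is added).
toy_p = {"x1": 0.40, "x2": 0.01, "x3": 0.20, "x4": 0.90}
chosen = forward_select(list(toy_p), lambda current, x: toy_p[x], pe=0.25)
print(chosen)
```

In practice the callback would refit the regression at every step, which is exactly why the selected set depends so heavily on the particular sample.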

  9. Backward stepwise selection: • . sw regress y x1 x2 x3 x4 x5 x6, pr(.99) • Set ‘parameter removal’ to .99 so that virtually all the variables are retained & their p-values can be observed. • . sw regress y x1 x2 x3 x4 x5 x6, pr(.25) • Set ‘parameter removal’ to .25 so that only the variables with p-value<.25 are retained. • pr: ‘eligible for removal’

  10. Logic of backward stepwise selection: • Use stepwise to fit a model of y on x1…x6. • Stepwise considers dropping x1, then x2, then…x6. • Stepwise finds the x-variable that’s least significant statistically. • In our example, if that variable’s p-value is >=.25, stepwise removes it from the model, then repeats the search among the remaining variables.
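A matching sketch of backward elimination, under the same caveats: `backward_eliminate` and `p_in_model` are hypothetical names, and the fixed p-values are made up for illustration rather than taken from any fitted model.

```python
from typing import Callable, Dict, List, Sequence


def backward_eliminate(candidates: Sequence[str],
                       p_in_model: Callable[[List[str]], Dict[str, float]],
                       pr: float = 0.25) -> List[str]:
    """Start from the full model and repeatedly drop the least significant term."""
    selected = list(candidates)
    while selected:
        pvals = p_in_model(selected)      # p-value of each term in the current model
        worst = max(pvals, key=pvals.get)
        if pvals[worst] < pr:             # every remaining term is significant enough
            break
        selected.remove(worst)
    return selected


# Hypothetical stand-in for a regression fit: fixed p-values per variable.
toy_p = {"x1": 0.40, "x2": 0.01, "x3": 0.20, "x4": 0.90}
kept = backward_eliminate(list(toy_p), lambda model: {x: toy_p[x] for x in model}, pr=0.25)
print(kept)
```

With real data, forward and backward selection need not agree, because each variable's p-value depends on which other variables are in the model at that step.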

  11. Stepwise selection may prove helpful in exploratory data analysis, but it is fraught with serious problems: (1) Most basically, it capitalizes on sample-specific chance: with a large enough pool of variables, the procedure will find statistically significant results by chance, based on the particular sample’s quirks.

  12. (2) In another sample, the procedure is likely to select different variables. (3) And in the case of any sample, stepwise cannot take into account theoretical or practical significance: there’s nothing to keep it from selecting nonsensical variables.
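Problem (1) can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not part of the original slides: y and all 200 candidate predictors are pure noise, yet some predictors still pass a two-sided 5% test. The critical value 1.984 is the approximate t cutoff for 98 degrees of freedom.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


random.seed(12345)
n_obs, n_vars = 100, 200
T_CRIT = 1.984  # approximate two-sided 5% critical value of t with 98 df

# y is unrelated to every candidate predictor by construction.
y = [random.gauss(0, 1) for _ in range(n_obs)]

false_hits = 0
for _ in range(n_vars):
    x = [random.gauss(0, 1) for _ in range(n_obs)]
    r = pearson_r(x, y)
    t = r * math.sqrt((n_obs - 2) / (1 - r * r))  # t-statistic of the bivariate slope
    if abs(t) > T_CRIT:
        false_hits += 1

print(false_hits, "of", n_vars, "noise predictors look 'significant' at the .05 level")
```

Roughly 5% of the noise predictors should clear the threshold on average, and a stepwise procedure would select precisely those.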

  13. What’s a much more constructive & defensible approach? • Use conceptual criteria to select the pool of possible variables, then use some combination of conceptual & model-fitting criteria to narrow the pool.

  14. A Conceptually based Approach • Part I • How is outcome variable y conceptualized? What are its topic-specific, as well as broader political, social, & cultural, premises (e.g., IQ, race, gender)? In what ways are these valid or not? How do they contribute to the social construction of reality? • What is it about y that needs to be explained, & why? What are the topic-specific, as well as broader political, social, & cultural, premises of the question (e.g., IQ, race, gender)? How do they pertain to the social construction of reality?

  15. Within a solid conceptual framework, what independent variables are likely to have causal effects on the dependent variable? • And what independent variables—having causal effects on the dependent variable as well as correlations with other independent variables—need to be controlled?

  16. Concerning the data for y, are the sample, the broader study design & procedures, & the variable’s measurement valid or not: • Temporally speaking, for the implied X/Y relationship? • Technically speaking, for measurement premises (i.e. ordinal or interval quantitative variable, or some kind of categorical variable)?

  17. Allison (pages 52-57) lays out other basic questions to ask: • Based on our knowledge of the topic, does the dependent variable affect any of the independent variables? • Reverse causation may bias the estimated coefficients of the independent variables. • There’s not much that can be done about it. • If the problem seems serious enough, re-conceptualize the model.

  18. Are there omitted variables? • This causes bias to some degree or another.

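  19. Are the variables measured well? • There’s bias to the degree that they are not measured well. • Variables measured with greater error: their coefficients tend to be biased toward zero (i.e. underestimated values). • Variables measured with less error: their coefficients tend to be biased away from zero (i.e. overestimated values), because they pick up part of the effect of correlated variables that are poorly measured.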
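The bias toward zero can be seen in a simulated bivariate case. This is a sketch under stated assumptions (made-up data, true slope 2.0, measurement error with the same variance as the true x), so the noisy slope should land near half the true one. It does not illustrate the away-from-zero case, which requires several correlated predictors.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


random.seed(2024)
n = 5000
TRUE_SLOPE = 2.0  # assumed for the illustration

x_true = [random.gauss(0, 1) for _ in range(n)]
y = [TRUE_SLOPE * x + random.gauss(0, 1) for x in x_true]
# Observed x = true x plus random measurement error of equal variance.
x_noisy = [x + random.gauss(0, 1) for x in x_true]

slope_clean = ols_slope(x_true, y)
slope_noisy = ols_slope(x_noisy, y)
print(round(slope_clean, 2), round(slope_noisy, 2))
```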

  20. Do some independent variables mediate the effects of others on the dependent variable? • This is a key question for understanding the causal processes within a model. • An independent variable’s total effect=direct effect + indirect effects • Test nested models to find out.
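For a single mediator, the decomposition can be written out as follows. The symbols (x for the explanatory variable, m for the mediator, and the Greek coefficients) are notation assumed here for illustration, not taken from the slides.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x + \beta_2 m + \varepsilon && \text{(full model)}\\
m &= \gamma_0 + \gamma_1 x + u && \text{(mediator model)}\\
\text{total effect of } x &= \underbrace{\beta_1}_{\text{direct}} + \underbrace{\gamma_1 \beta_2}_{\text{indirect via } m}
\end{align*}
```

The coefficient on x in the reduced model (y on x alone) estimates the total effect, so comparing it with the coefficient on x in the full model shows how much of the effect runs through m.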

  21. Is there multicollinearity? • Rules of thumb: VIF>10; tolerance<.1; condition index>15. • If there is multicollinearity, the standard errors become inflated. • This makes it harder to detect statistical significance. • Note: multicollinearity is a problem for hypothesis-testing models but not for predictive models.
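The first two rules of thumb are the same criterion expressed two ways. Writing R_j^2 for the R-squared from regressing explanatory variable x_j on all the other explanatory variables:

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{tolerance}_j = 1 - R_j^2 = \frac{1}{\mathrm{VIF}_j}
\]
```

So VIF>10 and tolerance<.1 both mean that more than 90% of the variance of x_j is explained by the other explanatory variables. The standard error of the coefficient on x_j is inflated by a factor of the square root of VIF_j relative to the case where x_j is uncorrelated with the others.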

  22. Taking all of this into account, here’s a series of questions to ask:

  23. Part I • Is the sample, including its size, adequate for the study’s purpose? • What is the dependent variable? How is it defined & measured? Is it well measured? Does it possibly have effects on the explanatory variables? If so, to what degree? • What explanatory variables should be included in the model, & why? To what extent are data on these variables available or collectible? • How is each potential explanatory variable defined & measured? Is it well measured?

  24. Part II • Document, in terms of the literature and your knowledge of the topic, the hypothesized relationship of each explanatory variable to y (see McClendon, chap. 3; Agresti/Finlay, chap. 10; King et al.): (1) Linear, independent (untransformed quantitative variable)? (2) Same slope but unequal y-intercepts (dummy variables, including multinomial categorical)? (3) Critical thresholds (categorical binary or ordinal)?

  25. (4) Increasing or decreasing effect (quadratic or log)? (5) Dependent on the level of another explanatory variable (interactional, meaning unequal slopes)? (6) Or some combination of these?
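Each hypothesized form corresponds to a different model specification. The equations below are a sketch in assumed notation (x a quantitative explanatory variable, d a dummy, z a second explanatory variable, d_1 and d_2 category indicators built from x), not formulas from the cited texts.

```latex
\begin{align*}
\text{(1) linear:} \quad & y = \beta_0 + \beta_1 x + \varepsilon\\
\text{(2) same slope, unequal intercepts:} \quad & y = \beta_0 + \beta_1 x + \beta_2 d + \varepsilon\\
\text{(3) thresholds:} \quad & y = \beta_0 + \beta_1 d_1 + \beta_2 d_2 + \varepsilon\\
\text{(4) increasing or decreasing effect:} \quad & y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon
  \quad \text{or} \quad y = \beta_0 + \beta_1 \ln x + \varepsilon\\
\text{(5) interaction, unequal slopes:} \quad & y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz + \varepsilon
\end{align*}
```

In form (5) the slope of y on x is \beta_1 + \beta_3 z, so it changes with the level of z.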

  26. Part III • Univariate analysis: Graphically & numerically describe y’s distribution: overall pattern & striking deviations—i.e. shape, center, & spread, including notable outliers. • Should y be transformed or not, & why? • Univariate analysis: Graphically & numerically describe the distribution of each explanatory variable: overall pattern & striking deviations—i.e. shape, center, & spread, including notable deviations. • Should any of the explanatory variables be transformed or not, and why?

  27. Part IV • List the explanatory variables (including transformations & interactions) in order of their conceptual importance to y, & explain this order as well as the hypothesized form of each relationship (see McClendon, chap. 3).

  28. Part V • Bivariate analysis: Graphically & numerically describe the bivariate relationship of each explanatory variable to y (see McClendon, pages 107-116; Agresti/Finlay, chap. 10). • Bivariate analysis: Graphically & numerically describe the bivariate relationships of the explanatory variables to each other (see McClendon, pages 107-116; Agresti/Finlay, chap. 10).

  29. Bivariate analysis, controlling another explanatory variable: Graphically & numerically describe the bivariate relationship of each explanatory variable to y, sequentially holding another explanatory variable constant (see McClendon, pages 107-116; Agresti/Finlay, chap. 10). • Estimate a bivariate regression model of each explanatory variable’s relation to y, noting the value and p-value of each coefficient.

  30. Part VI • After completing the preparatory data analysis, estimate & assess multiple regression models: (1) Estimate a preliminary main effects model (i.e. without curvilinear terms). • Use mark/markout to ensure equal # observations. • How have any bivariate relationships changed? • For now, eliminate the explanatory variables that test insignificant. (2) Conduct a nested-model F-test, also comparing # observations (which must be equal for the nested tests), Adj R2, & the explanatory variables’ signs & coefficients, standard errors, p-values, & confidence intervals.

  31. (3) One by one drop each explanatory variable & compare the models regarding Adj R2, slope coefficients (direction & size), standard errors, p-values, & confidence intervals. (4) For the time being, drop any insignificant variables: this is the preliminary main effects model. (5) Re-check for any other possible y/x curvilinearities by combining qladder/ladder with any other of the following commands: qladder x1, ladder x1; sparl y x1 (& options); locpoly y x1, de(#); twoway mband y x1, ba(#); lowess y x1, bw(.#); scatter y x1 || qfit (& fpfit) y x1.

  32. (6) Re-estimate the model as necessary to explore curvilinear y/x relationships, comparing Adj R2, coefficients (direction & size), standard errors, p-values, & confidence intervals. (7) Consider & possibly explore whether or not it makes sense to collapse or otherwise revise the categories of categorical variables. (8) Re-estimate the model as necessary to explore revised y/x relationships in terms of collapsed &/or otherwise revised categorical variables.

  33. (9) Consider all possible substantively meaningful interactions. (a) One by one add each such interaction to the model, comparing # observations, Adj R2, & the explanatory variables’ coefficients, standard errors, p-values, & confidence intervals. (b) Add all of the interactions that tested significant to the model; conduct all possible nested model tests, also comparing # observations (which must be equal for the nested tests), Adj R2, & the explanatory variables’ coefficients, standard errors, p-values, & confidence intervals.

  34. (c) Add all the variables that had previously been dropped; conduct all possible nested model tests, also comparing #observations (which must be equal for the nested tests), Adj R2, & the explanatory variables’ coefficients (direction & size), standard errors, p-values, & confidence intervals.

  35. (10) Drop those explanatory variables that are characterized by some combination of weak conceptual relationship with y & statistical insignificance, or which somehow detract from the clarity of the model. (a) Estimate this—the preliminary final model. (b) Conduct the battery of graphic & numerical diagnostic tests.

  36. (11) Re-add any other variables that are conceptually/theoretically relevant. (a) Estimate this—the final, complete model. (b) Conduct the battery of graphic & numerical diagnostic tests. (c) Test nested models (see Mendenhall & Sincich), conducting the diagnostic tests for each model.
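The nested-model F-test used throughout steps (2) to (11) compares the residual sum of squares (RSS) of a restricted model with that of the full model containing it. The sketch below uses the standard formula; the function name `nested_f` and the example numbers are hypothetical, and the equal-observations check mirrors the requirement stated in the steps above.

```python
def nested_f(rss_restricted: float, rss_full: float,
             n_restricted: int, n_full: int,
             k_full: int, q: int) -> float:
    """F-statistic for H0: the q extra coefficients in the full model are all zero.

    k_full is the number of explanatory variables in the full model
    (excluding the constant); q is the number of variables dropped
    to obtain the restricted model.
    """
    if n_restricted != n_full:
        raise ValueError("nested models must be fit on the same observations")
    df_denominator = n_full - k_full - 1
    return ((rss_restricted - rss_full) / q) / (rss_full / df_denominator)


# Made-up numbers: dropping 2 of 3 variables raises RSS from 100 to 120, n = 53.
f_stat = nested_f(rss_restricted=120.0, rss_full=100.0,
                  n_restricted=53, n_full=53, k_full=3, q=2)
print(round(f_stat, 3))
```

The statistic is compared with the F distribution on q and n - k_full - 1 degrees of freedom; a large value indicates that the dropped variables jointly improve the fit.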
