The odd one(s) out:

The odd one(s) out: Using Cook’s distance and Dfbetas to uncover influential cases in multilevel models Manfred te Grotenhuis (Radboud University Nijmegen), Cologne, SocLife February 8, 2012. Topics: What are influential cases? [outliers, residuals, leverage, Cook’s distance, Dfbetas]


Presentation Transcript


  1. The odd one(s) out: Using Cook’s distance and Dfbetas to uncover influential cases in multilevel models. Manfred te Grotenhuis (Radboud University Nijmegen), Cologne, SocLife, February 8, 2012.
     Topics:
     • What are influential cases? [outliers, residuals, leverage, Cook’s distance, Dfbetas]
     • Influential cases in multilevel analyses [summary of doing multilevel analysis, detecting influential cases in multilevel analysis]
     • Examples, tips & tricks

  2. Influential cases: OLS regression

  3. Influential cases: observed and estimated values

  4. Influential cases: residuals
     Leverage: distance between the mean of X and an observation of X.
     Residual: difference between the observed and the predicted value.
     Problem: the residual depends on the scaling of Y. Solution: standardize, using SRESID (studentized residual) or, even better, SDRESID (studentized deleted residual).
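
The three quantities on this slide can be sketched in a few lines of Python for a regression with a single predictor. This is only an illustration: the function name and the data are made up, and leverage is computed in its uncentered form (including the 1/n term), whereas some packages report a centered version without that term.

```python
import math

def leverage_and_residuals(x, y):
    """Leverage, SRESID and SDRESID for a one-predictor OLS regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    # Leverage: 1/n plus the squared distance from the mean of X, scaled by Sxx.
    lev = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    # Studentized residual: residual divided by its estimated standard deviation.
    sresid = [e / math.sqrt(sse / (n - 2) * (1 - h)) for e, h in zip(resid, lev)]
    # Studentized deleted residual: the error variance is estimated
    # without the case itself, which leaves n - 3 degrees of freedom.
    sdresid = []
    for e, h in zip(resid, lev):
        s2_deleted = (sse - e * e / (1 - h)) / (n - 3)
        sdresid.append(e / math.sqrt(s2_deleted * (1 - h)))
    return lev, sresid, sdresid

# Hypothetical data: nine cases on a line plus one far-out case at x = 20.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.8, 9.1, 8.0]
lev, sresid, sdresid = leverage_and_residuals(x, y)
```

In this made-up example the last case has by far the highest leverage, and its SDRESID is larger in absolute value than its SRESID because the case inflates the error variance it is judged against.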

  5. Influential cases: the combination of fit and location
     Leverage: distance between the mean of X and an observation of X. A case with high leverage has the potential to influence the b-coefficient, BUT whether it does depends on its SRESID!

  6. Influential cases: COOK’s Distance
     To summarize: influential cases determine the regression slope estimate more than the average case does. Influence depends on two things:
     • how well the case fits the model (SRESID)
     • the ‘extremeness’ of the observation (LEVERAGE)
     The combination of these two statistics is called COOK’s DISTANCE; in its formula, p = number of x-variables and h_i = leverage.
     Critical values: 4 / n, or a ‘gap’ in the absolute values (addressed on slides 10 and 12).
     MEANING of COOK’s DISTANCE: to what extent does the total of b-estimates (the b-vector) change when we leave out case j?
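
The 'leave out case j' reading of Cook's distance can be sketched directly. The code below is an illustration under assumptions, not the formula from the slide: it handles one predictor only, the function names and data are hypothetical, and it measures the change in the b-vector through the change in all fitted values, scaled by the number of coefficients (p + 1 = 2) and the error variance. For OLS this is equivalent to the common textbook form D_i = (SRESID_i^2 / (p + 1)) * h_i / (1 - h_i).

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - b * mean_x, b

def cooks_distance(x, y):
    """Cook's distance per case, computed by refitting without that case."""
    n = len(x)
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - 2)
    distances = []
    for i in range(n):
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # Squared shift in every fitted value when case i is left out.
        shift = sum((fi - (a_i + b_i * xi)) ** 2 for fi, xi in zip(fitted, x))
        distances.append(shift / (2 * s2))  # 2 = intercept + one slope
    return distances

# Hypothetical data: nine cases on a line plus one far-out case at x = 20.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.8, 9.1, 8.0]
cooks = cooks_distance(x, y)
flagged = [i for i, d in enumerate(cooks) if d > 4 / len(x)]
```

With these made-up numbers only the last case exceeds the 4 / n cut-off; in practice one would also look for a gap between the largest values and the rest, as the slide suggests.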

  7. Influential cases: DfBetas
     COOK’s DISTANCE: to what extent does the TOTAL of b-estimates (the b-vector) change when we leave out case j?
     One drawback: it says TOTAL, but suppose you are interested in one b-estimate only?
     Solution: calculate Cook’s D for one variable only. This is called DfBetas.
     Step 1: estimate the model and save all b-estimates.
     Step 2: estimate the model again, but this time leave out case j, and save all b-estimates again plus their standard errors.
     Step 3: take the difference between the two b-estimates and divide it by the standard error.
     Critical value: 2/√n and/or a gap.
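
The three steps above can be sketched as follows for the slope of a one-predictor regression. This is a minimal illustration with hypothetical names and data; it assumes that the standard error in step 3 is the one saved in step 2, that is, from the model estimated without case j.

```python
import math

def fit_line(x, y):
    """OLS for y = a + b*x; returns (a, b, standard error of b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return a, b, math.sqrt(s2 / sxx)

def dfbetas_slope(x, y):
    """DFBETAS of the slope for every case, by leave-one-out refitting."""
    _, b_full, _ = fit_line(x, y)                        # step 1
    values = []
    for j in range(len(x)):
        _, b_del, se_del = fit_line(x[:j] + x[j + 1:],   # step 2
                                    y[:j] + y[j + 1:])
        values.append((b_full - b_del) / se_del)         # step 3
    return values

# Hypothetical data: nine cases on a line plus one far-out case at x = 20.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.8, 9.1, 8.0]
dfb = dfbetas_slope(x, y)
cutoff = 2 / math.sqrt(len(x))
largest = max(range(len(dfb)), key=lambda j: abs(dfb[j]))
```

Here the last case has the largest absolute DFBETAS and lies far beyond 2/√n; its value is negative because including the case pulls the slope down.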

  8. What are influential cases: summary
     Influential cases: observations that influence the b-estimates in an undesirable way. In a normal situation every individual observation has no (or almost no) influence on the estimates: it does not matter whether it is in the model or not. An influential observation alters the estimates too much, in one of two ways:
     1) It can alter the total of estimates too much: look at Cook’s Distance.
     2) It can alter one particular estimate too much (which may go unnoticed by Cook’s D): look at Dfbetas.
     Influential cases can occur everywhere, but when samples are large (n > 500) they are not much of an issue (save for severe violations of linearity).

  9. Influential cases: many cases, less influence
     Influential cases can occur everywhere, but when samples are large (n > 500) they are not much of an issue (save for severe violations of linearity).

  10. Influential cases: Cook’s D in case of large sample, linearity
      [Plot: Dfbetas in case of no harmful influential cases (based on slide 9), marking the ‘gap’ and the cut-off 2/√1200.]

  11. Influential cases: severe non-linearity
      Influential cases can occur everywhere, but when samples are large (n > 500) they are not much of an issue (save for severe violations of linearity):

  12. Influential cases: Cook’s D in case of non-linearity
      [Plot: Dfbetas based on a non-linear relationship (based on slide 11), marking Luxemburg, the ‘gap’ and the cut-off 2/√180.]
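
For reference, the two cut-offs quoted on slides 10 and 12 work out as below. The helper names are hypothetical; the rules themselves (2/√n for Dfbetas, 4/n for Cook's distance) are the ones given on slides 6 and 7.

```python
import math

def dfbetas_cutoff(n):
    """Rule-of-thumb cut-off for Dfbetas: 2 divided by the square root of n."""
    return 2 / math.sqrt(n)

def cooks_cutoff(n):
    """Rule-of-thumb cut-off for Cook's distance: 4 divided by n."""
    return 4 / n

cutoff_slide_10 = dfbetas_cutoff(1200)  # about 0.058
cutoff_slide_12 = dfbetas_cutoff(180)   # about 0.149
```

The cut-off shrinks as n grows, so in large samples a case needs only a small absolute Dfbetas to cross it; that is one reason to look for a gap as well.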

  13. Influential cases: first steps in multilevel analyses, variances Before we continue with influential cases, I will summarize some basics in multilevel analyses

  14. Influential cases: first steps in multilevel analyses, variances
      First we would like to know whether there is variation WITHIN level 2 units (say, schools) and variation BETWEEN schools:

  15. Influential cases: Multilevel Equations
      First we have an equation for the within-variance:
      $Y_{ij} = a + e_{ij}$
      where $Y_{ij}$ = dependent variable, $a$ = intercept, $e_{ij}$ = within-variance (error in regression), i = individual, j = level 2 unit (school).
      Second we have an equation for the between-variance:
      $a = B_{0j} + u_{0j}$
      where $B_{0j}$ = intercept and $u_{0j}$ = between-variance.
      Substituting $a$ in the first equation gives:
      $Y_{ij} = B_{0j} + e_{ij} + u_{0j}$
      A multilevel null (empty) model!
      So in plain words: every individual score ($Y_{ij}$) depends upon some figure $B_{0j}$ plus some individual variation ($e_{ij}$) plus some level 2 variation ($u_{0j}$).

  16. Influential cases: intra class correlation
      We have data from 23 schools (519 pupils), with a math test as the Y variable. We would like to know the between and within variances.
      * SPSS syntax.
      mixed math
        /random intercept | subject(school) covtype(un)
        /print solution g testcov
        /method ml.
      ICC = 24.85 / (24.85 + 81.23) = .23
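
The ICC arithmetic on this slide can be checked with a small helper. The function name is hypothetical; the two variance estimates are the ones quoted above.

```python
def intraclass_correlation(between_variance, within_variance):
    """Share of the total variance that lies between level 2 units."""
    return between_variance / (between_variance + within_variance)

# Between-school and within-school variances from the null model.
icc = intraclass_correlation(24.85, 81.23)
```

Rounded to two decimals this gives .23: roughly 23% of the variance in the math test lies between schools.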

  17. Influential cases: multilevel null model in MLwiN

  18. Influential cases: level 1 variables can reduce level 2 variance
      Being white increases Grade by 4.7!
      Level 2 variance is now 18.67, while it was 24.849 in the null model.
      Level 1 variance is also reduced (it was 81.238).
      Note that the -2 loglikelihood dropped to 3785.2; it was 3800.776.
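
The two comparisons on this slide can be made explicit with a little arithmetic. The sketch below assumes that the model differs from the null model by exactly one fixed effect, so the drop in -2 loglikelihood is compared with a chi-square distribution with 1 degree of freedom; the function names are hypothetical.

```python
import math

def deviance_test_1df(deviance_null, deviance_model):
    """Drop in -2 loglikelihood and its chi-square p-value, assuming 1 df."""
    drop = deviance_null - deviance_model
    # For 1 df: P(chi-square > drop) = erfc(sqrt(drop / 2)).
    return drop, math.erfc(math.sqrt(drop / 2))

def variance_reduction(variance_before, variance_after):
    """Proportion by which a variance component shrinks."""
    return (variance_before - variance_after) / variance_before

drop, p_value = deviance_test_1df(3800.776, 3785.2)
level2_reduction = variance_reduction(24.849, 18.67)
```

The drop of about 15.6 is well beyond what a single parameter would produce by chance, and the level 2 variance falls by roughly a quarter.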

  19. Influential cases: add level 2 variables
      Of course one will introduce more level 1 variables: first, to test hypotheses; second, to find a (partial) explanation for the level 2 variance. After all relevant level 1 predictors are included, we move on to step 2 and introduce level 2 variables: here, the % whites.

  20. Influential cases: cross level interactions
      Now we have done 3 things: 1) estimated the ICC, 2) included level 1 variables, 3) included level 2 variables. But now we might think that the effect of homework varies across schools:

  21. Influential cases: cross level interactions Effect of homework set random across schools:

  22. Influential cases: cross level interactions
      OK, so we know that the effect of homework depends on which school you are in: in some schools the effect is low, while in others it is high. WHY is that? One explanation: in private schools the quality of staff is higher, so doing homework is less important. This would imply that the effect of homework is stronger when the school is public.
      The effect of homework when school = private equals .914. When the school is public the effect doubles: .914 + .917 = 1.831!
      • More basic info about multilevel: http://www.ru.nl/mt/ml/summer/
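
The cross-level interaction amounts to a slope that depends on school type. A minimal sketch, assuming school type is coded as a dummy with public = 1 and private = 0 (the function and parameter names are hypothetical; the two coefficients are those quoted on the slide):

```python
def homework_effect(public, main_effect=0.914, interaction=0.917):
    """Effect of homework on the math score for a given school type.

    public: 1 for a public school, 0 for a private school (assumed coding).
    """
    return main_effect + interaction * public

effect_private = homework_effect(0)
effect_public = homework_effect(1)
```

The interaction coefficient is simply the difference between the two slopes, which is why the public-school effect is about twice the private-school effect.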

  23. Influential cases: level 2 units
      In multilevel analysis the problem of influential cases will most likely occur at the higher level, where n is smaller. One problem is that SRESID and LEVERAGE are not readily obtained from standard matrix algebra for models other than OLS regression. Snijders & Berkhof [Diagnostic checks for multilevel models. In J. de Leeuw and E. Meijer, editors, Handbook of Multilevel Analysis, pages 141–175. Springer, 2008] provide an elegant solution (although it takes a fair amount of computer time).
      Solution: leave each level 2 unit j out of the analysis, one at a time, re-estimate the model, and use the outcomes to calculate COOK’s Distance and Dfbetas. This is a flexible approach (one can even change the number of variables used for Cook’s distance; the default is all).
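
The leave-one-unit-out loop can be sketched as below. Note the stand-in: to keep the example self-contained, the model that is re-estimated is a plain OLS regression on the pooled pupils, not a multilevel model, so this shows only the deletion logic and not the macros themselves. The school labels and data are hypothetical, and the cut-off assumes that the 2/√n rule is applied with n equal to the number of level 2 units.

```python
import math

def fit_line(points):
    """OLS for y = a + b*x on (x, y) pairs; returns (a, b, SE of b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    a = mean_y - b * mean_x
    s2 = sum((y - (a + b * x)) ** 2 for x, y in points) / (n - 2)
    return a, b, math.sqrt(s2 / sxx)

def unit_dfbetas(units):
    """Slope DFBETAS per level 2 unit, deleting a whole unit at a time."""
    pooled = [p for pupils in units.values() for p in pupils]
    _, b_full, _ = fit_line(pooled)
    result = {}
    for left_out in units:
        rest = [p for name, pupils in units.items()
                if name != left_out for p in pupils]
        _, b_del, se_del = fit_line(rest)  # re-estimate without this unit
        result[left_out] = (b_full - b_del) / se_del
    return result

# Hypothetical schools; school "D" does not follow the pattern of the others.
schools = {
    "A": [(1, 2.1), (2, 2.9), (3, 4.2)],
    "B": [(2, 3.0), (3, 3.9), (4, 5.1)],
    "C": [(3, 4.1), (4, 4.8), (5, 6.0)],
    "D": [(8, 2.0), (9, 1.5), (10, 1.0)],
}
dfb = unit_dfbetas(schools)
cutoff = 2 / math.sqrt(len(schools))
most_influential = max(dfb, key=lambda name: abs(dfb[name]))
```

In a real analysis the call to fit_line would be replaced by re-estimating the full multilevel model, which is what makes the procedure computationally expensive.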

  24. Influential cases: level 2 units
      After detecting an influential case one can go two ways:
      1) Leave unit J out of the model and start searching for influential cases again.
      2) Create a dummy for unit J and set it to zero on the intercept vector (I. Langford and T. Lewis. Outliers in multilevel data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 161:121–160, 1998), then start searching for influential cases again.
      We provide two macros for Cook’s Distance and Dfbetas:
      • In MLwiN (it works, but it is laborious and computer-time-consuming); you can find it at http://www.ru.nl/mt/ic/downloads/
      • In R (it works, is easy to use, and is rather fast); you can find it at http://www.rensenieuwenhuis.nl/r-project/influenceme/

  25. Influential cases: easy-to-use first aid
      Before providing results from the macros, just a few words and pictures. Many scientists (among them Ph.D.s ;-)) start out with complex analyses. But what is wrong with plotting bivariate relationships?

  26. Influential cases: easy-to-use 1st aid OLS regression on aggregated data (from slide #24)… Tanzania

  27. Influential cases: MLwiN output from macro
      Dfbetas for all 3 contextual variables: 90 = Tanzania!
      Cook’s Distance + significance:

  28. Influential cases: MLwiN output from macro

  29. Influential cases: using R macro
      Another example using the R macro: the bivariate relationship between class structure and the outcomes of a mathematics test:

  30. Influential cases: using R macro, dfbetas

  31. Influential cases: using R macro, Cook’s D

  32. Influential cases: using R macro
      [Two panels: before excluding school 7472; after excluding school 7472.]

  33. The odd one(s) out: The end. Thank you for listening!
