The odd one(s) out: Using Cook’s distance and Dfbetas to uncover influential cases in multilevel models
Manfred te Grotenhuis (Radboud University Nijmegen), Cologne, SocLife, February 8, 2012
Topics:
• What are influential cases? [outliers, residuals, leverage, Cook’s distance, Dfbetas]
• Influential cases in multilevel analyses [summary of doing multilevel analysis, detecting influential cases in multilevel analysis]
• Examples, tips & tricks
Influential cases: residuals
Leverage: distance between the mean of X and an observation of X.
Residual: difference between observed and predicted value.
Problem: the residual depends on the scaling of Y. Solution: standardize it: SRESID (studentized residual) or, even better, SDRESID (studentized deleted residual).
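To make these three quantities concrete, here is a minimal sketch for a regression with one x-variable. The function name `simple_ols_diagnostics` and the toy data are assumptions made for illustration; the formulas are the usual textbook ones for simple OLS, not taken from the slides.

```python
import math

def simple_ols_diagnostics(x, y):
    """Leverage, SRESID and SDRESID for OLS with one predictor (illustrative)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)  # residual variance, 2 parameters
    # Leverage grows with the squared distance between x_i and the mean of x.
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    # Studentized residual: residual divided by its own standard error.
    sresid = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, leverage)]
    # Studentized deleted residual: same idea, with the case itself left out
    # of the variance estimate (closed form, so no refitting is needed).
    sdresid = [r * math.sqrt((n - 3) / (n - 2 - r * r)) for r in sresid]
    return {"resid": resid, "leverage": leverage,
            "sresid": sresid, "sdresid": sdresid}

# Hypothetical data: the last case lies far from the mean of x.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.0]
diag = simple_ols_diagnostics(x, y)
most_extreme = diag["leverage"].index(max(diag["leverage"]))
```

Note that this uncentered leverage sums to the number of estimated parameters (here 2); some packages report a centered version instead.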
Influential cases: the combination of fit and location
Leverage: distance between the mean of X and an observation of X. A high leverage has the potential to influence the b-coefficient, BUT whether it does depends on SRESID!
Influential cases: COOK’s Distance
• To summarize: influential cases determine the regression slope estimate more than average.
• It depends on 2 things:
  - How well does this case fit in the model? (SRESID)
  - The ‘extremeness’ of the observation (LEVERAGE)
• The combination of these two statistics is called COOK’s DISTANCE (in its formula, p = number of x-variables and h_i = leverage).
• Critical values: 4 / n or a ‘gap’ in absolute value (I will address that on slide 10 & 12).
• MEANING of COOK’s DISTANCE: to what extent does the total of b-estimates (the b-vector) change when we leave out case j?
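As an illustration of how fit and leverage combine, the sketch below uses the common textbook form D = SRESID² / (p + 1) × h / (1 − h) and checks it against the "leave the case out and see how the predictions move" definition. This form is an assumption here, since the slide's own formula is not reproduced in the text; function names and data are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of an OLS line through the points."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope

def residual_variance(x, y, p=1):
    a, b = fit_line(x, y)
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (len(x) - p - 1)

def cooks_distance(x, y, p=1):
    """Cook's D from SRESID and leverage (assumed textbook form)."""
    n = len(x)
    a, b = fit_line(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    s2 = residual_variance(x, y, p)
    out = []
    for xi, yi in zip(x, y):
        h = 1 / n + (xi - mean_x) ** 2 / sxx
        sresid_sq = (yi - a - b * xi) ** 2 / (s2 * (1 - h))
        out.append(sresid_sq / (p + 1) * h / (1 - h))
    return out

def cooks_distance_by_deletion(x, y, p=1):
    """Same statistic, computed by actually refitting without case j."""
    a, b = fit_line(x, y)
    s2 = residual_variance(x, y, p)
    out = []
    for j in range(len(x)):
        a_j, b_j = fit_line(x[:j] + x[j + 1:], y[:j] + y[j + 1:])
        shift = sum(((a + b * xi) - (a_j + b_j * xi)) ** 2 for xi in x)
        out.append(shift / ((p + 1) * s2))
    return out

# Hypothetical data: the last case has high leverage and fits poorly.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.0]
d_formula = cooks_distance(x, y)
d_deleted = cooks_distance_by_deletion(x, y)
flagged = [i for i, d in enumerate(d_formula) if d > 4 / len(x)]  # 4/n rule
```

The agreement between the two functions is the point: Cook's distance can be read either as "fit times location" or as "how far the whole b-vector moves when the case is dropped".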
Influential cases: DfBetas
COOK’s DISTANCE: to what extent does the TOTAL of b-estimates (the b-vector) change when we leave out case j?
One drawback: it says TOTAL, but suppose you are interested in one b-estimate only? Solution: calculate Cook’s D for only one variable. This is called DfBetas.
Step 1: estimate the model and save all b-estimates.
Step 2: estimate the model again, but this time leave out case j, and save all b-estimates again, plus their standard errors.
Step 3: take the difference between the b-estimates, divided by the standard error: DfBetas_j = (b - b(-j)) / se(b(-j)).
Critical value: 2/√n and/or a gap.
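The three steps can be sketched as follows for the slope of a one-predictor OLS model. The function names and the data are hypothetical; the standard error is taken from the model without case j, as step 2 describes.

```python
import math

def fit_slope(x, y):
    """Return (slope, standard error of the slope) for simple OLS."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(sse / (n - 2) / sxx)

def dfbetas_slope(x, y):
    b_full, _ = fit_slope(x, y)                      # step 1
    values = []
    for j in range(len(x)):
        b_del, se_del = fit_slope(x[:j] + x[j + 1:],  # step 2: drop case j
                                  y[:j] + y[j + 1:])
        values.append((b_full - b_del) / se_del)      # step 3
    return values

def dfbetas_cutoff(n):
    """Rule-of-thumb critical value 2 / sqrt(n)."""
    return 2 / math.sqrt(n)

# Hypothetical data: the last case pulls the slope down.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.0]
values = dfbetas_slope(x, y)
flagged = [j for j, v in enumerate(values) if abs(v) > dfbetas_cutoff(len(x))]
```

The sign is informative too: a negative value means the slope estimate is lower with the case included than without it.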
What are influential cases: summary
Influential cases: observations that influence the b-estimates in an undesirable way. In a normal situation every individual observation has no (or almost no) influence on the estimate: it does not matter whether we have it in the model or not. An influential observation alters the estimates too much, in two possible ways:
1) It can alter the total of estimates too much: look at Cook’s Distance.
2) It can alter one particular estimate too much (this may go unnoticed by Cook’s D): look at Dfbetas.
Influential cases can occur everywhere, but when samples are large (n>500) they are not much of an issue (save for severe violations of linearity).
Influential cases: many cases, less influence
Influential cases can occur everywhere, but when samples are large (n>500) they are not much of an issue (save for severe violations of linearity).
Influential cases: Cook’s D in case of large sample, linearity
[Figure: Dfbetas in case of no harmful influential cases (based on slide #9); annotations: ‘gap’, cut-off 2/√1200]
Influential cases: severe non-linearity
Influential cases can occur everywhere, but when samples are large (n>500) they are not much of an issue (save for severe violations of linearity):
Influential cases: Cook’s D in case of non-linearity
[Figure: Dfbetas based on a non-linear relationship (based on slide #11); annotations: Luxemburg, ‘gap’, cut-off 2/√180]
Influential cases: first steps in multilevel analyses, variances
Before we continue with influential cases, I will summarize some basics of multilevel analysis.
Influential cases: first steps in multilevel analyses, variances
First, we would like to know whether there is variation WITHIN level 2 units (say, schools) and variation BETWEEN schools:
Influential cases: Multilevel Equations
First, we have an equation for the within-variance: Y_ij = a_j + e_ij, where Y_ij = dependent variable, a_j = intercept of school j, e_ij = individual error (its variance is the within-variance), i = individual, j = level 2 unit (school).
Second, we have an equation for the between-variance: a_j = B_0 + u_0j, where B_0 = overall intercept and u_0j = school-level deviation (its variance is the between-variance).
Substituting a_j in the first equation gives: Y_ij = B_0 + e_ij + u_0j. A multilevel null (empty) model!
So in plain words: every individual score (Y_ij) depends upon some figure B_0 plus some individual variation (e_ij) plus some level 2 variation (u_0j).
Influential cases: intra class correlation
We have data from 23 schools (519 pupils) and a math test as the Y variable. We would like to know the between and within variances.
* SPSS syntax.
mixed math
  /random intercept | subject(school) covtype(un)
  /print solution g testcov
  /method ml.
ICC = 24.85 / (24.85 + 81.23) = .23
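The ICC arithmetic can be reproduced in a couple of lines. This is only a sketch of the calculation on the two variance estimates reported on the slide; the function name is made up for illustration.

```python
def intraclass_correlation(between_var, within_var):
    """Share of the total variance that lies between level 2 units."""
    return between_var / (between_var + within_var)

# Variance estimates from the null model on the slide.
icc = intraclass_correlation(between_var=24.85, within_var=81.23)
print(f"ICC = {icc:.2f}")
```

Read this as: roughly a quarter of the variation in math scores is located between schools rather than between pupils within schools.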
Influential cases: level 1 variables can reduce level 2 variance
Being white increases Grade by 4.7! The level 2 variance is now 18.67, while it was 24.849 in the null model. Level 1 variance is also reduced (was 81.238). Note that the -2 log-likelihood dropped to 3785.2 (was 3800.776).
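Two quick checks on these numbers: the proportional reduction in level 2 variance, and a likelihood-ratio test on the drop in -2 log-likelihood. The test below assumes the two models differ by exactly one parameter (the effect of being white), which the slide does not state explicitly; for one degree of freedom the chi-square tail probability can be written with the complementary error function.

```python
import math

def variance_reduction(var_null, var_model):
    """Proportion of the null-model variance removed by the predictors."""
    return (var_null - var_model) / var_null

def lr_test_one_df(deviance_null, deviance_model):
    """Likelihood-ratio test, ASSUMING the models differ by one parameter."""
    statistic = deviance_null - deviance_model
    # Chi-square survival function for 1 df: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

level2_reduction = variance_reduction(24.849, 18.67)
lr_statistic, lr_p = lr_test_one_df(3800.776, 3785.2)
print(f"level 2 variance reduced by {level2_reduction:.1%}")
print(f"LR statistic = {lr_statistic:.3f}, p = {lr_p:.5f}")
```

If the second model added more than one parameter, the statistic stays the same but the p-value would have to be taken from a chi-square distribution with the matching degrees of freedom.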
Influential cases: add level 2 variables
Of course one will introduce more level 1 variables: first, to test hypotheses; second, to find a (partial) explanation for the level 2 variance. After we have included all relevant level 1 predictors we move on to step number 2: introduce level 2 variables, here the % whites.
Influential cases: cross level interactions
Now we have done 3 things: 1) estimated the ICC, 2) included level 1 variables, 3) included level 2 variables. But now we might think that the effect of homework varies across schools:
Influential cases: cross level interactions
Effect of homework set to random across schools:
Influential cases: cross level interactions
OK, so we know that the effect of homework depends on what school you are in: in some schools the effect is low, while in others it is high. WHY is that? One explanation: in private schools the quality of staff is higher, so doing homework is less important. This would imply that the effect of homework is stronger when the school is public. The effect of homework when school = private equals .914. When the school is public the effect doubles: .914 + .917 = 1.831!
• More basic info about multilevel: http://www.ru.nl/mt/ml/summer/
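The cross-level interaction simply makes the homework slope a function of school type. A minimal sketch, assuming private schools are coded 0 and public schools 1 (the coding is inferred from the slide's arithmetic, and the function name is hypothetical):

```python
def homework_effect(is_public, main_effect=0.914, interaction=0.917):
    """Slope of homework for a school, given the cross-level interaction."""
    return main_effect + interaction * is_public

effect_private = homework_effect(0)
effect_public = homework_effect(1)
print(f"private: {effect_private:.3f}, public: {effect_public:.3f}")
```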
Influential cases: level 2 units
In multilevel analysis the problem of influential cases will most likely occur at the higher level, where n is smaller. One problem is that SRESID and LEVERAGE cannot be obtained from standard matrix algebra for models other than OLS regression. Snijders & Berkhof [Diagnostic checks for multilevel models. In J. De Leeuw and E. Meijer, editors, Handbook of Multilevel Analysis, pages 141–175. Springer, 2008] provide an elegant solution (although a bit computer time consuming).
Solution: leave each unit J out of the analysis, one at a time, re-estimate the model, and use the outcomes to calculate COOK’s Distance and Dfbetas. This is a flexible approach (one can even change the number of variables used for Cook’s distance; the default is all).
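The deletion loop itself is simple and can be sketched independently of the model being fitted. In the sketch below the estimator is a pooled OLS slope, used purely as a stand-in so the example runs with the standard library; in practice `fit` would be replaced by a call that re-estimates the multilevel model. All names and data are hypothetical.

```python
import math

def fit_slope(rows):
    """Stand-in estimator: pooled OLS of y on x, returns (slope, se)."""
    n = len(rows)
    mean_x = sum(x for _, x, _ in rows) / n
    mean_y = sum(y for _, _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for _, x, _ in rows)
    slope = sum((x - mean_x) * (y - mean_y) for _, x, y in rows) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for _, x, y in rows)
    return slope, math.sqrt(sse / (n - 2) / sxx)

def leave_one_group_out_dfbetas(rows, fit=fit_slope):
    """Drop each level 2 unit in turn, refit, and compute Dfbetas per unit."""
    b_full, _ = fit(rows)
    result = {}
    for group in sorted({g for g, _, _ in rows}):
        kept = [r for r in rows if r[0] != group]
        b_del, se_del = fit(kept)
        result[group] = (b_full - b_del) / se_del
    return result

# Hypothetical data: rows are (school, x, y); school "F" behaves differently.
shifts = {"A": 0.0, "B": 0.2, "C": -0.2, "D": 0.1, "E": -0.1}
wiggle = [0.1, -0.1, 0.1, -0.1]
rows = [(g, x, x + s + w) for g, s in shifts.items()
        for x, w in zip([1, 2, 3, 4], wiggle)]
rows += [("F", x, y) for x, y in zip([7, 8, 9, 10], [2, 1, 2, 1])]

dfbetas = leave_one_group_out_dfbetas(rows)
cutoff = 2 / math.sqrt(len(dfbetas))  # n = number of level 2 units
flagged = [g for g, v in dfbetas.items() if abs(v) > cutoff]
```

Note that the cut-off uses the number of level 2 units, not the number of individuals, because whole units are being deleted.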
Influential cases: level 2 units
• After detecting an influential case one can go two ways:
  1) Leave unit J out of the model and start searching for influential cases again.
  2) Create a dummy for unit J and set it zero on the intercept vector (I. Langford and T. Lewis. Outliers in multilevel data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 161:121–160, 1998), then start searching for influential cases again.
• We provide two macros to use Cook’s Distance and Dfbetas:
  - In MLwiN (it works OK, but it is laborious and computer time-consuming). You can find it on http://www.ru.nl/mt/ic/downloads/
  - In R (it works OK, is easy to use, and is rather fast). You can find it here: http://www.rensenieuwenhuis.nl/r-project/influenceme/
Influential cases: easy to use first aid
Before providing results from the macros, just a few words and pictures. Many scientists (among them Ph.D’s ;-)) start out with complex analyses. But what is wrong with plotting bivariate relationships?
Influential cases: easy-to-use 1st aid
OLS regression on aggregated data (from slide #24)… [Figure: scatterplot with regression line; labelled case: Tanzania]
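This first-aid check amounts to averaging within each level 2 unit and running an ordinary regression on the unit means. A minimal sketch with made-up data and hypothetical function names:

```python
def aggregate_means(rows):
    """Average x and y within each level 2 unit; rows are (unit, x, y)."""
    sums = {}
    for unit, x, y in rows:
        total_x, total_y, count = sums.get(unit, (0.0, 0.0, 0))
        sums[unit] = (total_x + x, total_y + y, count + 1)
    return {u: (tx / c, ty / c) for u, (tx, ty, c) in sums.items()}

def ols_slope(points):
    """Slope of an OLS line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx

# Hypothetical individual-level data for three units.
rows = [("A", 1, 2), ("A", 3, 4), ("B", 5, 5),
        ("B", 7, 9), ("C", 9, 10), ("C", 11, 14)]
means = aggregate_means(rows)
slope = ols_slope(list(means.values()))
```

Plotting the unit means with their labels is then enough to see whether a single unit sits far from the rest.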
Influential cases: MLwiN output from macro
[Output: Dfbetas for all 3 contextual variables (case 90 = Tanzania!); Cook’s Distance + significance]
Influential cases: using R macro
Another example, using the R macro: the bivariate relationship between class structure and outcomes of a test in mathematics:
Influential cases: using R macro
[Figures: before excluding school 7472; after excluding school 7472]