Deletion Diagnostics for detection of influential observations from a Generalised Linear Mixed Model.
B. Ganguli, S. Sen Roy, Dept of Statistics, University of Calcutta, India.
M. Naskar, National Institute of Research for Jute and Allied Fibre Technology, India.
E. J. Malloy, Dept of Statistics, American University, USA.
E. A. Eisen, Depts of Environmental Health, Harvard University & Environmental Health Sciences, UC Berkeley, USA.
Motivation
• Need to simultaneously model nonlinear dose-response relationships and account for outliers and influential observations that may affect this relationship
  - both are common problems in environmental epidemiology
• Heterogeneity in response to toxic exposures is a possible explanation for outliers in models of the health effects of environmental exposures
  - may lead to unusual shapes of the dose-response observations
  - for example, healthy survivors may be exposed to the largest exposure levels
Example: Silica Exposure Study (Checkoway et al., 1997, Amer. J. Epidemiology)
• Cohort mortality study of 2342 male workers exposed to crystalline silica (cristobalite) in a diatomaceous earth mining and processing facility in California.
• Study period: 1942–1994.
• Workers had been employed for at least 12 months.
• Mortality excesses detected for
  - Nonmalignant respiratory diseases (NMRD)
  - Lung cancer
77 deaths from lung cancer in the cohort during the follow-up period
Q. 1: Do the outliers and influential observations occur only at the high extremes of exposure?
Q. 2: How do these outliers/influential observations affect the dose-response relationship?
Study using (i) GLM model (ii) deletion diagnostics
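Linear Model: $E(y) = \mu = X\alpha$, α fixed effect
Linear Mixed Model: $E(y \mid b) = X\alpha + Zb$, b random effect (accounts for correlation)
Generalized Linear Model: $\eta = g(\mu) = X\alpha$, g(.) some function of μ (accounts for non-linearity)
Generalized Linear Mixed Model: combines both extensions, $\eta = g(\mu) = X\alpha + Zb$ with $\mu = E(y \mid b)$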
Generalized Linear Mixed Model
• n individuals
• response: $y_i$
• covariates: $x_i$ associated with fixed effects, $z_i$ associated with random effects
• α: p-vector fixed effect
• b: q-vector random effect
• The fixed effect models the mean of y, whereas the random effect governs the variance-covariance structure.
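Model
$E(y_i \mid b) = \mu_i$ and $\mathrm{Var}(y_i \mid b) = a_i\, v(\mu_i)$, with the $a_i$'s known scalars and $v(\cdot)$ a known function
• linear predictor: $\eta_i = g(\mu_i) = x_i'\alpha + z_i'b$, g(.) some known function
• $b \sim N(0, D)$, where $D = \mathrm{diag}(\theta_j I_{q_j})_{j=1,\dots,k}$, $\sum_j q_j = q$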
$Y = \eta + (y - \mu)\, g'(\mu)$
• assuming canonical link:
$W^{-1} = \mathrm{diag}(a_i\, g'(\mu_i))$
$V = W^{-1} + ZDZ'$
$Q = V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}$
Normal equations
• Let $Z = [Z_1, \dots, Z_k]$   (1)
$Y'QZ_jZ_j'QY = \mathrm{trace}(QZ_jZ_j')$   (2)
• Implication: fitting a series of linear models on transforms of the original data
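The quantities Y, V and Q can be assembled numerically once current values of α and b are available. The following is only a minimal sketch under stated assumptions: a Poisson response with the canonical log link (so g'(μ) = 1/μ), the conditional variance of the working response taken as diag(a_i g'(μ_i)), and a hypothetical helper name `working_quantities` that does not come from the slides.

```python
import numpy as np

def working_quantities(y, X, Z, alpha, b, D, a=None):
    """Poisson/log-link sketch: return working response Y, V and Q."""
    n = len(y)
    a = np.ones(n) if a is None else a
    eta = X @ alpha + Z @ b           # linear predictor
    mu = np.exp(eta)                  # inverse of the canonical log link
    g_prime = 1.0 / mu                # g(mu) = log(mu)  =>  g'(mu) = 1/mu
    Y = eta + (y - mu) * g_prime      # working response
    W_inv = np.diag(a * g_prime)      # assumed Var(Y | b) under the canonical link
    V = W_inv + Z @ D @ Z.T           # marginal covariance of Y
    V_inv = np.linalg.inv(V)
    XtVi = X.T @ V_inv                # X' V^{-1}
    # Q = V^{-1} - V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1}
    Q = V_inv - XtVi.T @ np.linalg.solve(XtVi @ X, XtVi)
    return Y, V, Q
```

A quick sanity check on any implementation of Q is that Q X should be numerically zero, since Q removes the fixed-effect part of the working response.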
Deletion Diagnostic
• Delete one observation at a time and re-fit the model.
• Observe the differences
  dfbeta = full-set estimate – deleted-set estimate
  dffit = full-set predictor – deleted-set predictor
• If these are substantially large, then the deleted observation has an unusually large impact on the estimates and hence is an influential observation.
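The delete-and-refit recipe can be illustrated on an ordinary linear model, which is a deliberately simpler setting than the GLMM treated in this talk. This is a sketch only: the function name `deletion_diagnostics` and the toy data with a planted outlier are assumptions made for illustration.

```python
import numpy as np

def deletion_diagnostics(X, y):
    """Refit least squares n times; return dfbeta (n x p) and dffit (n,)."""
    n, p = X.shape
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    dfbeta = np.empty((n, p))
    dffit = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                      # drop observation i
        beta_del = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        dfbeta[i] = beta_full - beta_del              # full-set minus deleted-set estimate
        dffit[i] = X[i] @ dfbeta[i]                   # change in the predictor at x_i
    return dfbeta, dffit

# Toy data (hypothetical): a straight line with one planted outlier at index 0.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.uniform(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=30)
y[0] += 5.0
dfbeta, dffit = deletion_diagnostics(X, y)
```

The loop makes the cost explicit: one full fit plus n deleted fits, which is what becomes expensive when each fit is itself iterative.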
Q: Then, given n observations, do we need to fit the model (n+1) times to identify the influential observations?
Given that iterative techniques are required to solve the normal equations, even a single fit will take considerable time. So (n+1) fits would be computationally expensive, particularly if n is large.
• No, we simply need to fit the model once with the full data set.
• The dfbeta and dffit can be obtained from the leverages and residuals of this single fit.
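The GLMM expressions are not reproduced in this transcript, but the same idea is familiar from the ordinary linear model, where dfbeta_i = (X'X)^{-1} x_i e_i / (1 − h_ii) and dffit_i = h_ii e_i / (1 − h_ii), with e_i the residual and h_ii the leverage. The sketch below uses that linear-model identity as an analogue only; `single_fit_diagnostics` is a hypothetical name.

```python
import numpy as np

def single_fit_diagnostics(X, y):
    """Least-squares dfbeta and dffit from one fit, via leverages and residuals."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta                               # e_i
    lev = np.einsum("ij,jk,ik->i", X, XtX_inv, X)      # h_ii = x_i' (X'X)^{-1} x_i
    scale = resid / (1.0 - lev)
    dfbeta = (X @ XtX_inv) * scale[:, None]            # row i: (X'X)^{-1} x_i e_i / (1 - h_ii)
    dffit = lev * scale                                # h_ii e_i / (1 - h_ii)
    return dfbeta, dffit
```

Comparing one row of the output against an explicit refit without that observation is a cheap way to confirm the identity on a given data set.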
Question: How do we know that the dfbeta and dffit are sufficiently large to identify the corresponding observation as an influential observation?
• The expressions can be suitably standardized, and critical values can then be derived using simulation techniques.
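One way such a simulated critical value could be obtained is sketched below for the ordinary linear model with normal errors, not for the GLMM itself. Assumptions: the standardized statistic is taken as dffit_i / (s √h_ii), the cutoff is the chosen quantile of the largest absolute statistic over outlier-free simulated data sets, and `simulated_cutoff` is a hypothetical name.

```python
import numpy as np

def simulated_cutoff(X, n_sim=2000, level=0.95, seed=0):
    """Quantile of max_i |standardized dffit_i| under outlier-free normal data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix
    lev = np.diag(H)                           # leverages h_ii
    max_stats = np.empty(n_sim)
    for r in range(n_sim):
        # The statistic does not depend on the true coefficients or error
        # scale in this setting, so standard normal responses suffice.
        y = rng.normal(size=n)
        resid = y - H @ y
        s = np.sqrt(resid @ resid / (n - p))   # residual standard deviation
        std_dffit = np.sqrt(lev) * resid / (s * (1.0 - lev))
        max_stats[r] = np.max(np.abs(std_dffit))
    return np.quantile(max_stats, level)
```

An observation whose standardized dffit exceeds the returned value in absolute terms would then be flagged; using the maximum over observations controls the chance of flagging anything at all in clean data.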
The aims of this study are
• To derive the dfbeta and dffit for the GLMM.
• To derive the impact of deletion on the variance components (generally ignored in such studies).
• To study the probabilistic behaviour of the residuals so that variances of dfbeta and dffit can be derived and standardization can be done.
• To apply the results to simulated and real-life data sets to assess their performance.
Define
• $B = (X'V^{-1}X)^{-1}X'V^{-1} = [B_1, \dots, B_n]$, $Q = [Q_1, \dots, Q_n]$
Result : • Standardized residuals : (Cook’s distance)
Application to the Silica Exposure data set
• Cox's hazard model: $h(t \mid x) = h(t)\exp(\beta\,\mathrm{hisp} + f(x))$
• t: age at which subject died of lung cancer
• hisp: indicator of whether the subject was Hispanic or not
• x: cumulative silica exposure
• f(.): unknown smooth function
• Outliers and influential observations need not occur at the highest extremes of exposure (in fact, all the observations identified as outliers with regard to the fit correspond to low exposure).
• A distinction can be made between outliers/influential observations which affect the fit and those which affect the variance of the random component (the latter are mostly those with high exposures).
• The individual with the highest exposure does not affect the fit but affects the variance component.
Log hazard with and without outliers for both fitted values and variance estimates
• The outliers in the variance component affect the shape of the hazard function and are generally associated with high exposure levels.
• The outliers in the fit do not change the shape much, except for a sharper dip at the higher exposure levels. These outliers can occur even at low exposure levels.
Clustered Data Example (a simulation study)
• k clusters (say, hospitals)
• data in the form of counts
  $y_{ij} \sim \mathrm{Poisson}(\mu_{ij})$, $i = 1, \dots, k$, $j = 1, \dots, n_i$
  $\log(\mu_{ij}) = b_i + \beta x_{ij}$
• all observations in the ith cluster share the same intercept $b_i$
• $x_{ij}$ subject-specific covariate
Set
• k = 9
• β = 0.5
• $n_i$ = 100
• $b_i$ generated from N(0, 0.5)
• $x_{ij}$ generated from the uniform [0, 1] distribution
• $y_{ij}$ then generated for clusters 1 and 3–9
• Cluster 2 observations generated using a comparatively high $\mu_{ij}$ = 20
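The simulation settings listed above can be reproduced roughly as follows. This is a sketch rather than the authors' code: it reads N(0, 0.5) as a normal distribution with variance 0.5, treats the lost slope symbol as a single coefficient `beta`, and uses a hypothetical function name and seed.

```python
import numpy as np

def simulate_clusters(k=9, n_i=100, beta=0.5, var_b=0.5,
                      outlier_cluster=2, outlier_mean=20.0, seed=0):
    """Clustered Poisson counts with one contaminated cluster."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, np.sqrt(var_b), size=k)      # cluster intercepts (variance 0.5 assumed)
    cluster = np.repeat(np.arange(1, k + 1), n_i)    # cluster labels 1..k
    x = rng.uniform(0.0, 1.0, size=k * n_i)          # subject-specific covariate
    mu = np.exp(b[cluster - 1] + beta * x)           # log link: log(mu) = b_i + beta * x
    mu[cluster == outlier_cluster] = outlier_mean    # Cluster 2 gets the high fixed mean
    y = rng.poisson(mu)
    return cluster, x, y

cluster, x, y = simulate_clusters()
```

Keeping the cluster labels alongside the counts makes it straightforward to check afterwards which flagged observations belong to Cluster 2.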
• Standardized dfbetas clearly identify the observations from Cluster 2 as outliers.
• As expected, the estimated cluster means are larger when Cluster 2 observations are included than when they are excluded.