Deletion Diagnostics for detection of influential observations from a Generalised Linear Mixed Model

B. Ganguli, S. Sen Roy, Dept. of Statistics, University of Calcutta, India. M. Naskar, National Institute of Research for Jute and Allied Fibre Technology, India. E. J. Malloy, Dept. of Statistics, American University, USA. E. A. Eisen, Depts. of Environmental Health, Harvard University & Environmental Health Sciences, UC Berkeley, USA.

Motivation • Need to model nonlinear dose-response relationships while accounting for outliers and influential observations that may affect this relationship; both are common problems in environmental epidemiology • Heterogeneity in response to toxic exposures is a possible explanation for outliers in models of the health effects of environmental exposures; it may lead to unusual shapes of the dose-response relationship - for example, healthy survivors may be the ones who reach the largest exposure levels

Example: Silica Exposure Study (Checkoway et al., 1997, Amer. J. Epidemiology) • Cohort mortality study of 2342 male workers exposed to crystalline silica (cristobalite) in a diatomaceous earth mining and processing facility in California • Study period: 1942 – 1994 • Each worker was employed for at least 12 months • Mortality excesses detected for nonmalignant respiratory diseases (NMRD) and lung cancer

77 deaths from lung cancer in the cohort during the follow-up period • Q. 1: Do the outliers and influential observations occur only at the high extremes of exposure? • Q. 2: How do these outliers/influential observations affect the dose-response relationship? • Study using (i) GLM model (ii) deletion diagnostics

Linear Model: $E(y) = \mu = X\alpha$, $\alpha$ fixed effect • Linear Mixed Model: $E(y \mid b) = X\alpha + Zb$, $b$ random effect (accounts for correlation) • Generalized Linear Model: $\eta = g(\mu) = X\alpha$, $g(\cdot)$ some function of $\mu$ (accounts for non-linearity) • Combining both extensions gives the Generalized Linear Mixed Model

Generalized Linear Mixed Model • $n$ individuals • response: $y_i$ • covariates: $x_i$ associated with fixed effects, $z_i$ associated with random effects • $\alpha$: $p$-vector of fixed effects • $b$: $q$-vector of random effects • The fixed effects model the mean of $y$, whereas the random effects govern the variance-covariance structure.

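Model • $E(y_i \mid b) = \mu_i$ and $\mathrm{Var}(y_i \mid b) = a_i v(\mu_i)$, where the $a_i$ are known scalars and $v(\cdot)$ is a known function • linear predictor: $\eta_i = g(\mu_i) = x_i^\top\alpha + z_i^\top b$, $g(\cdot)$ some known function • $b \sim N(0, D)$, where $D = \mathrm{diag}(\sigma_j^2 I_{q_j})_{j=1,\dots,k}$ and $\sum_j q_j = q$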

Working vector: $Y = \eta + (y - \mu)\,g'(\mu)$ • assuming canonical link: $W = \mathrm{diag}\{[a_i g'(\mu_i)]^{-1}\}$, $V = W^{-1} + ZDZ^\top$, $Q = V^{-1} - V^{-1}X(X^\top V^{-1}X)^{-1}X^\top V^{-1}$

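Normal equations • Let $Z = [Z_1, \dots, Z_k]$ • (1) [equation not preserved in the transcript] • $Y^\top Q Z_j Z_j^\top Q Y = \mathrm{trace}(Q Z_j Z_j^\top)$, $j = 1, \dots, k$ (2) • Implication: fitting a series of linear models on transforms of the original data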
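The quantities entering the normal equations can be sketched numerically. The following is a minimal illustration only, assuming a Poisson response with log link (so $g'(\mu) = 1/\mu$ and $a_i = 1$), a single random intercept per cluster, and made-up values for the current estimates; the toy design, `sigma2`, and all variable names are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy design: n observations, p fixed effects,
# one random intercept for each of k clusters.
n, p, k = 12, 2, 3
X = np.column_stack([np.ones(n), rng.uniform(size=n)])
cluster = np.repeat(np.arange(k), n // k)
Z = np.eye(k)[cluster]                  # n x k cluster indicator matrix

# Illustrative "current" estimates (assumed values, not fitted).
alpha = np.array([0.2, 0.5])
b = np.array([-0.1, 0.0, 0.1])
sigma2 = 0.5                            # assumed variance component

eta = X @ alpha + Z @ b                 # linear predictor
mu = np.exp(eta)                        # inverse of the log link
y = rng.poisson(mu)

g_prime = 1.0 / mu                      # g(mu) = log(mu), so g'(mu) = 1/mu
a = np.ones(n)                          # known scalars a_i (1 for Poisson)

Y = eta + (y - mu) * g_prime            # working vector
W_inv = np.diag(a * g_prime)            # W^{-1} = diag(a_i g'(mu_i))
D = sigma2 * np.eye(k)
V = W_inv + Z @ D @ Z.T
V_inv = np.linalg.inv(V)
Q = V_inv - V_inv @ X @ np.linalg.inv(X.T @ V_inv @ X) @ X.T @ V_inv

# Two sides of the variance-component equation (2); they agree only
# at the solution, which these illustrative values are not.
lhs = Y @ Q @ Z @ Z.T @ Q @ Y
rhs = np.trace(Q @ Z @ Z.T)
```

By construction $QX = 0$, so $Q$ removes the fixed-effect part of the working vector; this is why the variance-component equation can be written in terms of $QY$ alone.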

Deletion Diagnostic • Delete one observation at a time and re-fit the model. • Observe the differences: dfbeta = full-set estimate – deleted-set estimate; dffit = full-set predictor – deleted-set predictor • If these differences are large, the deleted observation has an unusually large impact on the estimates and hence is an influential observation.

Q: Then, given n observations, do we need to fit the model (n+1) times to identify the influential observations? Since iterative techniques are required to solve the normal equations, even a single fit takes considerable time, so (n+1) fits would be computationally expensive, particularly if n is large. • No, we only need to fit the model once, with the full data set. • The dfbeta and dffit can be obtained from the leverages and residuals of this single fit.
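The single-fit idea is easiest to see in the ordinary linear model, where the identity is exact: $\hat\beta - \hat\beta_{(i)} = (X^\top X)^{-1}x_i e_i/(1 - h_{ii})$. The sketch below checks this against brute-force refitting. It is an illustration for ordinary least squares with simulated data, not the GLMM expressions derived in the talk, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated linear-model data with one planted outlier (illustrative).
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
y[0] += 8.0

# Single full-data fit: estimates, residuals and leverages h_ii.
XtX_inv = np.linalg.inv(X.T @ X)
beta_full = XtX_inv @ X.T @ y
resid = y - X @ beta_full
lev = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Closed-form deletion diagnostics from that single fit.
scaled = resid / (1.0 - lev)
dfbeta_closed = (X @ XtX_inv) * scaled[:, None]   # row i: beta - beta_(i)
dffit_closed = lev * scaled                       # x_i' (beta - beta_(i))

# Brute force: refit n times, deleting one observation each time.
dfbeta_brute = np.empty((n, p))
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    dfbeta_brute[i] = beta_full - beta_i
dffit_brute = np.einsum("ij,ij->i", X, dfbeta_brute)
```

The closed-form route needs one matrix inverse instead of n refits, which is the saving the slide refers to; for a GLMM each refit would itself be iterative, so the gain is larger.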

Question: How do we know that the dfbeta and dffit are large enough to identify the corresponding observation as an influential observation? • The expressions can be suitably standardized, and critical values can then be derived using simulation techniques.
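One way such simulated critical values might be obtained is sketched below for the ordinary linear model, using the maximum Cook's distance as the statistic. This is a linear-model analogue chosen for illustration, not the authors' GLMM procedure; the function names, the 95% level, and the number of simulations are assumptions.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation in an ordinary linear model."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    lev = np.einsum("ij,jk,ik->i", X, XtX_inv, X)      # leverages h_ii
    resid = y - X @ (XtX_inv @ X.T @ y)
    s2 = resid @ resid / (n - p)                       # residual variance
    return resid**2 * lev / (p * s2 * (1.0 - lev) ** 2)

def simulated_critical_value(X, n_sim=2000, level=0.95, seed=0):
    """Quantile of the maximum Cook's distance under the fitted-model null.

    Under a normal linear model, Cook's distance does not depend on the
    true coefficients or error scale, so standard normal responses suffice.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    max_d = np.empty(n_sim)
    for s in range(n_sim):
        y_null = rng.normal(size=n)
        max_d[s] = cooks_distance(X, y_null).max()
    return np.quantile(max_d, level)

# Usage on a hypothetical design matrix.
rng = np.random.default_rng(3)
X_demo = np.column_stack([np.ones(40), rng.uniform(size=40)])
crit = simulated_critical_value(X_demo, n_sim=500)
```

An observation whose Cook's distance exceeds `crit` would then be flagged at the chosen level, with the maximum taken over observations to allow for the fact that all n are screened.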

Aims of this study • Derive the dfbeta and dffit for the GLMM. • Derive the impact of deletion on the variance components (generally ignored in such studies). • Study the probabilistic behaviour of the residuals, so that the variances of dfbeta and dffit can be derived and the diagnostics standardized. • Apply the results to simulated and real-life data sets to assess their performance.

Define • B = (XV-1X) -1XV-1 = [ B1,…,Bn],, Q = [ Q1, …, Qn]

Result : • Standardized residuals : (Cook’s distance)

Application to the Silica Exposure data set • Cox's hazard model: $h(t \mid x) = h_0(t)\exp(\beta\,\mathrm{hisp} + f(x))$ • $t$: age at which the subject died of lung cancer • hisp: indicator of whether the subject was Hispanic • $x$: cumulative silica exposure • $f(\cdot)$: unknown smooth function

Cook’s distance for the silica data

Standardized dfbeta residual of variance of random effects

Outliers and influential observations need not occur at the highest extremes of exposure (in fact, all the observations identified as outliers with regard to the fit correspond to low exposure) • Distinction can be made between outliers/ influential observations which affect the fit and those which affect the variance of the random component (the latter are mostly those with high exposures) • The individual with the highest exposure does not affect the fit but affects the variance component

Log hazard with and without outliers for fitted values

Log hazard with and without outliers for both fitted values and variance estimates

The outliers in the variance component affect the shape of the hazard function and are generally associated with high exposure levels. • The outliers in the fit do not change the shape too much except for a sharper dip at the higher exposure levels. These outliers can occur even at low exposure levels.

Clustered Data Example (a simulation study) • $k$ clusters (say, hospitals) • data in the form of counts: $y_{ij} \sim \mathrm{Poisson}(\lambda_{ij})$, $i = 1, \dots, k$, $j = 1, \dots, n_i$, with $\log(\lambda_{ij}) = b_i + \beta x_{ij}$ • all observations in the $i$th cluster share the same intercept $b_i$ • $x_{ij}$: subject-specific covariate

Set • $k = 9$ • $\beta = 0.5$ • $n_i = 100$ • $b_i$ generated from $N(0, 0.5)$ • $x_{ij}$ generated from the uniform $[0, 1]$ distribution • $y_{ij}$ then generated for clusters 1 and 3–9 • Cluster 2 observations generated using a comparatively high $\lambda_{ij} = 20$
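The simulation design above could be generated along the following lines. This is a sketch under stated assumptions: the second argument of $N(0, 0.5)$ is read as a variance (the slides do not say whether it is a variance or a standard deviation), and the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2024)

k, n_i, beta = 9, 100, 0.5

# Cluster-specific random intercepts; 0.5 is ASSUMED to be the variance.
b = rng.normal(loc=0.0, scale=np.sqrt(0.5), size=k)

# Subject-specific covariate, uniform on [0, 1].
x = rng.uniform(0.0, 1.0, size=(k, n_i))

# Poisson means from the log-linear model log(lambda_ij) = b_i + beta * x_ij.
lam = np.exp(b[:, None] + beta * x)

# Cluster 2 (row index 1) is contaminated with a fixed, high mean of 20.
lam[1, :] = 20.0

y = rng.poisson(lam)                    # k x n_i matrix of counts
cluster_means = y.mean(axis=1)          # raw sample mean per cluster
```

Row `1` of `y` holds the contaminated Cluster 2 counts, so deletion diagnostics computed on the stacked data can be compared with this known source of outliers.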

Plot of standardized dfbeta residuals

estimated cluster means

Standardized dfbetas clearly identify the observations from Cluster 2 as outliers • As expected, the estimated cluster means are larger when Cluster 2 observations are included than when they are excluded.

Thank you
