200 likes | 432 Views
Linear Modelling III. Richard Mott Wellcome Trust Centre for Human Genetics. Synopsis . Comparing non-nested models Building Models Automatically Mixed Models. Comparing Non-nested models. There is no equivalent to the partial F test when comparing non-nested models.
E N D
Linear Modelling III Richard Mott Wellcome Trust Centre for Human Genetics
Synopsis • Comparing non-nested models • Building Models Automatically • Mixed Models
Comparing Non-nested models • There is no equivalent to the partial F test when comparing non-nested models. • The main issue is how to compensate for the fact that a model with more parameters will tend to fit the data better (in the sense of explaining more variance) than a model with fewer parameters • But a model with many parameters tends to be a poorer predictor of new observations, because of over-fitting • However, several criteria have been proposed for model comparison • AIC (Akaike Information Criterion) where L = the likelihood for the data, p= # parameters For Linear Models, where n is the number of observations and L is the maximized value of the likelihood function for the estimated model. Among models with the same number of observations n, choose the model which minimises the simplified AIC
BIC (Bayesian Information Criterion) • The BIC penalizes p more strongly than the AIC - ie it prefers models with fewer paramters • In R: f <- lm( formula, data) aic <- AIC(f) bic <- AIC(f, k = log(nrow(data)))
Building Models Automatically • Example: • We wish to find a set of SNPs that jointly explain variation in a quantitative trait • There will be (usually) far more SNPs s than data points n • There are a vast number 2s of possible models • No model can contain more than n-1 parameters (model saturation) • Forward Selection: • Start with a small model and augment the model step by step so long as the improvement in fit satisfies a criterion (eg AIC or BIC, or partial F test) • At each step, add the variable which maximises the improvement in fit • Backwards Elimination: • Start with a very large model and subtract terms from the model step by step • At each step, delete the variable that minimises the decrease in fit • Forward-Backward • At each step, either add or delete a term depending on which optimises a criterion • in R, stepAIC • Model Averaging • Rather than try to find a single model, integrate over many plausible models • Bootstrapping • Bayesian Model Averaging
Random EffectsMixed Models • So far our models have had fixed effects • each parameter can take any value independently of the others • each parameter estimate uses up a degree of freedom • models with large numbers of parameters have problems • saturation - overfitting • poor predictive properties • numerically unstable and difficult to fit • in some cases we can treat parameters as being sampled from a distribution – random effects • only estimate the parameters needed to specify that distribution • can result in a more robust model
Example of a Mixed Model • Testing genetic association across a large number of of families • yi = trait value of i’th individual • bi = genotype of individual at SNP of interest • fi = family identifier (factor) • ei = error, variance s2 • y = m + b + f + e • If we treat family effects f as fixed then we must estimate a large number of parameters • Better to think of these effects as having a distribution N(0,t2) for some variance t which must be estimated • individuals from the same family have the same f and are correlated • cov = t2 • individuals from different families are uncorrelated • genotype parameters b still treated as fixed effects • mixed model
Mixed Models • y = m + b + f + e • Fixed effects model • E(y) = m + b + f • Var(y) = Is2 • I is identity matrix • Mixed model • E(y) = m + b • Var(y) = Is2 + Ft2 • F is a matrix, Fij = 1 if i, j in same family • Need to estimate both s2 and t2
Generalised Least Squares • y = Xb + e • Var(y) = V (symmetric covariance matrix) • V = Is2 (uncorrelated errors) • V = Is2 + Ft2 (single grouping random effect) • V = Is2 + Gt2 (G = genotype identity by state) • GLS solution, if V is known • unbiased, • efficient (minimum variance) • consistent (tends to true value in large samples) • asymptotically normally distributed
Multivariate Normal Distribution • Joint distribution of a vector of correlated observations • Another way of thinking about the data • Estimation of parameters in a mixed model is special case of likelihood analysis of a MVN • y ~ MVN(m, V) • m is vector of expected values • V is variance-covariance matrix • If V = Is2 then elements of y are uncorrelated and equivalent to n independent Normal variables • probability density/Likelihood is
Henderson’s Mixed Model Equations • General linear mixed model. • b are fixed effect • u are random effects • X, Z design matrices
Mixed Models • Fixed effects models are special cases of mixed models • Mixed models sometimes are more powerful (as fewer parameters) • But check that the assumption effects are sampled from a Normal distribution is reasonable • Differences between estimated parameters and statistical significance in fixed vs mixed models • Shrinkage: often random effect estimates from a mixed model have smaller magnitudes than from a fixed effects model.
Mixed Models in R • Several R packages available, e.g. • lme4 general purpose mixed models package • emma for genetic association in structured populations • Formulas in lme4, eg test for association between genotype and trait • H0: E(y) = m • vs • H1: E(y) = m + b • Var(y) = Is2 + Ft2 • h0 = lmer(y ~ 1 +(1|family), data=data) • h1 = lmer(y ~ genotype +(1|family), data=data) • anova(h0,h1)
Formulas in lmer() • y ~ fixed.effects + (1|Group1) + (1|Group2) etc • random intercept models • y ~ fixed.effects +(1|Group1) + (x|Group1) • random slope model
Example: HDL Data • > library(lme4) • > b=read.delim("Biochemistry.txt”) • cc=complete.cases(b$Biochem.HDL) • > f0=lm(Biochem.HDL ~ Family , data=b,subset=cc) • > f1=lmer(Biochem.HDL ~ (1|Family) , data=b,subset=cc) • > anova(f0) • Analysis of Variance Table • Response: Biochem.HDL • Df Sum Sq Mean Sq F value Pr(>F) • Family 175 147.11 0.84064 5.2098 < 2.2e-16 *** • Residuals 1681 271.24 0.16136 • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • > f1 • Linear mixed model fit by REML • Formula: Biochem.HDL ~ (1 | Family) • Data: b • Subset: cc • AIC BIC logLik deviance REMLdev • 2161 2178 -1078 2149 2155 • Random effects: • Groups Name Variance Std.Dev. • Family (Intercept) 0.066633 0.25813 • Residual 0.161487 0.40185 • Number of obs: 1857, groups: Family, 176 • Fixed effects: • Estimate Std. Error t value • (Intercept) 1.60615 0.02263 70.97 • > plot(resid(f0),resid(f1))
Comparing Models containing Missing Data • R will silently omit rows of data containing missing elements, and adjust the df accordingly • R will only compare models using anova() if the models have been fitted to identical observations. • Sporadic missing values in the explanatory variables can cause problems, because models may have a different numbers of complete cases • Solution is to use the R function complete.cases() to identify those rows with complete data in the most general model, and specify these rows for modelling: cc <- complete.cases(data.frame) f <-lm( formula, data, subset=cc)
Joining two data frames on a common column • Frequently data to be analysed are in several data frames, with a column of unique subject ids used to match the rows. • Unfortunately, often the subjects differ slightly between data frames, eg frame1 might contain phenotype data and frame2 contain genotypes. Usually these two sets don’t agree perfectly because there will be subjects with genotypes but no phenotypes and vice versa • Solution 1: Make a combined data frame containing the intersection of the subjects intersected <- sort(intersect(frame2$id, frame1$id)) combined <- cbind( frame1[match(intersect,frame1$id),cols1], frame2[match(intersect,frame2$id),cols2]) • Solution 2: Make a combined data frame containing the union of the subjects union <- unique(sort(c(as.character(frame1$id),as.character(frame2$id))) combined <- cbind( frame1[match(union,frame1$id),cols1],frame2[match(union,frame2$id),cols2]) • cols1 is a list of the column names to include in frame1, • cols2 is a list of the column names to include in frame2