180 likes | 304 Views
Statistical Modelling. Harry R. Erwin, PhD School of Computing and Technology University of Sunderland. Resources. Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley. Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).
E N D
Statistical Modelling Harry R. Erwin, PhD School of Computing and Technology University of Sunderland
Resources • Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley. • Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).
Statistical Modelling • Suppose you have a response variable, y, that varies when independent factors or measurements, x1, x2, …, xN, vary. (One covariate can be written x.) • What you want is a model that predicts the value of y as a function of the xi. • Statistical models are written: • g( h(y) ~ f(xi) ) • Where g() describes the model—lm() or aov() is usual. • Where h() describes how to transform the response variable. • Where f(xi) describes what covariates you want in (and out) of the model. To add a covariate, you use +, to remove it, -.
Fitting Models to Data is what R is designed to do • There are five kinds of models • The saturated model—one parameter per data point—you’re drawing the response line through all the data points. This tells you nothing. • The maximal model—containing all factors, interactions, and covariates of interest that can be fit given the available data. (You need at least three data points for fitting each covariate in your model.) • The current model, usually smaller than the maximal model. • The minimum adequate model—smaller than the maximal model, but not significantly smaller. • The null model—one parameter, the overall mean response, ymean. (y~1)
Definitions • Covariate—an explanatory variable that is possibly predictive of the outcome under study. • Factor—a covariate that takes a finite number of values, in no specific order. A boolean value is a factor. • Continuous explanatory variable—a numerical covariate. It may be restricted to integer values, to represent an ordering. • Interaction—a covariate that involves more than one explanatory variable. • Power—a covariate that involves a polynomial of degree 2 or greater in one or more continuous explanatory variables.
General Process of Fitting • Fit the maximal model. Note down the residual deviance. Possibly check for overdispersion (advanced topic). • Begin model simplification. Use update -, to remove the least significant terms first. Start with the highest-order interactions and powers. • If the resulting increase in deviance is not significant, use the reduced model and continue • If the increase in deviance is significant, go back to the unreduced model and look further. • Repeat steps 3 and 4 until only significant terms remain.
Parsimony Suggests • Less parameters • Less explanatory variables • A linear model • A model without a hump (a power > 1) • A model without interactions • A model with easily measured variables • A model that reflects how the process operates.
Actions you can take • Remove non-significant interactions, higher order terms, explanatory variables • Group together factor levels (advanced topic) • In ANCOVA (mixed models with continuous variables and factors), set slopes to zero if possible • Rescale (advanced) if necessary to give • constancy of variance • approximately normal errors • additivity • Don't go to extremes, though. “If you torture the data, it will confess.”
Model Formulae • response ~ explanatory variables • The right hand side describes the variables, their interactions, and their non-linearity—it isn’t arithmetic! • To include something in the model: + something • To remove something from the model: - something • Interactions are written * (or A:B) • y ~ A + B + A:B is the same as y~A*B) • Nesting: / (A/B is A+A:B or A+B%in%A) • y ~ A/B • Conditioning is written | • y ~ x | z
Expanded Forms • A*B*C is all the interactions up to A:B:C • A/B/C is A+B%in%A+C%in%B%in%A • (A+B+C)^3 is A*B*C • (A+B+C)^2 is A*B*C - A:B:C • poly(x,n) is a polynomial regression of degree n • I(formula) means the formula ‘as written’ in R. • 1 labels the intercept • Error(A/B/C) can be specified
Use of update • “model <- lm(y~A*B)” • “model2 <- update(model, ~.-A:B)” to get rid of the interaction. • model2 is now lm(y~A+B)
Transforms (Advanced) • Sometimes you need to directly transform the left hand side to get constant variance. This looks like: • lm(log(y)~ I(1/x) + sqrt(z)) • If the left hand side has constant variance already, you can use a ‘link’, defined by family=something in the model formula, and a general linear model (glm) • Families available: • Normal (bell-shaped curve, default) • poisson (count data) • binomial (proportions or binary data) • Gamma (special)
Factor/Parameter Transforms • Used to make the shape of variables closer to ‘normal’, and with constant variance. • More advanced topic, will be mentioned in the summary lecture. If it turns out you need to do this, I’ll provide consulting support (X3227).
Types of Models • lm(y~x)—linear model (x is a continuous variable) • aov(y~x)—analysis of variance (x is a factor) • aov(y~x+z)—analysis of covariance if x and z include a factor and a continuous explanatory variable. • glm(y~x)—general linear model (like lm but advanced) • Non-constant variance • Non-normal errors • Options include family=poisson, binomial, Gamma, Normal • gam(y~x)—generalised additive model (complex, advanced) • also lme(y~x), nls(y~x), nlme(y~x), loess(y~x), tree(y~x)
Modelling Demonstration • A demonstration will be given in the analysis of covariance slides.
Model Checking (demoed later) • Plot the residuals against • fitted values • explanatory values • sequence of data collection • standard normal deviates • Demo of mcheck
mcheck mcheck <- function (obj, ... ) { rs<-obj$resid fv<-obj$fitted par(mfrow=c(1,2)) plot(fv,rs,xlab="Fitted values",ylab="Residuals") abline(h=0, lty=2) qqnorm(rs,xlab="Normal scores",ylab="Ordered residuals") qqline(rs,lty=2) par(mfrow=c(1,1)) invisible(NULL) }
Observations • Order of factor deletion will matter • Delete the high order interactions (A:B), and high-order terms (I(x^2)) first. • Then delete the remaining terms in decreasing order of importance. • The test you use is anova(), because you’re comparing two models. • Be pragmatic.