Statistical (Regression) Modelling: From SLR through NLR to GLMs Andrew Mead (School of Life Sciences)
Statistical (Regression) Modelling • Regression modelling provides a range of techniques to relate a response variable to one or more explanatory variables • Why fit regression models? • To describe an observed response and assess a possible causal relationship • How well does the model describe the data? • To investigate and test hypothetical mathematical models • Does the model fit the observed data? • To increase our understanding of some underlying process • What do the model form and parameter values tell us? • To predict the values of one variable from those of one or more other variables
The Modelling Process (from Gilchrist "Statistical Modelling" (1984)) • Five major stages • Identification • Finding or choosing an appropriate model (conceptual, empirical or somewhere in-between) • Estimation and fitting • Finding the parameters for the selected model that best fit your data • Validation • During development – model form; during testing – against new data; during application – that circumstances have not changed • Application • Important to relate aspects above to the ultimate purpose for which the model is required • Iteration • Continuous re-evaluation during all stages above
Models! • "All models are wrong, but some are useful" – George E.P. Box
Types of model • Empirical or Mechanistic • Describing the observed relationship OR based on a proposed or inferred mechanism • Deterministic or Stochastic • Gives definite predictions OR contains random elements • i.e. predicts mean value plus variability about the mean • Static or Dynamic • Constant over time OR behaviour of a system over time • Regression techniques can be useful for all types of model, though most usually for Stochastic, Empirical systems
General statistical modelling process • Choose the correct distribution for the response variable • If Normal – standard regression techniques • If non-Normal – generalised linear models • Choose an appropriate form of equation • Could be based on the observed shape of response • Could be based on knowledge of the underlying mechanism • Fit the equation to the data • Estimate the parameter values and get predictions • How well does the model fit? • Can the model be simplified? • Can the model be generalised?
Content • Simple Linear Regression • Extensions for multiple explanatory variables • Extensions for multilevel data • Non-linear Regression • Generalised Linear Models • Generalised Regression Modelling
Simple Linear Regression • Simplest model • Fitting a straight line to relate one quantitative variable to another • Inference that the response (Y) variable is dependent on the explanatory (X) variable • Suggesting some sort of causal relationship • Linear doesn't relate to the shape of the response • Used in the mathematical sense in how the response relates to the parameters • A quadratic relationship ($Y = \beta_0 + \beta_1 X + \beta_2 X^2$) is also a linear model • Simple does not mean easy! • Used to indicate one explanatory variable, to contrast with multiple linear regression (two or more explanatory variables)
Fitting a Simple Linear Regression • Method based on the "least squares principle" – just like ANOVA • Choose the line which minimises the sum of squared vertical deviations from the line • Explanatory (x) variable assumed to be measured without error • Sum of squared deviations can be expressed as $S = \sum_i (y_i - a - b x_i)^2$, where a and b are the intercept and slope of the fitted line • Parameters can be found algebraically by solving the set of (simultaneous) least squares equations (see the sketch below)
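To make the least-squares calculation concrete, here is a minimal sketch in Python (numpy only) of the closed-form solution; the toy data are invented purely for illustration.

```python
import numpy as np

# Illustrative data: x measured without error, y the observed response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Closed-form least-squares estimates: b = Sxy / Sxx, a = ybar - b * xbar
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```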
Testing the relationship • Is the relationship "real"? Is the slope (b) different from zero? • Compare the slope (b) estimate with the standard error of this estimate • Using a t-test • If the slope is sufficiently large we have evidence for a "real" relationship • Could also test the intercept (a) parameter • Against zero or some other value, though usually not of much interest! • Analysis of variance can also be used to summarise the model fit • Compare variation explained by the model with the residual (remaining) variation (the variability of the observations about the fitted line) • A "real" relationship if the variation explained by the line is sufficiently large • Other statistics (coefficient of determination (R²), adjusted R² (% variance accounted for)) are also useful summaries
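Continuing with the same toy data, the slope t-test, ANOVA F-statistic and R² statistics can all be read from a single fit; statsmodels is used here purely as a convenient choice, and any regression package reports the same quantities.

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

X = sm.add_constant(x)            # design matrix: intercept column plus x
fit = sm.OLS(y, X).fit()          # ordinary least squares fit

print(fit.summary())              # t-tests for a and b, ANOVA F-statistic, R^2 and adjusted R^2
```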
Assumptions • Similar to those for two-sample t-tests and ANOVA • Normality: the residuals are a random sample drawn from a Normal distribution • Homoscedasticity: the residuals have a constant variance across all values of the explanatory variable • Independence: the residuals are independent • PLUS • That the underlying relationship is linear! • Best to check these graphically • Particularly that the form of the model (a straight line) is appropriate for the data • Tests of Normality and homoscedasticity are generally not very powerful unless the sample size is large
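A minimal sketch of the graphical checks (matplotlib and statsmodels assumed available, toy data as before): a residual-versus-fitted plot for linearity and constant variance, and a Q-Q plot for Normality.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.scatter(fit.fittedvalues, fit.resid)   # curvature -> wrong model form; fanning -> heteroscedasticity
ax1.axhline(0.0, linestyle="--")
ax1.set_xlabel("Fitted values"); ax1.set_ylabel("Residuals")

sm.qqplot(fit.resid, line="s", ax=ax2)     # points far off the line suggest non-Normality
plt.tight_layout(); plt.show()
```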
Other issues • Leverage • Observations far away from the mean of the explanatory variable are highly influential • Think about a see-saw! • Design • Want values of explanatory variable across range of interest • Ideally plenty of observations towards the extremes • But also observations in the middle of the range (to test for linearity of the relationship) • Prediction • Fitted models can be used to predict the response for "new" values of the explanatory variable, with confidence intervals around these predictions (see the sketch below) • Be wary of extrapolation • Prediction outside the range of the observed data • Potential problems with the form of the relationship plus large standard errors
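A sketch of prediction with confidence intervals, again using statsmodels as a convenient choice; note the new x values are deliberately kept inside the observed range to avoid the extrapolation problems noted above.

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
fit = sm.OLS(y, sm.add_constant(x)).fit()

x_new = np.array([2.5, 5.5])                   # stay inside the observed range!
pred = fit.get_prediction(sm.add_constant(x_new))
print(pred.summary_frame(alpha=0.05))          # fitted mean, SE, 95% confidence and prediction intervals
```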
Additional explanatory variables • Real data sets are usually more complicated! • Additional qualitative variables (factors) • Do the parameter values change with the levels of the qualitative variables? • Additional quantitative variables • Which of the quantitative variables are important in explaining the variability in the response variable? • Potentially including interactions between variables (effect of one variable depends on value of another) • Is the shape of the relationship more complex? • Consider polynomial models • Just an additional quantitative variable, but with a structure relating the different quantitative variables
Linear Regression with Groups • Comparison of regression lines for independent data sets, or for data at different levels of a qualitative factor • Are the relationships the same? • Can the values of one or more of the parameters be constrained to be the same? • While still providing a good fit to the data! • Also known as "analysis of parallelism" • Aim: to find the most parsimonious model • Provides a good description of the observed data • Uses as few parameters as possible • As simple a model as possible! • Four possible models: single line, parallel lines, coincident lines, separate lines
Comparison of models • Two sequences of models • Single line < Parallel lines (intercept parameters differ) < Separate lines (all parameters differ) • Single line < Coincident lines (slope parameters differ) < Separate lines (all parameters differ) • More complex model is justified if a significantly large additional proportion of the variance is explained • Various criteria for assessing this (see the sketch below) • Commonly consider F-test for additional variance explained relative to residual variance • Equivalent to a t-test for differences between the parameter values (if only two levels) • Adjusted R² statistic will increase if additional complexity is justified • Statistic adjusts for additional model degrees of freedom • Also could consider Akaike Information Criterion (AIC) and related statistics
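One way to run these comparisons, sketched with invented two-group data in statsmodels; anova_lm reports the F-test for the extra variance explained at each step up in complexity.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Invented data: response y, quantitative x, two-level factor 'group'
df = pd.DataFrame({
    "x":     [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "y":     [2.0, 3.1, 4.2, 4.9, 6.1, 3.2, 4.1, 5.3, 6.0, 7.2],
    "group": ["A"] * 5 + ["B"] * 5,
})

single   = smf.ols("y ~ x", df).fit()            # one common line
parallel = smf.ols("y ~ x + group", df).fit()    # common slope, separate intercepts
separate = smf.ols("y ~ x * group", df).fit()    # separate slopes and intercepts

# F-tests for the extra variance explained at each step up in complexity
print(anova_lm(single, parallel, separate))
```

The other sequence (single < coincident < separate) would replace the parallel model with `smf.ols("y ~ x + x:group", df)`, i.e. a common intercept with separate slopes.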
Multiple Linear Regression • Multiple continuous explanatory variables • Which subset provides a good description of the observed response? • Could include interaction terms, assessing the impact of combinations of variables (obtained via multiplication) • Could consider all possible models (as for Linear Regression with Groups) • Potentially large number of models with even relatively few potential explanatory variables • E.g. 64 possible models with only 6 explanatory variables • Modern computer packages make this possible, with appropriate summaries of the best models for different numbers of parameters • Variable selection methods provide an alternative approach
Variable selection methods • Two stepwise approaches (see the sketch below) • Forward selection • Start with null model (no variables included) • Does adding any term (variable) improve the model fit? • Need some criterion to assess this – variance ratio, significance of F-test, AIC, … • Add the "best" term (variable), the one providing the greatest improvement, to the model • Repeat until no potential terms (variables) satisfy the above criterion • Backward elimination • Start with full model (all variables included) • Can any term (variable) be removed without making the overall fit worse? • Again, need some criterion to assess this • Remove the variable that has the least impact • Repeat until no further terms (variables) can be removed without making the overall fit worse
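A minimal forward-selection sketch; the entry criterion (p-value of the F-test for the added term) and the 0.05 threshold are illustrative choices, and the helper function, column names and simulated data are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def forward_select(df, response, candidates, alpha=0.05):
    """Greedy forward selection: add the term whose F-test is most significant."""
    selected = []
    current = smf.ols(f"{response} ~ 1", df).fit()          # start with the null model
    while candidates:
        pvals = {}
        for term in candidates:
            trial = smf.ols(f"{response} ~ {' + '.join(selected + [term])}", df).fit()
            pvals[term] = anova_lm(current, trial)["Pr(>F)"].iloc[1]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                                           # no candidate improves the fit enough
        selected.append(best)
        candidates.remove(best)
        current = smf.ols(f"{response} ~ {' + '.join(selected)}", df).fit()
    return current

# Illustrative use on simulated data: only x1 and x2 truly affect y
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(scale=0.5, size=50)
print(forward_select(df, "y", ["x1", "x2", "x3"]).model.formula)
```

Backward elimination would run the same loop in reverse, starting from the full model and removing the least significant term at each step.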
Interpretation of MLRs • Using a variable selection approach is useful in finding a reasonably parsimonious model • But the final models will probably be different depending on which method and which criteria are used • Need to refit the final model and estimate all the parameters • Interpretation of these relative to their standard errors is necessary to determine the impact each has on the response • Also check assumptions (just as above) and plot predicted values against observed values to check that the form of the model is appropriate • Parameters (slopes) indicate how the response changes with each of the explanatory variables, BUT these effects are adjusted for the effects of the other variables in the model • So interpretation is often not straightforward!
Multilevel models • Statistical models for data collected within a hierarchical structure, so that different parameters are associated with different levels of the hierarchy • Also known as hierarchical linear models, nested models, mixed models, random coefficient models, random-effect models, random parameter models, split-unit designs • In designed experiments • Split-unit designs – levels of one treatment factor applied to large experimental units (e.g. groups of plants, animals, etc.) and levels of another applied to smaller experimental units (e.g. individual plants, animals, etc.) • Information about treatment effects is estimated at different levels of the design, and compared with estimates of the residual variation at those different levels
Multilevel models in regression • In a regression context multilevel models are appropriate where observational units are organised at more than one level • E.g. in educational studies there will be individuals within classes, with classes within schools, and schools within administrative areas • Models for the responses of individuals might include parameters for each individual, with further parameters included at the class, school and administrative area level • E.g. we fit a regression equation for each individual (or maybe across individuals), and then model these parameters as a function of variables measured at the class, school or administrative area level • At each level there is an error component • Ideally, we should be able to include all these components in one model fit (see the sketch below) – we will return to a related example at the end of the seminar
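A sketch of a simple two-level model, fitted here with statsmodels' MixedLM as one convenient option: a random intercept for each school provides the school-level error component alongside the individual-level residual. The data are simulated purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented two-level data: 20 pupils in each of 5 schools
rng = np.random.default_rng(2)
schools = np.repeat(np.arange(5), 20)
school_effect = rng.normal(scale=1.0, size=5)[schools]   # level-2 error component
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + school_effect + rng.normal(scale=0.5, size=100)
df = pd.DataFrame({"y": y, "x": x, "school": schools})

# Random intercept per school; random slopes could be added via re_formula="~x"
fit = smf.mixedlm("y ~ x", df, groups=df["school"]).fit()
print(fit.summary())
```

Variables measured at the school level would simply enter the fixed-effects formula alongside x, which is exactly the "one model fit" idea described above.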
Non-linear Models • Linear models dominated statistical modelling before the age of powerful computers • Because the calculation of parameters and other statistics was algebraic • Relatively few real relationships can be described by linear models • Polynomial functions are too restrictive – they are certainly not realistic for real-life responses outside of narrow ranges – good for description but not for prediction or understanding • A range of non-linear models can be used to describe many real relationships • Exponential growth or decay, sigmoidal growth curves, hyperbolae (e.g. Michaelis-Menten), Fourier curves (cyclic behaviour), Gaussian curves (distributions of responses)
Fitting non-linear curves • No algebraic solution – iterative approach needed • Search algorithms needed – e.g. Gauss-Newton minimisation or the Nelder-Mead simplex method used to find the maximum likelihood solutions • Maximum likelihood – the parameter values under which the observed data are most likely • One common scheme: fix the linear parameters and re-estimate the non-linear parameters • Then fix the non-linear parameters and re-estimate the linear parameters • Repeat until no improvement in likelihood (related to residual variability for Normally distributed data) • Good initial "guesstimates" of parameters are needed (see the sketch below) • Possible for the process to fail to converge if initial values are poor
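A sketch of an iterative non-linear fit using scipy.optimize.curve_fit (a least-squares search, used here as a stand-in for the algorithms named above); the exponential-decay form, the simulated data and the starting values p0 are all illustrative, and poor starting values can indeed prevent convergence.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(x, a, b, c):
    """y = a * exp(-b * x) + c : a and c enter linearly, b non-linearly."""
    return a * np.exp(-b * x) + c

# Illustrative noisy data generated from a known curve
rng = np.random.default_rng(3)
x = np.linspace(0, 5, 40)
y = exp_decay(x, 2.5, 1.3, 0.5) + rng.normal(scale=0.05, size=x.size)

# Initial "guesstimates" matter: start near plausible values
popt, pcov = curve_fit(exp_decay, x, y, p0=[2.0, 1.0, 0.0])
print("estimates:", popt)
print("standard errors:", np.sqrt(np.diag(pcov)))
```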
Testing, Interpretation and Extensions • Presentation of results similar to that for linear regression • Parameters can now often be interpreted in terms of the underlying process • E.g. for the logistic curve (shown below) – rate of growth, timing of growth, maximum and minimum values • Use ANOVA and other summary statistics to assess how well the model fits • Compare models for different data sets (levels of a qualitative factor) • Constrain different (sets of) parameters and compare nested models • Combine models for multiple explanatory variables • E.g. by modelling parameters for a first explanatory variable in terms of the levels of a second explanatory variable
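For reference, one common parameterisation of the logistic curve mentioned above (several equivalent forms exist), showing how each parameter maps onto a feature of the process:

$$y = c + \frac{a - c}{1 + e^{-b(x - m)}}$$

Here $a$ and $c$ are the maximum and minimum values (upper and lower asymptotes), $b$ governs the rate of growth, and $m$ is the time of most rapid growth.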
The Normality Assumption • All regression modelling approaches so far have assumed Normally distributed and homogeneous residuals • But some types of data do not satisfy these assumptions • E.g. count data (may follow a Poisson distribution) and proportions based on counts (may follow a Binomial distribution) • The traditional approach, e.g. for designed experiments, is to transform these data to satisfy these assumptions • Preference now is for modelling the data on the original measurement scale rather than modelling the transformed data • Sometimes a transformation can lead to nonsensical values over some portion of the "design space" • Fitted equations can be very difficult to interpret when the equations are "back-transformed" to the original scale
Generalised Linear Models (GLMs) • A GLM is basically a regression model • Where we assume Normal errors, we write the response as a function of the explanatory variables: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$ • and the error term is assumed to be Normal: $\varepsilon \sim N(0, \sigma^2)$ • Can also write the equation in terms of the mean of the responses: $E(y) = \mu = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$ • The right-hand side of the equation is referred to as the linear predictor and is often written in vector notation as $\eta = \mathbf{x}^{T}\boldsymbol{\beta}$
Exponential family and link functions • In a GLM the response can have any distribution from the exponential family • Includes the Normal, Poisson, Binomial and Gamma distributions • The relationship between the response mean and the linear predictor is determined by a link function $g$: $g(\mu) = \eta = \mathbf{x}^{T}\boldsymbol{\beta}$ • The regression model representing the mean response is then given by $\mu = g^{-1}(\mathbf{x}^{T}\boldsymbol{\beta})$
Identity, log and logit links • In the standard regression model (i.e. for Normal errors) we have the identity link: $g(\mu) = \mu$ • Another example is the log link: $g(\mu) = \log(\mu)$ • Which is often used with count data (Poisson responses) and with continuous data with long tails • Another important link function is the logit link, $g(\pi) = \log\left(\pi / (1 - \pi)\right)$, which is used with Binomial data • Where $\pi$ is usually the probability of response • This leads to the model $\pi = e^{\eta} / (1 + e^{\eta})$
Probit and Logit models for proportions based on counts (Binomial) • Original development for bioassay data • Proportion of insects killed by different pesticide doses • Link functions related to the distribution of individual tolerances of insects to the pesticide • Probit function assumes that the individual tolerances follow a Normal distribution • Straight line model within the link function corresponds to a sigmoidal response on the proportion scale • Parameters relate to the dose required to kill 50% (ED50 – essentially the intercept parameter) and the variability of the tolerance distribution (slope parameter) • Idea extended for other link functions, and for other, more complicated models within the link function
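A sketch of a probit fit to invented dose-mortality data (statsmodels, with the capitalised link class used in recent versions of the library); the ED50 is recovered as the log-dose at which the fitted line crosses zero, i.e. 50% kill.

```python
import numpy as np
import statsmodels.api as sm

# Invented bioassay data: insects killed out of n at each log10(dose)
log_dose = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
n        = np.array([50, 50, 50, 50, 50])
killed   = np.array([4, 12, 26, 40, 48])

X = sm.add_constant(log_dose)
resp = np.column_stack([killed, n - killed])          # (successes, failures)
fit = sm.GLM(resp, X,
             family=sm.families.Binomial(link=sm.families.links.Probit())).fit()

a, b = fit.params
print(f"ED50 (log10 dose) = {-a / b:.3f}")            # linear predictor = 0 at 50% kill
print(f"tolerance SD on log scale = {1 / b:.3f}")     # slope relates to 1/SD of tolerances
```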
Log-linear models for count data (Poisson) • Log link function turns an additive model into a multiplicative model (see below) • A natural scale for the effects on counts • Treatment effects are constant ratios • A large initial count will have the same proportional reduction, and therefore a larger absolute reduction • Approach also applied to multiway tables of counts (multinomial data) • Extension of the Chi-squared test for association in a two-way contingency table • Allows assessment of the impact of different explanatory factors on the distribution of frequencies for some response variable
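The additive-to-multiplicative point in symbols: exponentiating the log-linear model turns summed effects into multiplied ratios, so a coefficient $\beta_1$ corresponds to a constant rate ratio $e^{\beta_1}$:

$$\log \mu = \beta_0 + \beta_1 x \quad \Longrightarrow \quad \mu = e^{\beta_0}\,\left(e^{\beta_1}\right)^{x}$$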
Fitting, Testing, Interpretation and Extensions • Fitting uses an iterative approach • Iteratively reweighted least squares • Assessment of model fit using "analysis of deviance" (see the sketch below) • Essentially the same as ANOVA, but modified to account for the underlying distribution • Parameters estimated on the scale defined by the link function, along with standard errors • Test for differences from zero (just as for linear regression) • Usually calculate predictions on the original measurement scale rather than on the scale defined by the link function • Can extend to include a range of linear models within the link function • Also extend to a non-linear version, with additional parameters outside of the link function • E.g. extension of the probit bioassay model to allow for control mortality (insects that die at zero dose) and natural immunity (insects that do not die at high doses) • Final example to illustrate a combination of modelling approaches
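A sketch of an analysis of deviance for a Poisson log-linear model (counts and factor names invented); the drop in deviance when a term is added is referred to a chi-squared distribution, playing the role the F-test plays in ANOVA.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# Invented count data with a two-level treatment factor
df = pd.DataFrame({
    "count":     [12, 15, 9, 14, 30, 26, 35, 28],
    "treatment": ["control"] * 4 + ["treated"] * 4,
})

null = smf.glm("count ~ 1", df, family=sm.families.Poisson()).fit()
full = smf.glm("count ~ treatment", df, family=sm.families.Poisson()).fit()

dev_drop = null.deviance - full.deviance       # deviance explained by treatment (1 df)
p_value = stats.chi2.sf(dev_drop, 1)
print(f"deviance drop = {dev_drop:.2f}, p = {p_value:.4f}")

# On the log scale effects are additive; back-transformed they are rate ratios
print("rate ratio:", np.exp(full.params["treatment[T.treated]"]))
```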
Generalised Regression Modelling • Study of the long-term viability of stored seeds under a range of storage conditions • Important for determining how long seeds can be stored before the seed population needs to be re-generated, i.e. in germplasm seed banks • Batches of seeds stored for various periods of time at specified temperatures and moisture contents • Seeds germinated under optimal conditions on retrieval from storage, and number of germinated seeds recorded • Data are Binomial counts • Number germinated out of the number tested
Original model (Ellis & Roberts, 1980) • Assumed that the frequency of seed deaths in time under constant storage conditions followed a Normal distribution • So a Probit model might be appropriate to describe the loss of viability • Further assumptions • Initial viability was independent of storage conditions • Rate of loss of viability depended on storage temperature and moisture content • Initial modelling used probit models to estimate the slope (rate of loss of viability) and intercept (initial viability) for each set of storage conditions • Second analysis step used multiple linear regression to relate the slope parameters to storage temperature and moisture content for each fitted model
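In symbols, the two-step structure just described (the notation here is schematic rather than Ellis & Roberts' exact symbols): a probit line in storage period $p$ for each storage condition, with the rate parameter then related to the storage environment:

$$\text{probit}(v) = K_i - \frac{p}{\sigma(t, m)}$$

where $v$ is the proportion of seeds viable, $K_i$ the initial viability (the intercept), and $1/\sigma$ the rate of loss of viability (the slope), with $\log \sigma$ modelled in the second step by a multiple linear regression in storage temperature $t$ and moisture content $m$.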
New model (Mead & Gray, 1999) • Modified model to • Add a "control viability" parameter to model the maximum viability of the seed lot • Incorporate the multiple linear regression model for the slope of the Probit curve into a "one-step" analysis process • Allows data from multiple storage conditions to be analysed together • Parameters for the effects of temperature and moisture estimated to provide the best fitting probit curves across all temperatures and moisture contents • Essentially a multilevel model with a multiple regression model within a non-linear GLM with Binomial error structure and Probit link function
Summary • Linear Regression models provide a convenient framework for exploring simple relationships between a single response variable and one or more explanatory variables (including both quantitative and qualitative variables) • Extensions to cope with multilevel organisational structures • Non-linear Regression models allow the fitting of more realistic functional shapes, and the fitting of conceptual models for a variety of functional shapes • Generalised Linear Models provide an extension to cope with data that are not Normally distributed with homogeneous errors • Combinations of these approaches provide a wide range of statistical modelling approaches capable of modelling data across a wide range of scenarios