Workshop in R & GLMs: #3 Diane Srivastava University of British Columbia srivast@zoology.ubc.ca
Housekeeping
• ls() lists the variables in the global environment
• rm(list=ls()) gets rid of EVERY variable
• q() quits, with a prompt to save the workspace or not
Janka exercise
Conclusion: the best y transformation for optimizing the model fit (highest log likelihood)…
…is not the best y transformation for normal residuals.
This workshop
• Linear, general linear, and generalized linear models
• Understand how GLMs work [Excel simulation]
• Definitions: e.g. deviance, link functions
• Poisson GLMs [R exercise]
• Binomial distribution and logistic regression
• Fit GLMs in R! [Exercise]
In the beginning there were…
Linear models: a normally-distributed y fit to a continuous x.
But wait…couldn’t we just code a categorical variable to be continuous?

Y     x
1.2   0
1.3   0
1.1   1
0.9   1
Then there were…
General Linear Models: a normally-distributed y fit to a continuous OR categorical x.
But wait…why do we force our data to be normal when often it isn’t?
Generalized linear models
• Because most things in life aren’t normal!
• No more need for tedious transformations!
• All variances are unequal, but some are more unequal than others… distribution solution!
• Proud to be Poisson!
What linear models do:
• Transform y
• Fit a line to the transformed y
• Back-transform to linear y
[Plots: fitted line on log y vs x; back-transformed curve on y vs x]
What GLMs do:
• Start with an arbitrary fitted line
• Back-transform the line into linear space
• Calculate residuals
• Improve the fitted line to maximize likelihood
(Many iterations)
[Plots: fitted line on log fitted values vs x; back-transformed curve on y vs x]
Maximum likelihood
• Means that an iterative process is used to find the model equation that has the highest probability (likelihood) of explaining the y values given the x values.
• The equation for the likelihood depends on the error distribution chosen.
• Least squares, by contrast, minimizes the squared deviations from the model.
• If the data are normally distributed, maximum likelihood gives the same answer as least squares.
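To see that last point, here is a minimal sketch with made-up data (not from the workshop files): a gaussian GLM recovers exactly the same coefficients as least squares via lm().

set.seed(42)                  # made-up illustrative data
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
coef(lm(y ~ x))                                  # least squares
coef(glm(y ~ x, family=gaussian(link=identity))) # maximum likelihood: same answer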
GLM simulation exercise
• Simulates fitting a model with normal errors and a log link to data.
• Your task:
• understand how the spreadsheet works
• find, through an iterative process, the best slope (the one with the highest likelihood), as in the sketch below.
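If you would rather mirror the spreadsheet in R, here is a minimal sketch, with made-up data and the intercept fixed at zero (both assumptions, for illustration only): it evaluates the normal log-likelihood of a log-link model over a grid of candidate slopes and reports the best one.

set.seed(1)                          # made-up illustrative data
x <- 1:30
y <- exp(0.1 * x) + rnorm(30, sd=0.5)
loglik <- function(slope) {
  fitted <- exp(slope * x)           # back-transform the line into linear space
  res <- y - fitted                  # residuals in linear space
  sum(dnorm(res, sd=sd(res), log=TRUE))  # normal log-likelihood
}
slopes <- seq(0.05, 0.15, by=0.001)
slopes[which.max(sapply(slopes, loglik))]  # slope with the highest likelihood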
Generalized linear models
In least squares, we fit: y = mx + b + error
In GLMs, the model is fit more indirectly: y = g(mx + b) + error, where g is a function whose inverse is called the “link function”:
link(expected y) = mx + b
Note the error is not transformed: it enters on the scale of y, through the error distribution you choose.
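In R, make.link() bundles a link function and its inverse; a quick sketch with the log link:

lnk <- make.link("log")
lnk$linkfun(10)        # the link function: log(10)
lnk$linkinv(log(10))   # the inverse link g: back to 10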
LMs vs GLMs
• LMs use least squares; GLMs use maximum likelihood
• LMs assume normality; GLMs let you specify one of several distributions
• LMs are based on sums of squares; GLMs are based on deviance
• LMs fit the model to transformed y; GLMs fit the model to untransformed y by means of a link function
All that really matters…
• Because a log link transforms the fitted values rather than the data, we never need to calculate log(0).
• Be careful! A log-link model predicts log y, not y! (See the prediction example after the exercise below.)
• The error distribution need not be normal: Poisson, binomial, gamma, Gaussian (= normal)
Exercise
1. Open up the file Rlecture.csv:
diane <- read.table(file.choose(), sep=",", header=TRUE)
2. Look at the dataframe. Make treat a factor:
diane$treat <- factor(diane$treat)
3. Fit this model:
my.first.glm <- glm(growth ~ size*treat, family=poisson(link=log), data=diane)
summary(my.first.glm)
4. Model diagnostics:
par(mfrow=c(2,2)); plot(my.first.glm)
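To see the earlier warning in action (a log-link model predicts log y, not y), compare the two prediction scales of the model you just fit:

head(predict(my.first.glm, type="link"))      # predictions of log(growth)
head(predict(my.first.glm, type="response"))  # back-transformed: exp() of the above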
Overdispersion
[Residual plots contrasting underdispersed, overdispersed, and random scatter]
Overdispersion
Is your residual deviance ≈ residual df?
If residual deviance >> residual df: overdispersed.
If residual deviance << residual df: underdispersed.
Solution:
second.glm <- glm(growth ~ size*treat, family=quasipoisson(link=log), data=diane)
summary(second.glm)
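A quick way to run this check on the model from the exercise:

deviance(my.first.glm) / df.residual(my.first.glm)  # ratio ≈ 1 is fine; >> 1 means overdispersed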
Options
family     default link   other links
binomial   logit          probit, cloglog
gaussian   identity       --
Gamma      inverse        identity, log
poisson    log            identity, sqrt
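Non-default links go inside the family function. For example, a hypothetical probit model for the workshop data:

probit.glm <- glm(survival ~ size, family=binomial(link=probit), data=diane)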
Binomial errors
• Variance gets constrained near the limits; the binomial distribution accounts for this.
• Type 1: the classic example: a series of trials, each resulting in success (value = 1) or failure (value = 0).
• Type 2: proportion data: continuous but bounded (e.g. % mortality, bounded between 0% and 100%).
Logistic regression
• Least squares approach: arcsine transformations
• GLM approach: use a logit (or probit) link with binomial errors
[Plot: S-shaped curve of y against x]
Logit
p = proportion of successes
If p = e^(a+bx) / (1 + e^(a+bx)), calculate:
log_e(p / (1 - p))
(It simplifies to a + bx.)
Logits continued
Output from a logistic regression with a logit link:
predicted log_e(p / (1 - p)) = a + bx
To obtain any expected value of p, plug a and b back into the original equation:
p = e^(a+bx) / (1 + e^(a+bx))
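In R, this back-transformation is the inverse logit, built in as plogis(); a sketch with hypothetical coefficients a and b:

a <- -2; b <- 0.5                   # hypothetical coefficients from a logistic fit
exp(a + b*3) / (1 + exp(a + b*3))   # p at x = 3, from the equation above
plogis(a + b*3)                     # the same, using R's built-in inverse logit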
Binomial GLMs
Type 1 binomial
• Simply set family = binomial(link=logit).
Type 2 binomial
• First create a vector of the number NOT parasitized.
• Then “cbind” into a two-column matrix (number parasitized, number not parasitized).
• Then run your binomial glm (link = logit) with the matrix as your y, as sketched below.
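A minimal sketch of the type-2 setup, assuming hypothetical column names para (number parasitized) and total (number of hosts) in the workshop dataframe; note that R expects the two columns to be counts, not percentages:

not.para <- diane$total - diane$para   # hypothetical columns: counts of failures
y <- cbind(diane$para, not.para)       # (parasitized, not parasitized)
type2.glm <- glm(y ~ size*treat, family=binomial(link=logit), data=diane)
summary(type2.glm)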
Homework
1. Fit the binomial glm: survival = size*treat
2. Fit the binomial glm: parasitism = size*treat
3. Predict what size has 50% parasitism in treatment “0”