Summary of Remainder Harry R. Erwin, PhD School of Computing and Technology University of Sunderland
Resources • Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley. • Freund, RJ, and WJ Wilson (1998) Regression Analysis. Academic Press. • Gentle, JE (2002) Elements of Computational Statistics. Springer. • Gonick, L, and W Smith (1993) The Cartoon Guide to Statistics. HarperResource (for fun).
Topics • Multiple Regression • Contrasts • Count Data • Proportion Data • Survival Data • Binary Response • Course Summary
Multiple Regression • Two or more continuous explanatory variables • Your problems are not limited to choosing the order of the model: you often lack enough data to examine all the potential interactions and higher-order effects. • Exploring a third-order interaction term with three explanatory variables (A:B:C) requires about 3 × 8 = 24 data values. • If there's potential for curvature, you need 3 × 3 = 9 more data values to pin that down. • Be selective: if you include an interaction term, you must also include all the lower-order interactions and the individual explanatory variables within it.
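A minimal sketch of this kind of model simplification in R, assuming a hypothetical data frame d with response y and continuous explanatory variables A, B, and C:

full <- lm(y ~ A * B * C, data = d)      # all main effects and interactions up to A:B:C
simpler <- update(full, . ~ . - A:B:C)   # drop the three-way interaction term
anova(simpler, full)                     # F-test: does A:B:C earn its keep?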
Issues to Consider • Which explanatory variables to include. • Curvature in the response to explanatory variables. • Interactions between explanatory variables. (High order interactions tend to be rare.) • Correlation between explanatory variables. • Over-parameterization. (Avoid!)
Contrasts • Contrasts are the basis of hypothesis testing and model simplification in ANOVA • When you have more than two levels in a categorical variable, you need to know which levels are meaningful and which can be combined. • Sometimes you know which ones to combine and sometimes not. • First do the basic ANOVA to determine whether there are significant differences to be investigated.
Model Reduction in ANOVA • In ANOVA, you reduce a model mainly by combining factor levels. • Define your contrasts based on the science: • Treatment versus control • Similar treatments versus other treatments • Treatment differences within similar treatments • You can also aggregate factor levels in steps. • See me if you need to do this; R can automate the process.
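A minimal sketch of user-defined contrasts in R, assuming a hypothetical data frame d with a response and a three-level factor treatment (levels control, drugA, drugB):

contrasts(d$treatment) <- cbind(c(-2, 1, 1),    # control versus the two treatments
                                c( 0, -1, 1))   # drugA versus drugB
model <- aov(response ~ treatment, data = d)
summary.lm(model)                               # one t-test per contrast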
Count Data • With frequency data, we know how often something happened, but not how often it didn’t happen. • Linear regression assumes constant variance and normal errors. This is not appropriate for count data: • Counts are non-negative. • Response variance usually increases with the mean. • Errors are not normally distributed. • Zeros are hard to transform.
Handling Count Data in R • Use a glm model with family=poisson. • This sets errors to Poisson, so variance is proportional to the mean. • This sets the link to log, so fitted values are positive. • If you have overdispersion (residual deviance greater than residual degrees of freedom), use family=quasipoisson instead.
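A minimal sketch, assuming a hypothetical data frame d with a count response count and a factor treatment:

model <- glm(count ~ treatment, family = poisson, data = d)
summary(model)
deviance(model); df.residual(model)             # overdispersed if deviance >> df
model2 <- update(model, family = quasipoisson)  # refit if overdispersed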
Contingency Tables • There is a risk in aggregating data over important explanatory variables (nuisance variables). • So check the significance of the real part of the model before you eliminate the nuisance variables.
Frequencies and Proportions • With frequency data, you know how often something happened, but not how often it didn’t happen. • With proportion data, you know both. • Applied to: • Mortality and infection rates • Response to clinical treatment • Voting • Sex ratios • Proportional response to experimental treatments
Working With Proportions • Traditionally, proportion data were modelled using the percentage as the response variable. • This is bad for four reasons: • Errors are not normally distributed. • The variance is not constant. • The response is bounded by 0.0 and 1.0. • The size of the sample, n, is lost.
Testing Proportions • To compare a single binomial proportion to a constant, use binom.test. • y<-c(15,5) • binom.test(y) # tests against p = 0.5, the default • y<-c(14,6) • binom.test(y) • To compare two samples, use prop.test. • prop.test(c(14,6),c(20,20)) # 14 and 6 successes out of 20 trials each • Only use glm methods for complex models: • Regression tables • Contingency tables
GLM Models for Proportions • Start with a generalized linear model (glm). • family = binomial (i.e., an unfair coin flip) • Use two vectors, one of the success counts and the other of the failure counts. • number of failures + number of successes = binomial denominator, n • y<-cbind(successes, failures) • model<-glm(y~whatever,binomial)
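A minimal worked sketch with invented dose-response counts (all names and numbers are illustrative only):

successes <- c(18, 14, 9, 4)
failures  <- c(2, 6, 11, 16)
dose <- c(1, 2, 4, 8)
y <- cbind(successes, failures)     # two-column response: successes and failures
model <- glm(y ~ log(dose), family = binomial)
summary(model)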
How R Handles Proportions • Weighted regression (weighted by the individual sample sizes). • logit link to ensure linearity • If percentage cover data (e.g., survey data) • Do an arc-sine transformation, followed by conventional modelling (normal errors, constant variance). • If percentage change in a continuous measurement (e.g. growth) • ANCOVA with final weight as the response and initial weight as a covariate, or • Use the relative growth rate (log(final/initial)) as response. • Both produce normal errors.
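Minimal sketches of the two special cases, assuming a hypothetical data frame d with columns cover (percentage cover), initial and final (weights), and a factor treatment:

d$acover <- asin(sqrt(d$cover / 100))                  # arc-sine transform of proportion cover
m.cover  <- lm(acover ~ treatment, data = d)           # then conventional modelling
m.ancova <- lm(final ~ initial + treatment, data = d)  # ANCOVA for growth data
d$rgr <- log(d$final / d$initial)                      # relative growth rate
m.rgr <- lm(rgr ~ treatment, data = d)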
Count Data in Proportions • R supports the traditional arcsine and probit transformations: • arcsine makes the error distribution normal • probit linearises the relationship between percentage mortality and log(dose) • It is usually better to use the logit transformation and assume you have binomial data.
Death and Failure Data • Applications include: • Time to death • Time to failure • Time to event • This is a useful way to analyse performance when the process leading to a goal is complex, for example when a robot is performing a task.
Problems with Survival Data • Non-constant variance, so standard methods are inappropriate. • If errors are gamma distributed, the variance is proportional to the square of the mean. • Use a glm with Gamma errors.
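A minimal sketch, assuming a hypothetical data frame d with a positive time-to-event response time and a factor group (the log link, chosen here to keep fitted times positive, is an assumption):

model <- glm(time ~ group, family = Gamma(link = "log"), data = d)
summary(model)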
How do we deal with events that don’t happen during the study? • In those trials, we don’t know when the event would have occurred; we just know the time would be greater than the end of the trial. Those observations are censored. • The methods for handling censored data make up the field of survival analysis. • (I used survival analysis in my PhD work. My wife does survival analysis for cancer data.)
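A minimal sketch using R's survival package, assuming hypothetical columns time, status (1 = event observed, 0 = censored), and group in a data frame d:

library(survival)
fit <- survfit(Surv(time, status) ~ group, data = d)   # Kaplan-Meier survival curves
plot(fit)
survdiff(Surv(time, status) ~ group, data = d)         # log-rank test between groups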
Binary Response • Very common: • dead or alive • occupied or empty • male or female • employed or unemployed • Response variable is 0 or 1. • R assumes a binomial trial with sample size 1.
When to use Binary Response Data • Do a binary response analysis only when you have unique values of one or more explanatory variables for each and every possible individual case. • Otherwise lump: aggregate to the point where you have unique values. Either: • Analyse the data as a contingency table using Poisson errors, or • Decide which explanatory variable is key, express the data as proportions, recode as a count of a two-level factor, and assume binomial errors.
Modelling Binary Response • Single vector with the response variable • Use glm with family = binomial • Consider a complementary log-log link instead of the logit; use the one that gives the lower deviance. • Fit the usual way. • Test significance using χ².
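A minimal sketch comparing the two links, assuming a hypothetical data frame d with a 0/1 response y and a predictor x:

m.logit   <- glm(y ~ x, family = binomial, data = d)
m.cloglog <- glm(y ~ x, family = binomial(link = "cloglog"), data = d)
deviance(m.logit); deviance(m.cloglog)   # keep the link with the lower deviance
anova(m.logit, test = "Chisq")           # chi-squared tests of the model terms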
Course Summary • We’ve had an introduction to thinking critically about data. • We’ve seen how to use a typical statistical analysis system (R). • We’ve looked at our projects critically. • We’ve discussed hypothesis testing. • We’ve looked at statistical modelling.
Statistical Activities • Data collection (ideally the statistician has a say in how the data are collected) • Description of a dataset • Averages • Spreads • Extreme points • Inference within a model or collection of models • Model selection
Why Model? • Usually you do statistics to explore the structure of data. The questions you might ask are rather open-ended. Your understanding is facilitated by a model. • A model embodies what you currently know about the data. You can formulate it either as a data-generating process or a set of rules for processing the data.
Structure-in-the-data • Structure is what is of most interest, for example: • Modes • Gaps • Clusters • Symmetry • Shape • Deviations from normality • Plot the data to understand this.
Visualization • Multiple views are necessary. • Be able to zoom in on the data as a few points can obscure the interesting structure. • Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure. • Watch out for time-ordered or location-ordered data, particularly if time or location are not explicitly reported.
Plots • Use simple plots to start with. • Watch for rounded data—shown by horizontal strata in the data. That often signals other problems.
Bottom Line • I am available for consulting (free). • E-mail: harry.erwin@sunderland.ac.uk • Phone: 515-3227 or extension 3227 from university phones. • Plan on a meeting of about an hour to allow time to think intelligently about your data.