Proportion Data Harry R. Erwin, PhD School of Computing and Technology University of Sunderland
Resources • Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley. • Freund, RJ, and WJ Wilson (1998) Regression Analysis. Academic Press. • Gentle, JE (2002) Elements of Computational Statistics. Springer. • Gonick, L, and W Smith (1993) The Cartoon Guide to Statistics. HarperResource (for fun).
Introduction • These four demonstration sessions of this class address special types of data: • Counts • Proportions (this lecture) • Survival analysis • Binary responses
Frequencies and Proportions • With frequency data, we know how often something happened, but not how often it didn’t happen. • With proportion data, we know both. • Applied to: • Mortality and infection rates • Response to clinical treatment • Voting • Sex ratios • Proportional response to experimental treatments
Working With Proportions • Traditionally, proportion data were modelled by using the percentage as the response variable. • This is bad for four reasons: • The errors are not normally distributed. • The variance is not constant. • The response is bounded by 0.0 and 1.0. • The size of the sample, n, is lost.
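The non-constant variance and the loss of n can be seen from the binomial standard error of a proportion, sqrt(p(1-p)/n). The sketch below is only illustrative: the proportions and the two sample sizes are made-up values, not data from the lecture.

```r
# Hypothetical true proportions and two sample sizes (illustrative values only)
p <- c(0.05, 0.25, 0.50, 0.75, 0.95)
n_small <- 10
n_large <- 1000

# Standard error of an observed proportion: sqrt(p * (1 - p) / n)
se_small <- sqrt(p * (1 - p) / n_small)
se_large <- sqrt(p * (1 - p) / n_large)

# The spread changes with p (largest at 0.5) and with n,
# so a bare percentage hides how reliable each observation is.
print(data.frame(p, se_small, se_large))
```

The same observed percentage is far less reliable when it comes from 10 individuals than from 1000, which is the information that is thrown away when only the percentage is modelled.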
General Approach • Use a generalised linear model (glm). • family = binomial (i.e., unfair coin flip) • Uses two vectors, one of the success counts and the other of the failure counts. • number of failures + number of successes = binomial denominator, n • y <- cbind(successes, failures) • model <- glm(y ~ whatever, family = binomial)
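A minimal sketch of the two-vector approach, assuming a made-up dose-response data set. The variable names dose, n, successes and failures, and all the numbers, are hypothetical; log(dose) simply stands in for "whatever".

```r
# Hypothetical data: six groups of 40 individuals, one dose per group
dose      <- c(1, 2, 4, 8, 16, 32)
n         <- rep(40, 6)
successes <- c(2, 6, 13, 24, 33, 38)   # e.g. number responding
failures  <- n - successes             # successes + failures = n

# Two-column response: successes first, failures second
y <- cbind(successes, failures)

# Binomial GLM (logit link by default)
model <- glm(y ~ log(dose), family = binomial)
print(summary(model))

# Fitted proportions, back on the 0-1 scale
fitted_p <- predict(model, type = "response")
print(round(fitted_p, 3))
```

Because the response carries both counts, the sample size of each group is retained and used in the fit.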
How R Handles Proportions • Weighted regression (weighted by the individual sample sizes). • Logit link to ensure linearity. • Other kinds of percentage data are handled differently: • For percentage cover data, do an arcsine transformation, followed by conventional modelling (normal errors, constant variance). • For percentage change in a continuous measurement (e.g. weight), either use ANCOVA with the final weight as the response and the initial weight as a covariate, or use the relative growth rate, log(final/initial), as the response. • Both produce normal errors.
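The two non-binomial cases can be sketched as follows. All values are invented for illustration, and the variable names (cover, initial, final, treatment) are assumptions rather than anything from the lecture. The arcsine transformation is taken here in its usual form, asin(sqrt(p)).

```r
# Hypothetical percentage cover values
cover <- c(5, 20, 50, 80, 95)

# Arcsine (angular) transformation of the proportion, in radians
asin_cover <- asin(sqrt(cover / 100))
print(asin_cover)

# Hypothetical initial and final weights for two treatment groups
initial   <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.0)
final     <- c(15.3, 16.1, 15.0, 17.9, 14.2, 15.5)
treatment <- factor(c("a", "a", "a", "b", "b", "b"))

# Option 1: ANCOVA, final weight as response, initial weight as covariate
ancova <- lm(final ~ initial + treatment)

# Option 2: relative growth rate as the response
rgr       <- log(final / initial)
rgr_model <- lm(rgr ~ treatment)

print(coef(ancova))
print(coef(rgr_model))
```

Both options use lm, i.e. conventional modelling with normal errors, rather than a binomial glm.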
Tests • To compare a single binomial proportion to a constant, use binom.test. • To compare two samples, use prop.test. • Reserve the glm methods of this lecture for more complex models: • Regression • Contingency tables
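A short sketch of the two simple tests. The counts are hypothetical and chosen only to show the calling convention.

```r
# Single proportion: 14 successes in 20 trials, compared with p = 0.5
bt <- binom.test(x = 14, n = 20, p = 0.5)
print(bt)

# Two samples: 15 successes out of 50 versus 30 successes out of 60
pt <- prop.test(x = c(15, 30), n = c(50, 60))
print(pt)

# Observed proportions reported by each test
print(bt$estimate)
print(pt$estimate)
```

binom.test is an exact test, while prop.test uses a chi-squared approximation (with a continuity correction by default).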
Count Data on Proportions • R supports the usual arcsine and probit transformations: • arcsine makes the error distribution normal • probit linearises the relationship between percentage mortality and log(dose) • However, it is usually better to use the logit transformation and assume you have binomial data.
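In glm, the probit and logit alternatives are selected through the link argument of the binomial family. The sketch below fits both to the same made-up mortality data (dose, n and dead are hypothetical names and values), so that the two links can be compared.

```r
# Hypothetical mortality data: number dead out of n at each dose
dose <- c(1, 2, 4, 8, 16, 32)
n    <- rep(40, 6)
dead <- c(2, 6, 13, 24, 33, 38)
y    <- cbind(dead, n - dead)

# Same linear predictor, two different link functions
logit_model  <- glm(y ~ log(dose), family = binomial(link = "logit"))
probit_model <- glm(y ~ log(dose), family = binomial(link = "probit"))

# Coefficients are on different scales, so compare fit rather than slopes
print(coef(logit_model))
print(coef(probit_model))
print(AIC(logit_model, probit_model))
```

Both models treat the data as binomial counts, so the sample sizes are retained whichever link is chosen.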
Odds • The logistic model for p as a function of x is: p = exp(a+bx)/(1 + exp(a+bx)) • The book notes that this is obviously non-linear. To linearise it, consider instead the odds p/q (as in gambling, where q is 1-p): p/q = exp(a+bx) • Or: ln(p/q) = a + bx • ln(p/q) is called the logit transformation of p
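The algebra above can be checked numerically. In this sketch the values of a, b and x are arbitrary; plogis and qlogis are base R's logistic distribution functions, which act as the inverse logit and the logit respectively.

```r
# Arbitrary illustrative parameters
a <- -2
b <- 0.5
x <- 0:8

# Logistic model for p, and q = 1 - p
p <- exp(a + b * x) / (1 + exp(a + b * x))
q <- 1 - p

# The log odds are linear in x
log_odds <- log(p / q)
print(all.equal(log_odds, a + b * x))

# Built-in equivalents: plogis is the inverse logit, qlogis is the logit
print(all.equal(p, plogis(a + b * x)))
print(all.equal(qlogis(p), log_odds))
```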
R and logit • R does not simply do a linear regression of ln(p/q) against x. It also handles: • non-constant binomial variance • logit(p) going to -∞ and +∞. • differences between sample sizes using weighted regression.
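A small sketch of the last two points, using invented data with unequal sample sizes (dose, n and successes are hypothetical). It shows that the logit is infinite at p = 0 and p = 1, and that a binomial glm on the proportions, weighted by n, gives the same fit as the two-column count form.

```r
# The logit is unbounded at the extremes
print(qlogis(c(0, 1)))   # -Inf and Inf

# Hypothetical data with unequal sample sizes
dose      <- c(1, 2, 4, 8, 16, 32)
n         <- c(10, 20, 40, 80, 40, 20)
successes <- c(1, 4, 14, 48, 33, 19)
failures  <- n - successes

# Form 1: two-column response of counts
model_counts <- glm(cbind(successes, failures) ~ log(dose),
                    family = binomial)

# Form 2: proportion as response, weighted by the sample sizes
prop <- successes / n
model_weighted <- glm(prop ~ log(dose), family = binomial, weights = n)

# The two forms give the same coefficients
print(all.equal(coef(model_counts), coef(model_weighted)))
```

A plain linear regression of log(p/q) on log(dose) would give every group equal weight and would fail outright for any group with 0% or 100% success.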
Over-dispersion and Hypothesis Testing • Everything addressed earlier is still available for proportion data. This includes ANOVA, ANCOVA, and regression analysis. • Significance is assessed using χ² tests. • Hypothesis testing with binomial errors is less clear-cut than with normal errors. Large samples (>30) are necessary. The degree to which the approximation is satisfactory is unknown. p will not be exactly known. • Over-dispersion must usually be addressed. The residual scaled deviance should be about the residual df. Use family = quasibinomial for over-dispersion.
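The over-dispersion check can be sketched as below. The data are invented and deliberately noisy, and the names x, n and successes are hypothetical. The ratio of residual deviance to residual df is used as a rough diagnostic, and an F test is used with quasibinomial because the dispersion is then estimated rather than fixed at 1.

```r
# Hypothetical, deliberately scattered data: 8 groups of 50
x         <- 1:8
n         <- rep(50, 8)
successes <- c(5, 22, 9, 30, 18, 41, 27, 46)
y         <- cbind(successes, n - successes)

# Binomial fit and chi-squared test of the explanatory variable
model <- glm(y ~ x, family = binomial)
print(anova(model, test = "Chisq"))

# Rough check: residual deviance should be about the residual df
dispersion_ratio <- deviance(model) / df.residual(model)
print(dispersion_ratio)

# Refit allowing for over-dispersion; test with F instead of chi-squared
model_q <- glm(y ~ x, family = quasibinomial)
print(anova(model_q, test = "F"))
print(summary(model_q)$dispersion)
```

The quasibinomial fit has the same coefficients as the binomial fit, but its standard errors are inflated by the square root of the estimated dispersion.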
Book Examples • See discussion of how to model with binomial errors. • Logistic regression example. • Categorical explanatory variables example. • ANCOVA example.