
Proportion Data


Presentation Transcript


  1. Proportion Data Harry R. Erwin, PhD School of Computing and Technology University of Sunderland

  2. Resources • Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley. • Freund, RJ, and WJ Wilson (1998) Regression Analysis, Academic Press. • Gentle, JE (2002) Elements of Computational Statistics. Springer. • Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).

  3. Introduction • The four demonstration sessions of this class address special types of data: • Counts • Proportions (this lecture) • Survival analysis • Binary responses

  4. Frequencies and Proportions • With frequency data, we know how often something happened, but not how often it didn’t happen. • With proportion data, we know both. • Applied to: • Mortality and infection rates • Response to clinical treatment • Voting • Sex ratios • Proportional response to experimental treatments

  5. Working With Proportions • Traditionally, proportion data were modelled by using the percentage as the response variable. • This is bad for four reasons: • Errors are not normally distributed. • The variance is not constant. • The response is bounded by 0.0 and 1.0. • The size of the sample, n, is lost.

  6. General Approach • Use a generalized linear model (GLM). • family = binomial (i.e., an unfair coin flip) • Use two vectors, one of the success counts and the other of the failure counts. • number of failures + number of successes = binomial denominator, n • y <- cbind(successes, failures) • model <- glm(y ~ whatever, family = binomial) • A worked sketch follows this slide.
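
  A minimal sketch of this recipe, assuming hypothetical data (the dose vector and the counts below are invented for illustration):

    ## Hypothetical counts: 20 subjects tested at each of five doses.
    dose      <- c(1, 2, 4, 8, 16)
    successes <- c(1, 4, 9, 13, 18)
    failures  <- 20 - successes

    ## Bind the counts into a two-column response matrix; R takes the
    ## row totals as the binomial denominators.
    y <- cbind(successes, failures)

    model <- glm(y ~ dose, family = binomial)
    summary(model)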

  7. How R Handles Proportions • Weighted regression (weighted by the individual sample sizes). • A logit link to ensure linearity. • If you have percentage cover data: • Do an arc-sine transformation, followed by conventional modelling (normal errors, constant variance). • If you have percentage change in a continuous measurement: • Use ANCOVA with final weight as the response and initial weight as a covariate, or • Use the relative growth rate, log(final/initial), as the response. • Both produce normal errors. Sketches of both alternatives follow this slide.
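
  Hedged sketches of the two alternatives, using invented data (cover, treat, initial, and final are hypothetical names):

    ## Percentage cover: arc-sine transform, then an ordinary linear model.
    cover <- c(0.12, 0.35, 0.60, 0.88)     # proportions, not percentages
    treat <- factor(c("a", "a", "b", "b"))
    m1 <- lm(asin(sqrt(cover)) ~ treat)

    ## Percentage change: relative growth rate as the response.
    initial <- c(10.2, 11.5, 9.8, 12.0)
    final   <- c(14.1, 15.0, 16.2, 19.5)
    m2 <- lm(log(final / initial) ~ treat)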

  8. Tests • To compare a single binomial proportion to a constant, use binom.test. • To compare two samples, use prop.test. • Only use the following methods for complex models: • Regression tables • Contingency tables
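
  Both tests are one-liners in base R; the counts here are hypothetical:

    ## One sample: do 19 successes in 40 trials depart from p = 0.5?
    binom.test(19, 40, p = 0.5)

    ## Two samples: compare the proportions 84/200 and 121/250.
    prop.test(c(84, 121), c(200, 250))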

  9. Count Data on Proportions • R supports the usual arcsine and probit transformations: • arcsine makes the error distribution normal • probit linearises the relationship between percentage mortality and log(dose) • However, it is usually better to use the logit transformation and assume you have binomial data.
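
  The three transformations side by side, applied to a hypothetical vector of observed proportions:

    p <- c(0.05, 0.25, 0.50, 0.75, 0.95)
    asin(sqrt(p))   # arcsine (angular) transformation
    qnorm(p)        # probit
    qlogis(p)       # logit, i.e. log(p / (1 - p))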

  10. Odds • The logistic model for p as a function of x is: p = exp(a + bx)/(1 + exp(a + bx)) • The book notes that this is obviously non-linear. To linearise it, consider instead the odds p/q (as in gambling, where q = 1 - p). Since q = 1/(1 + exp(a + bx)), the odds simplify to: p/q = exp(a + bx) • Taking logarithms: ln(p/q) = a + bx • ln(p/q) is called the logit transformation of p.
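
  A quick numerical check of the linearisation; a, b, and x are arbitrary values chosen for illustration:

    a <- -2; b <- 0.5
    x <- seq(0, 10, by = 2)
    p <- exp(a + b * x) / (1 + exp(a + b * x))   # logistic curve in p
    log(p / (1 - p))                             # recovers a + b * x exactly
    a + b * x                                    # same values as the line above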

  11. R and logit • R does not simply do a linear regression of ln(p/q) against x. It also handles: • the non-constant binomial variance, • logit(p) going to -∞ as p approaches 0 and to +∞ as p approaches 1, • differences between sample sizes, using weighted regression.

  12. Over-dispersion and Hypothesis Testing • Everything addressed earlier is still available for proportion data, including ANOVA, ANCOVA, and regression analysis. • Significance is assessed using χ² tests. • Hypothesis testing with binomial errors is less clear-cut than with normal errors: large samples (>30) are necessary, the degree to which the approximation is satisfactory is unknown, and p will not be exactly known. • Over-dispersion must usually be addressed: the residual deviance should be roughly equal to the residual degrees of freedom. Use family = quasibinomial when there is over-dispersion. A sketch of the check follows this slide.
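
  A sketch of the usual over-dispersion check, reusing the hypothetical binomial fit from the earlier sketch:

    ## Ratio of residual deviance to residual degrees of freedom;
    ## this should be roughly 1 if the binomial variance holds.
    deviance(model) / df.residual(model)

    ## If the ratio is much larger than 1, refit with quasibinomial
    ## errors: the dispersion parameter is then estimated from the
    ## data, and terms are assessed with F tests rather than chi-squared.
    model2 <- glm(y ~ dose, family = quasibinomial)
    anova(model2, test = "F")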

  13. Book Examples • See discussion of how to model with binomial errors. • Logistic regression example. • Categorical explanatory variables example. • ANCOVA example.
