Statistical Analysis

Statistical Analysis Harry R. Erwin, PhD School of Computing and Technology University of Sunderland

Resources • Rowntree, D. (1981) Statistics Without Tears. Harmondsworth: Penguin. • Hinton, P.R. (1995) Statistics Explained. London: Routledge. • Hatch, E.M. and Farhady, H. (1982) Research Design and Statistics for Applied Linguistics. Rowley, Mass.: Newbury House. • Crawley, M.J. (2005) Statistics: An Introduction Using R. Wiley. • Gonick, L. and Smith, W. (1993) The Cartoon Guide to Statistics. HarperResource (for fun).

Module Outline • Day One Lectures • Introduction • Using R • Probability (the laws of chance) • Day Two Lectures • Data analysis (the gathering, display, and summarisation of data) • Experimental design (planning and sampling) • Statistical inference (the drawing of conclusions from your data knowing probability) • Data modelling (regression, ANOVA and ANCOVA)

Why the second day is important • You don’t know which tests to use unless you know how your data are structured, so you do data analysis. • Your experimental design is based on what you know beforehand of the data. • Inference is the drawing of conclusions for your research—what can you prove. • Modelling tells you what more detailed conclusions are supportable. This involves throwing out the factors that are not important.

Data Analysis • Central tendency • Degrees of freedom • Variance • A worked example • Confidence intervals • Single sample

Measures of Central Tendency • yvals<-read.table("yvalues.txt", header=TRUE) • attach(yvals) • Create a histogram of the data: hist(y) • Observe the mode, the most common value. • The arithmetic mean is (sum of data values)/(number of values) • total <- sum(y) • n <- length(y) • ybar <- total/n • ybar • mean(y)

Median • The ‘middle value’ • ysorted<-sort(y) • middleIndex<-ceiling(length(y)/2) • ysorted[middleIndex] (exact when the number of values is odd) • median(y) (averages the two middle values when the number of values is even) • set<-c(1,10,1000,10,1) • Geometric mean: exp(mean(log(set))) • Harmonic mean: 1/mean(1/set) • detach(yvals) • ls() • rm(any variables you don’t need)

Measures of Spread • In addition to describing the central point of a data set, we’re concerned with how spread out the data are. • Two measures: • Interquartile range • Standard deviation/variance

Interquartile Range • Break the sorted data into four equal groups; the three cut points are the first through third quartiles. • The median is the second quartile, Q2 • The median of the low group is the first quartile, Q1 • The median of the high group is the third quartile, Q3 • The IQR is Q3-Q1

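Box and Whiskers Plot (Tukey) • [Figure: box plot labelled with Q1, the median, Q3, and the IQR] • An outlier is a point more than 1.5 IQR outside the box (below Q1 or above Q3). • The “whiskers” extend to the furthest non-outlier in both directions.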
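
The quartiles and the 1.5 IQR outlier rule can be sketched in R as follows. The vector x is a toy data set (the same numbers used in the "Using R for this" slide), not real measurements. Note that quantile() interpolates, so for other data it can differ slightly from the "median of each half" recipe.

```r
# Toy data (illustrative only)
x <- c(3, 5, 7, 7, 38)

q   <- quantile(x, c(0.25, 0.5, 0.75))  # Q1, median (Q2), Q3
q1  <- q[[1]]
q3  <- q[[3]]
iqr <- q3 - q1                          # same value as IQR(x)

# Tukey's rule: points more than 1.5 IQR beyond Q1 or Q3 are outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

For these numbers the fences are 2 and 10, so only the value 38 is flagged; boxplot(x) would draw it as a separate point.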

Standard Deviation and Variance • Standard measure of spread (called sd in R). • The variance is the average squared distance of the values from the mean. (Remember geometry?) It is called var in R. The standard deviation is the square root of the variance. • When sample data (count = N) are used to estimate both the mean and the variance, the variance is computed by dividing the sum of squares by N-1. If the variance is estimated by dividing by N, the result is biased low. • The normal (bell-shaped) curve described by the sample mean and standard deviation works very well if N is at least 30. • For N<30, the t distribution applies.

Using R for this • data<-c(3,5,7,7,38) • mean(data) • sd(data) • var(data) • median(data) • quantile(data) • fivenum(data) • boxplot(data)

Random Variables • Imagine an experiment repeated many times. The notation for a random variable is X. • The notation for a single value of X is x. • You can define central tendency and spread just like you can for sample data. You can also predict their values. • R gives you basic functions to compute these.

Plotting a random variable • hist(rbinom(10000,2,0.5)) (number of heads in two coin flips) • die<-c(1,2,3,4,5,6) • a<-numeric(10000) • for(i in 1:10000){ a[i]<-sample(die,1,replace=TRUE,prob=c(1,1,1,1,1,1)) } • hist(a,breaks=0:6+0.5) (die roll) • for(i in 1:10000){ a[i]<-sample(die,1,replace=TRUE,prob=c(1,1,1,1,1,1)) + sample(die,1,replace=TRUE,prob=c(1,1,1,1,1,1)) } • hist(a,breaks=0:12+0.5) (sum of two dice)

Mean and Variance of Random Variables • µ = sum over all possible values x of (x times the probability of x) • This is the mean. • Note this involves area and can deal with continuous probability like the normal distribution (the sum becomes an integral). • The variance, σ², is the sum over all x of ((x-µ)² times the probability of x) • The standard deviation is σ.
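
As a small worked instance of these two sums, here is the mean and variance of a single fair die. The variable names are chosen for this sketch only.

```r
# A fair six-sided die: possible values and their probabilities
x <- 1:6
p <- rep(1/6, 6)

mu     <- sum(x * p)             # mean: sum of x * P(x)
sigma2 <- sum((x - mu)^2 * p)    # variance: sum of (x - mu)^2 * P(x)
sigma  <- sqrt(sigma2)           # standard deviation

c(mean = mu, variance = sigma2, sd = sigma)
```

The mean comes out at 3.5 and the variance at 35/12; a histogram of many simulated rolls should centre on the same value.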

Some Continuous Distributions • For a continuous distribution, the density function times ∆x gives (approximately) the probability of a sample X lying between x and x+∆x. • In R the density is labelled d‘name’, where ‘name’ identifies the distribution, for example dnorm (or dbinom for the discrete binomial). The integral of the density curve up to x is called the cumulative probability distribution. So you get: • dnorm, the density function • pnorm, the cumulative probability function • qnorm, the inverse of the cumulative probability function • rnorm, to draw random numbers from the distribution
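
A brief sketch of how the four functions relate, using the standard normal distribution. The cut-off 1.96 and the seed are arbitrary choices for illustration.

```r
dnorm(0)        # height of the density curve at 0
pnorm(1.96)     # P(X <= 1.96), roughly 0.975
qnorm(0.975)    # the value with 97.5% of the probability below it

# pnorm is the area under dnorm up to the cut-off
area <- integrate(dnorm, -Inf, 1.96)$value
area

set.seed(1)     # only so that the random draws are repeatable
rnorm(3)        # three random draws from the distribution
```

The same d/p/q/r prefixes work for other distributions, for example dt, pt, qt, and rt for Student's t.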

Excursus: Degrees of Freedom • Suppose you have a sample of five numbers (2,7,4,0,7) and their mean is 4. What is the sum of the five numbers? • If you know the mean and four of the numbers, how many values can the fifth one have? • This means that if you are calculating the sample standard deviation and you have already estimated the sample mean, you have one fewer independent data point than you think you do. • df = sample size minus the number of parameters, p, you’ve estimated from the data. (Memorize!) • variance = (sum of squares)/(degrees of freedom)
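
The argument above can be checked numerically with the five numbers from the slide. This is a sketch; the variable names are made up for the example.

```r
x    <- c(2, 7, 4, 0, 7)
n    <- length(x)
xbar <- mean(x)                    # 4

# Knowing the mean fixes the total, so the fifth value is forced
total <- n * xbar                  # 20
fifth <- total - sum(x[1:4])       # must be 7

ss <- sum((x - xbar)^2)            # sum of squares
df <- n - 1                        # one parameter (the mean) was estimated
s2 <- ss / df                      # variance = sum of squares / df

c(forced_fifth = fifth, sum_sq = ss, variance = s2, var_builtin = var(x))
```

Dividing by n instead of n - 1 would give a smaller value, which is the low bias mentioned on the variance slide.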

A Worked Example • gardens.txt in Data • Note that you can test whether two samples plausibly come from populations with the same variance (the null hypothesis). You do this by calculating the ratio of the variances and applying the F test. • In R, this is handled by applying var.test. • The χ² and ANOVA tests comparing means assume equal variance, so you must check this first! If the F test tells you that you don’t have equal variance, don’t go any further.
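
A minimal sketch of the variance check. The vectors below are illustrative values, not a copy of gardens.txt: a and b were chosen so that their means and standard errors agree with the Garden A and Garden B figures quoted in these slides, and c3 is a hypothetical third sample with a much larger spread.

```r
# Illustrative samples (assumed values, n = 10 each)
a  <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
b  <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)
c3 <- c(3, 3, 2, 1, 10, 4, 3, 11, 3, 10)

c(var(a), var(b), var(c3))

# F test: the null hypothesis is that the two variances are equal
same_var <- var.test(a, b)
diff_var <- var.test(a, c3)

same_var$p.value   # large: no evidence the variances differ
diff_var$p.value   # small: equal variance is not credible
```

With a and b, a comparison of means can proceed; with a and c3 the equal-variance assumption fails, so a test that relies on it should not be used.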

Confidence Intervals • Variance is used for testing hypotheses and for establishing confidence intervals (measures of unreliability). • You want your measure of unreliability to • Go up if variance increases • Go down if the sample size increases • SE (standard error) = sqrt(s²/n) has those properties. • You write this as: • “the mean ozone concentration in Garden A was 3.0+/-0.365 pphm (1 s.e., n=10)”

More on Confidence Intervals • You can use the assumption of a normal distribution if n >= 30, but if you have a smaller sample, you usually use Student’s t-distribution. • For the quantiles of this distribution, use qt(). • For a 95% confidence interval (alpha = 0.05, two-tailed), use the t quantile at probability 0.975: qt(0.975,9) = 2.262 standard errors. For 99% and 99.5% intervals, qt(0.995,9) = 3.249836 and qt(0.9975,9) = 3.689662. • For Garden B (small sample): • “the mean ozone concentration in Garden B was 5.0+/-0.826 (95% C.I., n = 10).” • There is a better way, bootstrapping, but it’s complex.
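
The standard error and the 95% interval can be reproduced step by step. The vector b is an assumed stand-in for the Garden B data, chosen so that the result matches the 5.0 +/- 0.826 quoted above; it is not read from gardens.txt.

```r
# Illustrative sample (assumed values, n = 10)
b <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)
n <- length(b)

se    <- sqrt(var(b) / n)          # standard error of the mean
tcrit <- qt(0.975, df = n - 1)     # two-tailed 95% => 0.975 quantile
half  <- tcrit * se                # half-width of the interval

ci <- mean(b) + c(-1, 1) * half
ci

# t.test() reports the same interval
t.test(b)$conf.int
```

Note that the standard error on its own (about 0.365 here) is a narrower "1 s.e." statement; multiplying by the t quantile turns it into a 95% interval.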

Single Sample • Questions to answer: • What is the mean value? • Is the mean value significantly different from expectation or theory? • What is the level of uncertainty associated with our estimate of the mean? • To be reasonably certain, we need to know if the data are normally distributed, have outliers, or show serial correlation.

Worked Example • Load das.txt and follow me. • summary() • plot() • boxplot() • hist()

Normal Distribution • According to the central limit theorem, if you take a large set of samples from a population and take their means, the means will be approximately normally distributed (provided each sample is reasonably large). • Why is deep math. • The quantiles of the normal distribution are calculated by qnorm() • Examples from book (55ff)
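
The central limit theorem can be seen by simulation. This sketch draws from a uniform distribution, which is clearly not bell-shaped; the sample size of 30, the 2000 repetitions, and the seed are arbitrary choices.

```r
set.seed(7)   # only for repeatability

# 2000 samples of size 30 from a flat (uniform) population; keep each mean
means <- replicate(2000, mean(runif(30)))

hist(means)                      # looks bell-shaped
qqnorm(means); qqline(means)     # points lie close to the line

mean(means)   # close to the population mean, 0.5
sd(means)     # close to sqrt(1/12) / sqrt(30)
```

Replacing runif with a skewed generator such as rexp is a useful variation: the histogram of means is still roughly normal, although convergence is slower.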

Testing Normality • A normal distribution is very easy to use, but you need to check first. • Use qqnorm() and qqline() • Examples (y) • Examples (speed) • Note non-normality. To test a mean when the distribution is non-normal, you don’t use Student’s t. Instead you use Wilcoxon’s signed rank test. • wilcox.test(speed, mu=990) (wilcox.test is in the stats package in current versions of R; older versions needed library(ctest) first)
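
A self-contained sketch of the same workflow on simulated data, since the speed data set is not reproduced here. The skewed sample x and the hypothesised value mu = 3 are assumptions made purely for illustration.

```r
set.seed(42)      # only for repeatability
x <- rexp(50)     # a deliberately skewed, non-normal sample

# Normal quantile plot: strong curvature away from the line signals non-normality
qqnorm(x)
qqline(x)

# Wilcoxon signed rank test of the hypothesised centre mu = 3
res <- wilcox.test(x, mu = 3)
res$p.value
```

A small p-value here says the data are not centred on the hypothesised value, without relying on normality.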

Student’s t • Use if the sample size is <30 and the data are normally distributed. • Use pt instead of pnorm; qt instead of qnorm • Examples from book (67ff)

Test Statistics for the Mean • If you have 30 or more observations (n), the distribution of (X-µ)/(s/√n), where X is the sample mean and s the sample standard deviation, is approximately normal. You can test whether the mean you computed (X) is significantly different from µ by calculating the probability of a value at least that extreme. • If you have fewer than 30 observations, (X-µ)/(s/√n) follows Student’s t distribution, and you need to use that instead. • Guess why ‘30’ is important…
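
The test statistic can be computed by hand and compared with t.test(). The sample b and the hypothesised mean mu0 = 4 are assumed values for this sketch only.

```r
# Illustrative small sample (n = 10) and a hypothetical value to test against
b   <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)
mu0 <- 4
n   <- length(b)

# (sample mean - mu) / (s / sqrt(n))
tstat <- (mean(b) - mu0) / (sd(b) / sqrt(n))

# n < 30, so use Student's t with n - 1 degrees of freedom (two-tailed)
p <- 2 * pt(-abs(tstat), df = n - 1)

check <- t.test(b, mu = mu0)   # built-in equivalent
c(by_hand = tstat, builtin = unname(check$statistic))
c(by_hand = p, builtin = check$p.value)
```

For n of 30 or more, replacing pt(..., df = n - 1) with pnorm(...) gives nearly the same answer, which is the point of the rule of thumb.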

Comparing two samples • To compare two variances, use Fisher’s F test, var.test(). Do this first! • For comparing sample means with normal errors, Student’s t test, t.test() (can be used for paired data) • For comparing sample means with nonnormal errors, Wilcoxon’s rank test, wilcox.test() • For proportions, use the binomial test, binom.test() (binary data) or prop.test() (binomial proportions) • For independence in contingency tables, chi-square test, chisq.test(), or Fisher’s exact test, fisher.test() • For two correlated variables, cor.test()
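
A short sketch of the order of operations for two samples: check the variances first, then compare the means. The vectors a and b are assumed illustrative values, not data from the course files.

```r
# Illustrative samples (assumed values, n = 10 each)
a <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
b <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)

# 1. Compare variances first
vt <- var.test(a, b)
vt$p.value

# 2. Normal errors and equal variances: Student's t test
tt <- t.test(a, b, var.equal = TRUE)
tt$p.value

# 3. If the errors were non-normal: Wilcoxon's rank test
#    (exact = FALSE because tied values rule out an exact p-value)
ww <- wilcox.test(a, b, exact = FALSE)
ww$p.value
```

For paired measurements, t.test(a, b, paired = TRUE) is the corresponding call.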

Two Sample Examples • Follow me on these. (73ff)

Using 2 • Lots of statistical data are in the form of counts • Contingency tables show all the possible occurrences in a sample. The question is are these statistically different?

Completing the table

Computing the probability of fair hair and blue eyes • If and only if the two traits are independent, the probability of the combination equals the product of the probabilities of the individual traits. • Under independence, the expected count can be estimated as about 22 cases. • Since the observed cell value is 38, the assumption of independence is at risk. • What is the probability of the observed frequencies occurring by chance?
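
The "about 22 cases" figure can be reproduced from the four counts used in the test slide (38, 14, 11, 51). This sketch assumes, as the slides state, that the top-left cell (38) is the fair hair and blue eyes count; the other variable names are made up here.

```r
# Counts entered columnwise; cell [1,1] = 38 is fair hair and blue eyes
count <- matrix(c(38, 14, 11, 51), nrow = 2)
n <- sum(count)

# Marginal probabilities of the two traits that meet in cell [1,1]
p_row <- sum(count[1, ]) / n
p_col <- sum(count[, 1]) / n

# Under independence: P(both) = product, expected count = n * P(both)
expected_11 <- n * p_row * p_col
expected_11

# All four expected counts at once
expected <- outer(rowSums(count), colSums(count)) / n
expected
```

The expected count is a little over 22, well below the observed 38.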

The 2 Test • The degrees of freedom in a contingency table equal (r-1)x(c-1), where r and c are the number of columns. • Here, df = 1. • What certainty level do you want? 95% is typical. • qchi(0.95,1) = 3.841459 • count<-matrix(c(38,14,11,51),nrow=2) • The data should be entered columnwise (like before) • To test, chisq.test(count) • Here, the correlation between fair hair and blue eyes is highly significant. • If the expected frequencies are <= 5, use Fisher’s exact test instead, fisher.test(count) or combine cells.

Summary • We have seen ways of • Describing data • Testing single sample data against null hypotheses • Testing two sample data against null hypotheses

Experimental Design Harry R. Erwin, PhD School of Computing and Technology University of Sunderland

Lecture Outline • Experimental Design • The process of defining how to collect data that will allow you to falsify a hypothesis. • How to do it. • Replication • Randomization

Categorical variables • These take discrete values. • A complete experimental design investigates every combination. This is called a factorial design. This is required for reliable results. • For example, if you have two categorical variables, A and B, with two states, 1 and 2, each, you have to explore A1B1, A1B2, A2B1, and A2B2.
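
The full set of combinations in a factorial design can be listed with expand.grid(). The factor names A and B and their levels follow the example on the slide.

```r
# Every combination of the levels of A and B
design <- expand.grid(A = c("A1", "A2"), B = c("B1", "B2"))
design
nrow(design)   # 2 x 2 = 4 treatment combinations
```

Adding a third two-level factor doubles the number of rows, which is why full factorial designs grow quickly.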

Continuous Variables • You have to sample at multiple values. • For example, if an explanatory variable ranges between 1 and 10, you should run an experiment at 1 and another at 10, and a few between. • This converts the continuous variable into a categorical variable.

Sampling • You may not be able to control the values of the categorical and continuous variables. In natural experiments, you need to sample randomly. • The goal of random sampling is to move systematic response into the error term • Take care to avoid systematic sampling. If necessary, flip a coin or generate a random number.

Replication • This means you repeat a measurement with a specific value of a categorical and/or continuous explanatory variable. • This allows you to assess natural variability and measurement error. • In many experiments, 30 replications is about the maximum necessary. Fewer may have to be accepted, but then take care in your analysis.

Randomization • You randomize to eliminate systematic errors. • Avoid correlating your measurements in time and space. • Avoid doing things that might introduce systematic effects. • Avoid allowing your judgment to affect when, where, and with what/whom you do a given experiment. Assign treatments randomly.
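
Random assignment is easy to delegate to R rather than to your own judgment. In this sketch the twelve units, the two treatment labels, and the seed are all hypothetical.

```r
set.seed(123)   # record the seed so the allocation can be reproduced

units      <- 1:12
treatments <- rep(c("control", "treated"), each = 6)

# Shuffle the treatment labels across the units
assignment <- data.frame(unit = units, treatment = sample(treatments))
assignment

# Also randomise the order in which the units are measured
run_order <- sample(units)
run_order
```

Shuffling a fixed set of labels keeps the groups balanced (six units each) while leaving the allocation to chance.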

The Design • The elements of an experimental design are the experimental units. • The treatments are assigned to the units. (Note that this translates continuous variables to categorical ones). • The objective of the design is to compare the treatments.

Local Control • Consider ways to reduce natural variability. • One way is to group similar experimental units into blocks. • Running all treatments on all blocks produces a complete randomised block design. • If you have enough subjects, you can repeat the design. This increases replication.
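
A complete randomised block design can be laid out as below: every treatment appears once in every block, in a random order within the block. The three treatments and four blocks are hypothetical.

```r
set.seed(2024)   # for a reproducible layout

treatments <- c("T1", "T2", "T3")
blocks     <- paste0("block", 1:4)

# Within each block, assign all treatments to the plots in random order
design <- do.call(rbind, lapply(blocks, function(b) {
  data.frame(block = b,
             plot = seq_along(treatments),
             treatment = sample(treatments))
}))
design

# Each treatment occurs exactly once per block
table(design$block, design$treatment)
```

Repeating the whole layout with fresh randomisation is one way to add replication when more subjects are available.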

Analyzing the Results • You allocate total variability among the different sources: each factor, systematic effects, and natural variability/measurement error. • This is done using analysis of variance.
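
The allocation of variability can be seen in the sums of squares that aov() reports. The data here are simulated under assumed treatment and block effects, purely to show the mechanics.

```r
set.seed(1)   # simulated data, for illustration only

# Three treatments in each of four blocks
d <- expand.grid(treatment = factor(c("T1", "T2", "T3")),
                 block = factor(paste0("B", 1:4)))
d$y <- 10 + 2 * (d$treatment == "T2") + as.integer(d$block) + rnorm(nrow(d))

fit <- aov(y ~ block + treatment, data = d)
tab <- summary(fit)[[1]]
tab   # rows: block, treatment, residuals (natural variability / error)

# The sums of squares add up to the total variability about the grand mean
total_ss <- sum((d$y - mean(d$y))^2)
c(sum(tab[["Sum Sq"]]), total_ss)
```

The residual row is the part not explained by blocks or treatments, and the F values compare each source against it.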

Statistical Inference Harry R. Erwin, PhD School of Computing and Technology University of Sunderland

Statistical Inference • Statistical inference is the drawing of conclusions from specific data knowing probability. • Basically, you are assessing how well a hypothesis stands up to your data. • A null hypothesis is plausible but probably not true. • You show this by demonstrating that the data you collected would be very unlikely if the null hypothesis were true. • This is called ‘falsifying a hypothesis’.

How Do You Falsify a Hypothesis? • Discuss

The Null Hypothesis • You start with a null hypothesis—a statistical statement that you intend to show is very unlikely. • This is usually that the observations are due to chance • Testing can involve the mean, the variance, or a comparison between two (or more) samples where one has a treatment and the other doesn’t.

The Test Statistic • This will be a statistic that assesses the evidence against the null hypothesis. • This may be a normal distribution (continuous data), a binomial distribution (coin flipping), or comparison to a second experiment with the treatment missing.

Calculating the p value • This is the probability of results at least as extreme as yours, assuming the null hypothesis is true.

Compare the p-value to a fixed significance level, α • α is the probability of falsely rejecting a true null hypothesis that you are willing to accept. 0.05, 0.01, and 0.001 are typical. • Choose α before calculating p. Otherwise you’re cheating.
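
The whole procedure, from null hypothesis to decision, fits in a few lines for a coin-flipping experiment. The observed result (16 heads in 20 flips) is a made-up example.

```r
# 1. Fix the significance level before looking at the data
alpha <- 0.05

# 2. Null hypothesis: the coin is fair (probability of heads = 0.5)
heads <- 16   # hypothetical observation
flips <- 20

# 3. p-value: probability of a result at least this extreme if the null is true
res <- binom.test(heads, flips, p = 0.5)
res$p.value

# 4. Decision
reject_null <- res$p.value < alpha
reject_null
```

With these numbers the p-value is about 0.012, so the null hypothesis is rejected at the 0.05 level but would not be at the 0.01 level, which is why the level has to be chosen beforehand.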
