A BRIEF INTRODUCTION TO STATISTICS WITH R

Workshop Notes • http://www.cs.utsa.edu/~jroy/workshop • Data is from the University of York project on variation in British liquids. • JK Local, Alan Wrench, Paul Carter

References • Woods, Anthony, Paul Fletcher and Arthur Hughes. 1986. Statistics in Language Studies. Cambridge: Cambridge University Press. • Rietveld, Toni and Roeland van Hout. 2005. Statistics in Language Research: Analysis of Variance. New York: Mouton de Gruyter. • Dalgaard, Peter. 2002. Introductory Statistics with R. New York: Springer. • Venables, W.N. and B.D. Ripley. 2002. Modern Applied Statistics with S. New York: Springer.

Modeling Data • Statistical Modeling • Randomness • Main Focus: Account for the randomness so that significant patterns can be seen. • Other types: Deterministic (mathematical), Information Theoretic.

Types of Data • Continuous [Interval] • Formant Frequency • Continuous in theory [not necessarily in practice] • Discrete • Counts (e.g. # of students) • Binary (e.g. /t,d/ or Ø) • Rates (e.g. 3 per 2 hours) • Non-numerical • Ordinal (some implied order) • High, Medium, Low • A+, A, A-, B+, B, C+, C, D, F • Non-ordinal (no intrinsic order) • Sex

Probability • [0,1] • Probabilities cannot be larger than 1 (or 100%) or less than 0 (or 0%). • If only three mutually exclusive events (E1, E2 and E3) are possible, their probabilities sum to one. • P(E1) + P(E2) + P(E3) = 1

Probability Functions • Mathematical functions of probability • For discrete variables probability is fairly straightforward to calculate, but for continuous variables you have to use calculus. • Continuous variables often have lookup tables that make this easier.

Binomial • If we have n independent events that can each be classified as either success or failure, and success has the same chance for each event, then we can use a binomial probability distribution as the model. • Examples of binomial variables: • Tossing a coin {Heads or Tails} • -t,d deletion {/t,d/, Ø}

Binomial Distribution • p(x) = n!/(x!(n-x)!) * p^x (1-p)^(n-x) • p(x) = probability of x successes • p = probability of a success • n = number of trials • x = number of successes.

Example of Binomial • Flipping a quarter 10 times. Each flip has a .5 probability of success. • X (the number of heads) is unknown. • What is the probability of 5 heads? P(X=5) = 10!/(5!5!) * .5^5 (1-.5)^(10-5) = 252/1024 ≈ .246
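The coin example can be checked in R. This is a minimal sketch that evaluates the binomial formula by hand and compares it with the built-in `dbinom`; the variable names are illustrative only.

```r
# Probability of exactly 5 heads in 10 flips of a fair coin.
n <- 10   # number of trials
x <- 5    # number of successes
p <- 0.5  # probability of success on each trial

# Binomial formula written out: choose(n, x) is n! / (x! (n - x)!)
by_hand <- choose(n, x) * p^x * (1 - p)^(n - x)

# The same probability from R's binomial density function
built_in <- dbinom(x, size = n, prob = p)

print(c(by_hand = by_hand, built_in = built_in))  # both equal 252/1024
```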

Normal Distribution • [Bell Curve]: Probability is symmetric about a mean. • Examples of variables assumed to be normal: • Grades • IQ • Income • Two population parameters: Mean (µ) and Variance (σ²) [we usually write X ~ N(µ, σ²), read as "X is distributed normally with mean mu and variance sigma squared"] • Continuity creates a problem for measuring probability (e.g. P(IQ = 100) is 0, but P(70 < IQ < 120) = .8), so probabilities are computed over intervals. • σ is the scale parameter and represents the width of the distribution. • µ is the location parameter and represents the center of the distribution.
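Interval probabilities for a normal variable come from the cumulative distribution function, which R exposes as `pnorm`. The sketch below assumes IQ ~ N(100, 15²); the notes do not state the mean or standard deviation, so both values are assumptions and the resulting probability depends on them.

```r
# Assumed IQ scale (not given in the notes): mean 100, standard deviation 15.
mu <- 100
sigma <- 15

# P(70 < IQ < 120) is the difference of two cumulative probabilities.
p_interval <- pnorm(120, mean = mu, sd = sigma) - pnorm(70, mean = mu, sd = sigma)
print(p_interval)

# For a continuous variable a single point carries no probability:
# the "interval" from 100 to 100 has zero width.
p_point <- pnorm(100, mean = mu, sd = sigma) - pnorm(100, mean = mu, sd = sigma)
print(p_point)
```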

Standard Normal • Usually, we try to reduce a normal variable to its standardized form by subtracting the mean and dividing by the standard deviation: z = (x - µ)/σ. • This standardized form is usually referred to with the variable z and is distributed N(0,1).

Normal Distributions

Normal Distribution Table P(0 ≤ z ≤ a)
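In R the lookup table can be replaced by `pnorm`. As a rough sketch, the area P(0 ≤ z ≤ a) is the cumulative probability at a minus the half of the distribution that lies below zero; the values of a are arbitrary examples.

```r
# Areas under the standard normal curve between 0 and a.
a <- c(0.5, 1.0, 1.96, 2.58)   # example cut-offs, chosen for illustration
area <- pnorm(a) - 0.5         # pnorm(a) is P(z <= a); subtract P(z <= 0) = 0.5

print(round(data.frame(a = a, area = area), 4))
```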

Measures of Central Tendency • Mean: Intuitively, the average value of a variable. (The average GPA of the undergrad students is 7.0.) • The mean can be skewed by extreme values (if 4 people have a GPA of 9.5 and one person has a 5.0, the mean is 8.6). • Median: The value of x that divides the total probability into .5 on both sides.
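The GPA example can be reproduced directly. The vector name `gpa` is illustrative; the five values are the ones given in the example.

```r
# Four students with a GPA of 9.5 and one with 5.0.
gpa <- c(9.5, 9.5, 9.5, 9.5, 5.0)

gpa_mean <- mean(gpa)      # pulled down by the single low value
gpa_median <- median(gpa)  # unaffected by how low the lowest value is

print(c(mean = gpa_mean, median = gpa_median))
```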

Measures of Spread • Range: The distance between the highest and the lowest value. • Variance: The average squared distance of x from the mean, with each value weighted by p(x). • Standard Deviation: The square root of the variance.
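A short sketch of these measures in R, using a small illustrative vector. Note that R's `var` and `sd` compute the sample versions, which divide by n - 1, so the version that weights every value equally by 1/n is written out separately.

```r
# Illustrative data: the five GPA values from the earlier example.
x <- c(9.5, 9.5, 9.5, 9.5, 5.0)

spread_range <- max(x) - min(x)      # distance between highest and lowest value
pop_var <- mean((x - mean(x))^2)     # average squared distance, each value weighted 1/n
samp_var <- var(x)                   # sample variance, divides by n - 1
samp_sd <- sd(x)                     # square root of the sample variance

print(c(range = spread_range, pop_var = pop_var, samp_var = samp_var, samp_sd = samp_sd))
```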

Population Parameters vs Sample Statistics • Each of the previous measures exists for probability distributions. These are usually referred to as population parameters. • Sample statistics are calculated from random samples of the population.

Sample Statistics • Descriptive Statistics • Sample Mean • Sample Median • Sample Variance • Sample Standard Deviation

Sample Statistics • Inferential Statistics • Testing hypotheses about the population from which the sample(s) originated. • Forming intervals that describe the possible values of population parameters based on a sample • Provides a framework for interpreting samples in a consistent methodical manner.

Hypothesis Testing • Formulate the hypothesis as a null and an alternative hypothesis. • Suppose I want to test whether the population mean for men of the first formant of /l/ at the first measure differs from 455. • H0: µ = 455 • HA: µ ≠ 455 • Select an alpha value. • Alpha is the probability of rejecting the null hypothesis when the null hypothesis is true. • Usually this is .05 (depending on what you're doing, alpha values of up to .10 may not be unreasonable).

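Errors • Type I error: rejecting the null hypothesis when it is true (probability alpha). • Type II error: failing to reject the null hypothesis when it is false (probability beta).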

Hypothesis Testing • From your alpha value, you can select your critical region (or rejection region). • This is usually done from a look-up table (or computer program). • Calculate the test statistic, T = (x̄ - µ0)/(σ/√n). This test assumes we know the population variance; since we have more than 30 measurements, the sample variance is a reasonable stand-in and we can use this test.

Hypothesis Testing • T = 1.78 • The rejection region for alpha = .05 (since we are doing a two-sided test) is bounded by Z(.975) = 1.96 and Z(.025) = -1.96. • .05/2 = .025; 1 - .025 = .975 and 0 + .025 = .025 • We reject the null hypothesis if T > Z(.975) or T < Z(.025). • Neither is true, so we fail to reject the null hypothesis.
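The critical values and the decision rule can be sketched in R with `qnorm`, which inverts the standard normal table. The variable names are illustrative; the test statistic 1.78 and alpha of .05 are the values used in this example.

```r
alpha <- 0.05
t_stat <- 1.78   # test statistic from the example

# Two-sided test: split alpha between the two tails.
crit <- qnorm(c(alpha / 2, 1 - alpha / 2))   # lower and upper critical values

# Reject only if the statistic falls outside the critical values.
reject <- t_stat < crit[1] || t_stat > crit[2]

print(round(crit, 2))
print(reject)
```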

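P-values • A p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. Low p-values indicate stronger evidence against the null hypothesis. • For our data, the area in one tail beyond T = 1.78 is .0375, so the two-sided p-value is about .075.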
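The tail area for T = 1.78 and the corresponding two-sided p-value can be checked with `pnorm`. This is a sketch under the same normal approximation used for the rejection region; the variable names are illustrative.

```r
t_stat <- 1.78

# Area in the upper tail beyond the observed statistic.
one_tail <- pnorm(t_stat, lower.tail = FALSE)

# A two-sided test counts equally extreme values in both tails.
p_two_sided <- 2 * one_tail

print(round(c(one_tail = one_tail, p_two_sided = p_two_sided), 4))
```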

Hypothesis Testing
        One Sample t-test
data:  york.male$F1.0
t = 1.7859, df = 159, p-value = 0.07602
alternative hypothesis: true mean is not equal to 455
95 percent confidence interval:
 452.3576 507.5638
sample estimates:
mean of x
 479.9607
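Output of this form comes from R's `t.test`. The York data are not included in these notes, so the sketch below uses simulated values as a stand-in: the vector `f1`, its mean of 480 and its standard deviation of 177 are assumptions chosen only to resemble the summary above, and the numbers it prints will differ from the real output.

```r
set.seed(1)

# Hypothetical stand-in for york.male$F1.0: 160 simulated measurements.
f1 <- rnorm(160, mean = 480, sd = 177)

# One-sample, two-sided t-test of H0: mu = 455.
res <- t.test(f1, mu = 455)
print(res)

# Individual pieces of the result can be extracted by name.
res$statistic   # t value
res$parameter   # degrees of freedom, n - 1
res$p.value
res$conf.int    # 95 percent confidence interval for the mean
```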

T-tests • Suppose I want to test the hypothesis that men and women are different in their production of /l/ for the first measurement of the first formant. • What is the null? • What test should I use? [I have a large sample and I assume equal variance] • df = n1 + n2 − 2

T-tests
        Two Sample t-test
data:  york.data$F1.0 by york.data$Sex
t = 3.9047, df = 318, p-value = 0.0001152
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  48.68323 147.56921
sample estimates:
mean in group female   mean in group male
            578.0869             479.9607
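A two-sample test with pooled variance can be requested from `t.test` with `var.equal = TRUE`. The data frame `d` below is simulated and purely hypothetical: its column names mirror the output above, but the group means and standard deviations are assumptions, so the printed numbers will not match the York results.

```r
set.seed(2)

# Hypothetical stand-in for york.data: 160 female and 160 male measurements.
d <- data.frame(
  F1.0 = c(rnorm(160, mean = 578, sd = 200), rnorm(160, mean = 480, sd = 200)),
  Sex  = rep(c("female", "male"), each = 160)
)

# Two-sample t-test assuming equal variances; df = n1 + n2 - 2.
res <- t.test(F1.0 ~ Sex, data = d, var.equal = TRUE)
print(res)
```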

Confidence Interval • We estimate CIs for population parameters based on the data and a set of assumptions.

Confidence Interval • For Men, we get that a 95% CI for the mean value of F1.0 is (452.3576, 507.5638)
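The interval can be rebuilt from the one-sample t-test summary for men. In this sketch the standard error is not reported directly, so it is recovered from the printed t value, the sample mean and the hypothesized mean of 455; small differences in the last decimals come from rounding in the printed output.

```r
# Values taken from the one-sample t-test output for york.male$F1.0.
xbar  <- 479.9607   # sample mean
t_obs <- 1.7859     # reported t statistic
mu0   <- 455        # hypothesized mean
df    <- 159        # degrees of freedom

# t = (xbar - mu0) / se, so the standard error can be backed out.
se <- (xbar - mu0) / t_obs

# 95% CI: sample mean plus or minus the t critical value times the standard error.
ci <- xbar + c(-1, 1) * qt(0.975, df) * se
print(round(ci, 2))
```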

Interpretation of 95% CI • Many people want to say that a 95% confidence interval means that there is a 95% chance that the confidence interval contains the population mean. But any particular confidence interval either contains the population mean, or it doesn’t. The confidence interval shouldn’t be interpreted as a probability. • If samples of the same size are drawn repeatedly from a population, and a confidence interval is calculated from each sample, then 95% of these intervals should contain the population mean.
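The repeated-sampling interpretation can be illustrated with a small simulation. This is a sketch under assumed values: the population mean, standard deviation, sample size and number of repetitions are all hypothetical choices, not figures from the York data.

```r
set.seed(42)

mu    <- 480    # assumed population mean
sigma <- 175    # assumed population standard deviation
n     <- 160    # size of each sample
reps  <- 2000   # number of repeated samples

# For each repetition, draw a sample, compute a 95% CI, and record
# whether that interval contains the true population mean.
covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

coverage <- mean(covered)   # proportion of intervals containing mu
print(coverage)
```

Each individual interval either contains mu or does not; it is the long-run proportion across samples that should be close to .95.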

Assumption of Equal Variance • We can test for equal variance in the same manner we test for equal mean.
        F test to compare two variances
data:  york.data$F1.0 by york.data$Sex
F = 2.2331, num df = 159, denom df = 159, p-value = 5.986e-07
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.634604 3.050717
sample estimates:
ratio of variances
          2.233096
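The F test is available in R as `var.test`. When it rejects equal variances, as it does here, one common option is Welch's t-test, which is what `t.test` runs by default (`var.equal = FALSE`); the notes do not say which follow-up was used, so treat this as one possible approach. The data frame below is simulated and hypothetical, with standard deviations chosen only to give a variance ratio near the one reported.

```r
set.seed(7)

# Hypothetical stand-in for york.data with unequal group variances.
d <- data.frame(
  F1.0 = c(rnorm(160, mean = 578, sd = 265), rnorm(160, mean = 480, sd = 177)),
  Sex  = rep(c("female", "male"), each = 160)
)

# F test comparing the two group variances.
vt <- var.test(F1.0 ~ Sex, data = d)
print(vt)

# Welch two-sample t-test: does not assume equal variances,
# and its degrees of freedom are adjusted downward accordingly.
welch <- t.test(F1.0 ~ Sex, data = d)
print(welch)
```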

Correlation • Many times we are not interested in the differences between two groups, but instead in the relationship between two variables on the same set of subjects. • Ex: Are post-graduate salary and GPA related? • Ex: Is the F1.0 measurement related to the F1.1 measurement? • Correlation is a measurement of LINEAR dependence. Non-linear dependencies have to be modeled in a separate manner.

Correlation • There is a theoretical (population) correlation, usually represented by ρ(X,Y). • We can calculate the sample correlation between two variables (x, y) with the Pearson coefficient: r = Σ(xi − x̄)(yi − ȳ) / sqrt(Σ(xi − x̄)² · Σ(yi − ȳ)²). • r varies between -1.0 and 1.0: the sign indicates the direction of the relationship and the magnitude its strength.

Correlation
        Pearson's product-moment correlation
data:  york.data$F1.0 and york.data$F1.1
t = 45.9262, df = 318, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9161942 0.9452264
sample estimates:
     cor
0.932194
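This output is produced by `cor.test`. The sketch below computes the Pearson coefficient by hand and compares it with `cor.test` on simulated data; the vectors `x` and `y` are hypothetical stand-ins for the F1.0 and F1.1 measurements, so the printed values will differ from those above.

```r
set.seed(3)

# Hypothetical stand-ins for york.data$F1.0 and york.data$F1.1 (320 pairs).
x <- rnorm(320, mean = 530, sd = 220)
y <- x + rnorm(320, mean = 0, sd = 85)   # y is x plus noise, so the two are linearly related

# Pearson coefficient written out from its definition.
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The t statistic for H0: correlation = 0, with n - 2 degrees of freedom.
df <- length(x) - 2
t_manual <- r_manual * sqrt(df) / sqrt(1 - r_manual^2)

res <- cor.test(x, y)   # Pearson is the default method
print(res)
print(c(r_manual = r_manual, t_manual = t_manual))
```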

Now to R
