Statistical Methods for Corpus Analysis Xiaofei Lu APLNG 596D July 14, 2009
Overview
  Describing data
  Comparing groups
  Describing relationships
Basic concepts
  Probability experiments: jargon
    Experiment: a situation for which the outcomes occur randomly
    Sample space (Ω): the set of all possible outcomes
    Outcome (ω): a point in the sample space
    Event: a subset of the sample space
Example 1
  Experiment: toss a fair die
  6 outcomes: 1, 2, 3, 4, 5, 6
  Sample space: Ω = {1, 2, 3, 4, 5, 6}
  An event is any subset of the sample space
    "An even number is rolled": A = {2, 4, 6}, P(A) = 1/2
    "A 3 is rolled": B = {3}, P(B) = 1/6
Example 2
  Experiment: toss 2 fair dice
  Outcomes: ordered pairs (x, y), where x and y are the results of the 1st and 2nd toss respectively
  Sample space: Ω = {(x, y) | x = 1, 2, …, 6 and y = 1, 2, …, 6}
  An event is any subset of Ω, e.g., "sum is 7"
    A = {(x, y) | x + y = 7} = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}
More jargon
  Running example: A = "sum is 7"; B = "first toss is an odd number"
  Union of two events
    The event C that either A or B occurs or both occur, C = A∪B
  Intersection of two events
    The event C that both A and B occur, C = A∩B
  Complement of an event
    The event that A does not occur
  Disjoint events
    Two events with no common outcome
Independence
  Two events A and B are independent if knowing that one has occurred gives no information about whether the other has occurred
    P(A∩B) = P(A)P(B)
  Example: outcomes of two successive tosses of an unbiased coin
    P(2 heads) = P(A=1 ∩ B=1) = P(A=1) × P(B=1) = 1/2 × 1/2 = 1/4
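The product rule can be checked by brute force on the two-dice events used earlier (A = "sum is 7", B = "first toss is an odd number"). The following is a minimal sketch, not part of the original slides; the helper name `prob` is an assumption introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice: all 36 ordered pairs (x, y).
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a subset of omega) when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

A = {(x, y) for (x, y) in omega if x + y == 7}   # "sum is 7"
B = {(x, y) for (x, y) in omega if x % 2 == 1}   # "first toss is an odd number"

p_a, p_b, p_ab = prob(A), prob(B), prob(A & B)

# A and B are independent exactly when P(A∩B) equals P(A)P(B).
independent = (p_ab == p_a * p_b)
print(p_a, p_b, p_ab, independent)
```

Exact fractions are used instead of floats so that the equality test is not affected by rounding.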
Random variable
  Random variable (X)
    Essentially a random number
    Formally, a function from Ω to the real numbers
  Discrete random variable
    A random variable that can take on only a finite or a countably infinite number of possible values
Example 3
  Experiment: toss a biased coin 3 times
  Bias: P(Heads) = 0.6
  Ω = {hhh, hht, htt, hth, ttt, tth, thh, tht}
  X = total number of heads in the 3 tosses
  X is a random variable (r.v.): a function from Ω to the real numbers with possible values x = 0, 1, 2, 3
Example 3 (cont.)
  P(X=0) = 0.4³ = 0.064
  P(X=1) = 3 × 0.6 × 0.4² = 0.288
  P(X=2) = 3 × 0.6² × 0.4 = 0.432
  P(X=3) = 0.6³ = 0.216
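The probabilities in Example 3 can be reproduced by enumerating the eight outcomes and adding up the probability of each one. This is a small illustrative sketch; the names `P_HEADS` and `pmf` are assumptions introduced here, and the tosses are assumed to be independent.

```python
from itertools import product

P_HEADS = 0.6  # bias given in Example 3

# Enumerate the 8 outcomes of three tosses and accumulate P(X = x),
# where X is the number of heads in the outcome.
pmf = {x: 0.0 for x in range(4)}
for outcome in product("ht", repeat=3):
    heads = outcome.count("h")
    # Independent tosses: multiply the per-toss probabilities.
    pmf[heads] += P_HEADS ** heads * (1 - P_HEADS) ** (3 - heads)

for x, p in pmf.items():
    print(x, round(p, 3))
```

The four values form the probability mass function of X and sum to 1.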
Random variable (cont.)
  Continuous random variable
    A random variable that can take an uncountably infinite number of possible values, e.g., height
    Defined over an interval of values, e.g., (0, 2]; probabilities are represented by areas under a curve
    The probability of observing any single value is equal to 0
Probability distribution
  Describes the possible values of a random variable and their probabilities
    Probability mass function (discrete)
    Probability density function (continuous)
Descriptive vs. inferential statistics
  Descriptive statistics
    Summarize important properties of observed data
    Measures of central tendency
    Measures of variability
  Inferential statistics
    The use of statistics to make inferences concerning some unknown aspect of a population
    Hypothesis testing
Measures of central tendency
  The most typical score for a data set
  The mode
    The most frequently obtained score in a data set: for (2, 4, 4, 7, 8), the mode is 4
  The median
    The central score in a sample with an odd number of items: for (2, 4, 4, 7, 8), the median is 4
    The average of the two central scores in a sample with an even number of items: for (2, 4, 4, 7, 8, 100), the median is (4 + 7)/2 = 5.5
Measures of central tendency (cont.)
  The mean
    The average of all scores in a data set: for (2, 4, 4, 7, 8), the mean is 25/5 = 5
  Disadvantage of the mean
    Affected by extreme values: for (2, 4, 4, 7, 100), the mean is 117/5 = 23.4
    What is a more suitable measure in such cases?
Measures of variability
  Statistical dispersion in a r.v. or probability distribution
  Range
    Highest value minus lowest value: for (2, 4, 4, 7, 8), 8 - 2 = 6
    Affected by extreme scores: for (2, 4, 4, 7, 100), 100 - 2 = 98
  Inter-quartile range (IQR): the difference between
    the value ¼ of the way from the top, and
    the value ¼ of the way from the bottom
  Semi inter-quartile range: ½ of the IQR
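The effect of one extreme score on these summary measures can be seen by computing them for the two example data sets from the slides. This is a rough sketch using Python's standard `statistics` module; the helper `summarize` is a name introduced here for illustration.

```python
import statistics

def summarize(data):
    """Return the mode, median, mean, and range of a list of scores."""
    return {
        "mode": statistics.mode(data),
        "median": statistics.median(data),
        "mean": statistics.mean(data),
        "range": max(data) - min(data),
    }

typical = summarize([2, 4, 4, 7, 8])
skewed = summarize([2, 4, 4, 7, 100])  # same data with one extreme score

print(typical)
print(skewed)
```

The mode and median stay at 4 in both data sets, while the mean and the range change sharply once the extreme score is introduced.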
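Measures of variability (cont.)
  The variance
    Considers the distance of every data item from the mean
  Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
  Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where n-1 is the number of degrees of freedom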
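Measures of variability (cont.)
  The standard deviation
    The most common measure of statistical dispersion
  Standard deviation of a random variable: $\sigma = \sqrt{E[(X - \mu)^2]}$
  Sample standard deviation: $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$, where N-1 is the number of degrees of freedom (d.o.f.)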
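The difference between dividing by N and dividing by N-1 is easy to see on a small data set. The sketch below reuses the scores (2, 4, 4, 7, 8) from the earlier slides and relies on the standard `statistics` module, where `pvariance`/`pstdev` are the population versions and `variance`/`stdev` are the sample versions.

```python
import statistics

scores = [2, 4, 4, 7, 8]  # mean is 5; squared deviations sum to 24

pop_var = statistics.pvariance(scores)   # 24 / 5, divides by N
samp_var = statistics.variance(scores)   # 24 / 4, divides by N - 1
pop_sd = statistics.pstdev(scores)       # square root of the population variance
samp_sd = statistics.stdev(scores)       # square root of the sample variance

print(pop_var, samp_var, round(pop_sd, 3), round(samp_sd, 3))
```

The sample versions are always slightly larger, and the gap shrinks as the number of observations grows.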
Shape of a distribution
  Asymmetrical distribution
    Positively (or right) skewed distribution
    Negatively (or left) skewed distribution
  Symmetrical distribution
    Normal distribution (single modal peak)
      mode = median = mean
      Assumed by many statistical tests in corpus linguistics
  Bimodal distribution
Normal distribution
  A statistical distribution N(μ, σ) with the following probability density function
    $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  Parameters: mean μ and standard deviation σ (the variance is σ²)
  e is a mathematical constant (approximately 2.718)
  Density is bell-shaped, symmetric about μ
  Standard normal distribution: μ = 0, σ = 1
Central limit theorem
  The theorem
    When sufficiently large samples are repeatedly drawn from a population, the means of the samples will be approximately normally distributed around the population mean
    This occurs even if the distribution of the data in the population is not normal
  This makes the normal distribution important
    The distribution of IQ scores
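A quick simulation can make the theorem concrete. The sketch below is an illustration under assumed settings, not something from the slides: the population is an artificial right-skewed (exponential) one, and the sample size of 50 and the 2,000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# A strongly right-skewed artificial population (exponential, mean near 1.0).
population = [random.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.fmean(population)

# Repeatedly draw samples of size n and record each sample's mean.
n, n_samples = 50, 2_000
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
# Theory: the sample means spread around the population mean with
# standard deviation sigma / sqrt(n) (the standard error).
expected_se = statistics.pstdev(population) / n ** 0.5

print(round(pop_mean, 3), round(mean_of_means, 3))
print(round(sd_of_means, 3), round(expected_se, 3))
```

Even though the population itself is far from normal, the sample means cluster around the population mean with a spread close to the standard error.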
Properties of the normal curve
  Shape of the curve is defined by μ and σ
  Important property
    For any normal curve, a vertical line drawn at a given number of standard deviations away from the mean always divides the area under the curve in the same proportions
The z score
  A measure of how far a given value x is from the mean, expressed as a number of standard deviations: $z = \frac{x - \mu}{\sigma}$
  How probable a z score is for any test
    Measured by the proportion of the total area under the tail of the curve which lies beyond a given z value
    Consult the z score table
Example 4
  The mean frequency of "there" in a 1000-word sample written by a given author is 10, with σ = 4
  A sample contains 17 occurrences of "there"
  z score = (17 - 10)/4 = 1.75
  The area beyond the z score of 1.75 is 0.0401, or 4.01% of the total area under the curve
  The probability of seeing a sample with more than 17 occurrences of "there" is 4.01% or less
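Instead of consulting a printed z table, the tail area in Example 4 can be computed directly. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, observed = 10, 4, 17   # values from Example 4

z = (observed - mu) / sigma       # distance from the mean in standard deviations

# Area under the standard normal curve beyond z (the upper tail).
upper_tail = 1 - NormalDist().cdf(z)

print(z, round(upper_tail, 4))
```

The rounded tail area agrees with the table value of 0.0401 quoted in the example.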
Hypothesis testing
  Using descriptive statistics as evidence for or against experimental hypotheses
  The null hypothesis H0
    There is no difference between the sample value and the population from which it was drawn
  The alternative hypothesis H1
    There is a significant difference between the sample value and the population from which it was drawn
  Goal: to reject H0 at a certain level of significance (e.g., 5%)
Hypothesis testing (cont.)
  Use of statistical tests
    Estimates the probability that the claims are wrong
    Enables us to claim statistical significance for our results and have confidence in our claims
  One- and two-tailed tests
    One-tailed: likely direction of difference known
    Two-tailed: nature of the difference not specified
      If using the z score, the proportions in Appendix 1 must be doubled
Comparing Groups Xiaofei Lu APLNG 596D
Outline
  Basic concepts
  Parametric comparisons of two groups
  Non-parametric comparisons of two groups
  Comparisons between three or more groups
Basic concepts
  Types of scales of measurement
  Independent and dependent variables
  Parametric and non-parametric tests
  Population mean
  Between-groups and repeated measures design
  One-sample and two-sample studies
Types of scales of measurement
  Ratio scale: units on the scale are the same and there is a true zero point
    Measurement in meters
  Interval scale: units on the scale are the same, but the zero point is arbitrary
    Centigrade scale of temperature
  Ordinal scale: records order only
    Ranks in a contest
  Nominal scale: categorical data
    Part-of-speech categories
Independent and dependent variables
  Independent variables: what do I change?
  Dependent variables: what do I observe?
  Controlled variables: what do I keep the same?
Two examples
  Effect of education on income
    Independent variable: academic degree of the individual
    Dependent variable: level of income of the individual, measured in monetary units
  Effect of sentence complexity on recall
    Independent variable: sentence complexity
    Dependent variable: amount of sentence correctly recalled
Parametric tests
  Dependent variables are ratio- or interval-scored
  Observations should be independent
  Often assume a normal distribution of the data
    The mean is an appropriate measure of central tendency
    The standard deviation is an appropriate measure of variability
  Work with any distribution with parameters
Non-parametric tests
  Do not assume a normal distribution of the data
  Best for small samples that are not normally distributed
  Work with rank-ordered scales and frequencies
Population mean
  Sampling distribution of means
    A distribution made up of group means
    Describes a symmetric curve
    Group means within a population are closer to each other than individual scores are to the group mean
  Population mean
    The average of a group of means
Experimental design
  Between-groups design
    Data comes from two different groups
  Repeated measures design
    Data is the result of two or more measures taken from the same group
One-sample and two-sample studies
  One-sample studies
    Compare group mean with population mean
    Determine whether group mean differs from population mean
  Two-sample studies
    Compare means from two different groups (experimental and control group)
    Determine whether these means differ for reasons other than pure chance
Parametric comparison of two groups
  The t test for independent samples
  The matched pairs t test
The t test for independent samples
  Tests the difference between two groups
  Normally distributed interval data
  Mean and standard deviation are good measures of central tendency and variability
  Especially useful for small samples (N < 30)
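One-sample t test
  H0: no significant difference between group mean and population mean
  Computing the t statistic (in SPSS): $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
    Standard error of the means: $s / \sqrt{n}$
    $\bar{x}$: mean of the sample group
    μ: population mean
    s: standard deviation of the sample group
    n: sample size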
Corpus linguistics example
  A balanced corpus
    Mean verbs per sentence: 2.5; s.d. = 1.2
  A 100-sentence specialized subcorpus
    Mean verbs per sentence: 3.5; s.d. = 1.6
  t statistic: (3.5 - 2.5)/(1.6/√100) = 1/0.16 = 6.25
Corpus linguistics example (cont.)
  Consult the t table
    Two-tailed test (non-directional)
    Degrees of freedom: n - 1 = 100 - 1 = 99; use the next lower value in the table, 90
    Significance level: go with 0.05 or 0.01
    Critical value: 1.987 (for 0.05) or 2.632 (for 0.01)
  Observed value of t (6.25) is greater than 2.632
  Can reject H0 at the 1 percent significance level
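The arithmetic of this one-sample example can be retraced in a few lines. This is a minimal sketch from summary statistics only; the function name `one_sample_t` is an assumption introduced here, and the critical values are the ones quoted from the t table on the slide rather than computed.

```python
import math

def one_sample_t(sample_mean, pop_mean, sample_sd, n):
    """t statistic: (sample mean - population mean) / standard error of the mean."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error

# Subcorpus: 100 sentences, mean 3.5 verbs per sentence, s.d. 1.6.
# Balanced corpus (treated as the population): mean 2.5 verbs per sentence.
t = one_sample_t(sample_mean=3.5, pop_mean=2.5, sample_sd=1.6, n=100)

# Two-tailed critical values for 90 d.o.f., as quoted on the slide.
CRITICAL_05, CRITICAL_01 = 1.987, 2.632
reject_h0_at_1_percent = abs(t) > CRITICAL_01

print(round(t, 2), reject_h0_at_1_percent)
```

The absolute value of t is compared with the critical value because the test is two-tailed.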
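Two-sample t test
  H0: the difference between the 2 groups is what would be expected for any 2 means in a population due to chance
  Show that the difference falls in the extreme left or right tail of the t distribution
  $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
    The denominator is the standard error of the difference between the means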
Corpus linguistics example
  Number of errors of a specific type in each of 15 equal-length essays
  Control group: 8 essays produced by students learning by traditional methods
  Experimental group: 7 essays produced by students learning by a novel method
Corpus linguistics example (cont.)
  t = (6 - 3)/sqrt((2.27 × 2.27/7) + (2.21 × 2.21/8)) = 2.584
  Degrees of freedom = (8 - 1) + (7 - 1) = 13
  Critical value of t for a two-tailed test at the 5 percent significance level for 13 d.o.f. is 2.16
  Observed t is greater than 2.16; the difference is significant
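The same calculation can be sketched from the summary statistics in the example. The function name `two_sample_t` is introduced here for illustration. With the rounded standard deviations shown on the slide the result comes out at roughly 2.585, marginally different from the slide's 2.584, presumably because the slide used unrounded values.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic: difference of means / standard error of the difference."""
    se_difference = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se_difference

# Summary statistics as they appear in the slide's formula.
t = two_sample_t(mean1=6, sd1=2.27, n1=7, mean2=3, sd2=2.21, n2=8)
degrees_of_freedom = (7 - 1) + (8 - 1)

CRITICAL_05 = 2.16  # two-tailed, 13 d.o.f., as quoted on the slide
significant = abs(t) > CRITICAL_05

print(round(t, 3), degrees_of_freedom, significant)
```

If the raw error counts per essay were available, a library routine for independent-samples t tests could be used instead of working from summary statistics.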
Some caveats
  The matched pairs t test should be used for repeated measures designs (correlated samples)
  A non-parametric test should be used if the data is very skewed and not normally distributed
  A parametric test for comparing 3 or more groups should be used to cross-compare groups
The matched pairs t test
  Comparing paired or correlated samples
    Not independent, but closer to each other than random samples
  A feature observed under 2 different conditions
    Same students tested before and after taking a class
  Pairs of subjects matched according to any characteristic
    Studying husbands and wives rather than random samples
The matched pairs t test (cont.)
  $t = \frac{\sum d_i}{\sqrt{\frac{N \sum d_i^2 - (\sum d_i)^2}{N - 1}}}$
  d_i denotes the difference between the ith pair of observations
  N denotes the number of pairs of observations
Corpus linguistics example
  Lengths of the vowels produced by 10 speakers in two different consonant environments
  t = -2.95; d.o.f. = 9
  Critical value of t for a two-tailed test at the 2 percent significance level for 9 d.o.f. is 2.821
  The absolute value of t (2.95) exceeds 2.821, so the difference is significant at the 2 percent level
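The slide reports only the resulting t value, not the measurements themselves. The sketch below therefore uses hypothetical vowel lengths, chosen for illustration so that they produce a t close to the reported -2.95; they should not be read as the original data. The function name `matched_pairs_t` is likewise an assumption.

```python
import math
import statistics

def matched_pairs_t(first, second):
    """Matched pairs t: sum(d) / sqrt((N * sum(d^2) - sum(d)^2) / (N - 1))."""
    d = [a - b for a, b in zip(first, second)]
    n = len(d)
    sum_d = sum(d)
    sum_d2 = sum(x * x for x in d)
    return sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))

# HYPOTHETICAL vowel lengths for 10 speakers in two consonant environments.
environment_1 = [22, 18, 26, 17, 19, 23, 15, 16, 19, 25]
environment_2 = [26, 22, 27, 15, 24, 27, 17, 20, 17, 30]

t = matched_pairs_t(environment_1, environment_2)

# Cross-check with the equivalent form: mean difference / (s_d / sqrt(N)).
differences = [a - b for a, b in zip(environment_1, environment_2)]
t_check = statistics.mean(differences) / (
    statistics.stdev(differences) / math.sqrt(len(differences))
)

print(round(t, 2), round(t_check, 2))
```

Both forms give the same value, which shows that the matched pairs test is a one-sample t test applied to the pairwise differences.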
Non-parametric comparisons of two groups
  Used in two-sample studies where the assumptions of the t test do not hold
  Between-group design (independent samples)
    The Wilcoxon rank sums test
  Repeated measures design (correlated/paired samples)
    The Wilcoxon matched pairs signed rank test