Simple Statistics for Corpus Linguistics Sean Wallis Survey of English Usage University College London s.wallis@ucl.ac.uk
Outline • Numbers… • A simple research question • do women speak or write more than men in ICE-GB? • p = proportion = probability • Another research question • what happens to speakers’ use of modal shall vs. will over time? • the idea of inferential statistics • plotting confidence intervals • Concluding remarks
Numbers... • We are used to concepts like these being expressed as numbers: • length (distance, height) • area • volume • temperature • wealth (income, assets) • We are going to discuss another concept: • probability • proportion, percentage • a simple idea, at the heart of statistics
Probability • Based on another, even simpler, idea: • probability p = x / n • where • frequency x (often, f) is • the number of times something actually happens • the number of hits in a search • baseline n is • the number of times something could happen • the number of hits • in a more general search • in several alternative patterns (‘alternate forms’) • Probability can range from 0 to 1 • e.g. the probability that the speaker says will instead of shall • x = cases of will • n = total: will + shall
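The definition p = x / n can be sketched in a few lines of Python. The counts below are hypothetical, chosen only to illustrate the arithmetic, and the function name `proportion` is an assumption of this sketch rather than anything from the slides.

```python
def proportion(x, n):
    """Return p = x / n: how often something happened (x)
    out of how often it could have happened (n)."""
    if n <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= x <= n:
        raise ValueError("frequency x must lie between 0 and n")
    return x / n


# Hypothetical counts, for illustration only (not taken from any corpus).
will_hits = 80
shall_hits = 20

# Baseline n is the total of the alternate forms: will + shall.
p_will = proportion(will_hits, will_hits + shall_hits)
print(p_will)
```

Because x can never exceed n, the result always falls between 0 and 1, which is why the function rejects counts outside that range.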
What can a corpus tell us? • A corpus is a source of knowledge about language: • corpus • introspection/observation/elicitation • controlled laboratory experiment • computer simulation • How do these differ in what they might tell us? • A corpus is a sample of language, varying by: • source (e.g. speech vs. writing, age...) • levels of annotation (e.g. parsing) • size (number of words) • sampling method (random sample?) • How does this affect the types of knowledge we might obtain?
What can a parsed corpus tell us? • Three kinds of evidence may be found in a parsed corpus: • Frequency evidence of a particular known rule, structure or linguistic event (How often?) • Factual evidence of new rules, etc. (How novel?) • Interaction evidence of relationships between rules, structures and events (Does X affect Y?) • Lexical searches may also be made more precise using the grammatical analysis
A simple research question • Let us consider the following question: • Do women speak or write more words than men in the ICE-GB corpus? • What do you think? • How might we find out?
Let's get some data • Open ICE-GB with ICECUP • Text Fragment query for words: • “*+<{~PUNC,~PAUSE}>” • counts every word, excluding pauses and punctuation • Variable query: • TEXT CATEGORY = spoken, written • Variable query: • SPEAKER GENDER = f, m, <unknown> • Combine these 3 queries
ICE-GB: gender / written-spoken • Proportion of words in each category spoken/written by women and men • The authors of some texts are unspecified • Some written material may be jointly authored • female/male ratio varies slightly • p(female) = words spoken by women / total words (excluding <unknown>) • [Bar chart: proportion p, on a scale from 0 to 1, of words by female and male speakers in the written, spoken and TOTAL categories]
p = Probability = Proportion • We asked ourselves the following question: • Do women speak or write more words than men in the ICE-GB corpus? • To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known) • The proportion of words produced by women can also be thought of as a probability: • What is the probability that, if we were to pick any random word in ICE-GB (and the gender was known) it would be uttered by a woman?
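As a rough sketch of this calculation, the Python below computes the proportion of words produced by women in each text category and overall, leaving out words where the gender is unknown. The word counts are hypothetical placeholders, not real ICE-GB figures, and the names `word_counts` and `p_female` are assumptions made for illustration.

```python
# Hypothetical word counts, for illustration only (not real ICE-GB figures).
word_counts = {
    "spoken": {"f": 300, "m": 600, "<unknown>": 100},
    "written": {"f": 150, "m": 250, "<unknown>": 100},
}


def p_female(counts):
    """Proportion of words produced by women, out of all words
    where the gender is known (<unknown> is excluded)."""
    known = counts["f"] + counts["m"]
    return counts["f"] / known


# Add the categories together to get the TOTAL row.
total = {
    key: sum(category[key] for category in word_counts.values())
    for key in ("f", "m", "<unknown>")
}

for name, counts in list(word_counts.items()) + [("TOTAL", total)]:
    print(f"{name}: p(female) = {p_female(counts):.3f}")
```

Read as a probability, each value is the chance that a word picked at random from that category, with the gender known, was produced by a woman.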
Another research question • Let us consider the following question: • What happens to modal shall vs. will over time in British English? • Does shall increase or decrease? • What do you think? • How might we find out?
Let's get some data • Open DCPSE with ICECUP • FTF query for first person declarative shall • repeat for will • Corpus Map: • DATE • Do the first set of queries and then drop into Corpus Map
Modal shall vs. will over time • Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE) • [Figure: p(shall | {shall, will}) on the vertical axis, from 0.0 (shall = 0%) to 1.0 (shall = 100%), plotted against year from 1955 to 1995 (Aarts et al. 2013)] • Is shall going up or down?
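The quantity on the vertical axis follows the earlier definition p = x / n, applied separately to each time period. Written out, with f standing for frequency as on the Probability slide, and assuming the baseline is simply the sum of the two alternates:

```latex
\[
p(\textit{shall} \mid \{\textit{shall}, \textit{will}\})
  = \frac{f(\textit{shall})}{f(\textit{shall}) + f(\textit{will})}
\]
```

A value of 1 means every case in that period is shall (shall = 100%), and a value of 0 means every case is will (shall = 0%).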
Is shall going up or down? • Whenever we look at change, we must ask ourselves two things: • What is the change relative to? • Is our observation higher or lower than we might expect? • In this case we ask • Does shall decrease relative to shall + will? • How confident are we in our results? • Is the change big enough to be reproducible?
The idea of a confidence interval • All observations are imprecise • Randomness is a fact of life • Our abilities are finite: • to measure accurately or • reliably classify into types • We need to express caution in citing numbers • Example (from Levin 2013): • 77.27% of uses of think in 1920s data have a literal (‘cogitate’) meaning • Really? Not 77.28, or 77.26? • 77% of uses of think in 1920s data have a literal (‘cogitate’) meaning • Sounds defensible. But how confident can we be in this number? • 77% (66-86%*) of uses of think in 1920s data have a literal (‘cogitate’) meaning • Finally we have a credible range of values, which needs a footnote* to explain how it was calculated.
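The footnote explaining the calculation is not reproduced in these slides, so the following is only a sketch of how such a range might be obtained. It assumes the Wilson score interval, one common method for a proportion, at a 95% confidence level. It also assumes counts of 51 out of 66, which is one pair of numbers consistent with 77.27%. Neither the method nor the counts are confirmed by the source.

```python
import math


def wilson_interval(x, n, z=1.959964):
    """Wilson score interval for an observed proportion x / n.

    z is the two-tailed critical value of the Normal distribution
    (about 1.96 for a 95% confidence level).
    """
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# ASSUMED counts: 51 / 66 = 77.27%. The slides do not state x or n.
x, n = 51, 66
low, high = wilson_interval(x, n)
print(f"{x / n:.0%} ({low:.0%}-{high:.0%})")
```

With these assumed counts the interval comes out at roughly 66% to 86%. A larger n with the same proportion would give a narrower range, which is the sense in which more data makes us more confident.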
The ‘sample’ and the ‘population’ • We said that the corpus was a sample • Previously, we asked about the proportions of male/female words in the corpus (ICE-GB) • We asked questions about the sample • The answers were statements of fact • Now we are asking about “British English” • We want to draw an inference • from the sample (in this case, DCPSE) • to the population (similarly-sampled BrE utterances) • This inference is a best guess • This process is called inferential statistics
Basic inferential statistics • Suppose we carry out an experiment • We toss a coin 10 times and get 5 heads • How confident are we in the results? • Suppose we repeat the experiment • Will we get the same result again?
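The coin experiment is easy to try out in code. The sketch below assumes a fair coin (p = 0.5) and uses Python's standard `random` module with a fixed seed so that a run can be repeated. The function names `toss` and `prob_heads` are illustrative assumptions.

```python
import math
import random


def toss(n, rng, p=0.5):
    """Toss a coin n times and return the number of heads."""
    return sum(rng.random() < p for _ in range(n))


def prob_heads(k, n, p=0.5):
    """Binomial probability of exactly k heads in n tosses."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


rng = random.Random(0)  # fixed seed, so the run is repeatable

# Repeat the 10-toss experiment several times and compare the results.
repeats = [toss(10, rng) for _ in range(8)]
print("heads in each repeat:", repeats)

# Chance that any single repeat gives exactly 5 heads out of 10.
print("P(exactly 5 heads) =", prob_heads(5, 10))
```

Even with a fair coin, the binomial formula gives 252/1024, or about 0.25, as the chance of exactly 5 heads in 10 tosses, so repeating the experiment will usually not reproduce the first result exactly.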