Corpora and Statistical Methods Albert Gatt
In this lecture • We have considered distributions of words and lexical variation in corpora. • Today we consider collocations: • definition and characteristics • measures of collocational strength • experiments on corpora • hypothesis testing
Part 1 Collocations: Definition and characteristics
A motivating example • Consider phrases such as: • strong tea vs. ?powerful tea • strong support vs. ?powerful support • powerful drug vs. ?strong drug • Traditional semantic theories have difficulty accounting for these patterns. • strong and powerful seem near-synonyms • do we claim they have different senses? • what is the crucial difference?
The empiricist view of meaning • Firth’s view (1957): • “You shall know a word by the company it keeps” • This is a contextual view of meaning, akin to that espoused by Wittgenstein (1953). • In the Firthian tradition, attention is paid to patterns that crop up with regularity in language. • Contrast symbolic/rationalist approaches, emphasising polysemy, componential analysis, etc. • Statistical work on collocations tends to follow this tradition.
Defining collocations • “Collocations … are statements of the habitual or customary places of [a] word.” (Firth 1957) • Characteristics/Expectations: • regular/frequently attested; • occur within a narrow window (span of a few words); • not fully compositional; • non-substitutable; • non-modifiable; • display category restrictions
Frequency and regularity • We know that language is regular (non-random) and rule-based. • this aspect is emphasised by rationalist approaches to grammar • We also need to acknowledge that frequency of usage is an important factor in language development. • why do big and large collocate differently with different nouns?
Regularity/frequency • f(strong tea) > f(powerful tea) • f(credit card) > f(credit bankruptcy) • f(white wine) > f(yellow wine) • (even though white wine is actually yellowish)
Narrow window (textual proximity) • Usually, we specify an n-gram window within which to analyse collocations: • bigram: credit card, credit crunch • trigram: credit card fraud, credit card expiry • … • The idea is to look at co-occurrence of words within a specific n-gram window • We can also count n-grams with intervening words: • federal (.*) subsidy • matches: federal subsidy, federal farm subsidy, federal manufacturing subsidy…
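A rough Python sketch of both ideas on this slide: counting n-grams within a fixed window, and matching n-grams with intervening words using the federal (.*) subsidy pattern above. The toy token list is invented for illustration.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# invented toy input; in practice this would be a tokenised corpus
tokens = ("the federal farm subsidy and the federal subsidy "
          "for credit card fraud").split()

bigram_counts = Counter(ngrams(tokens, 2))    # e.g. (credit, card)
trigram_counts = Counter(ngrams(tokens, 3))   # e.g. (credit, card, fraud)
print(bigram_counts[("credit", "card")], trigram_counts[("credit", "card", "fraud")])

# n-grams with intervening words: federal (.*) subsidy
pattern = re.compile(r"\bfederal\b(?:\s+\w+)*?\s+subsidy\b")
print(pattern.findall(" ".join(tokens)))
# ['federal farm subsidy', 'federal subsidy']
```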
Textual proximity (continued) • Usually collocates of a word occur close to that word. • may still occur across a span • Examples: • bigram: white wine, powerful tea • > bigram: knock on the door; knock on X’s door
Non-compositionality • white wine • not really “white”, meaning not fully predictable from component words + syntax • signal interpretation • a term used in Intelligent Signal Processing: connotations go beyond compositional meaning • Similarly: • regression coefficient • good practice guidelines • Extreme cases: • idioms such as kick the bucket • meaning is completely frozen
Non-substitutability • If a phrase is a collocation, we can’t substitute a near-synonym for one of its words and still have the same overall meaning. • E.g.: • white wine vs. yellow wine • powerful tea vs. strong tea • …
Non-modifiability • Often, there are restrictions on inserting additional lexical items into the collocation, especially in the case of idioms. • Example: • kick the bucket vs. ?kick the large bucket • NB: • this is a matter of degree! • non-idiomatic collocations are more flexible
Category restrictions • Frequency alone doesn’t indicate collocational strength: • by the is a very frequent phrase in English • not a collocation • Collocations tend to be formed from content words: • A+N: powerful tea • N+N: regression coefficient, mass demonstration • N+PREP+N: degrees of freedom
Collocations “in a broad sense” • In many statistical NLP applications, the term collocation is quite broadly understood: • any phrase which is frequent/regular enough… • proper names (New York) • compound nouns (elevator operator) • set phrases (part of speech) • idioms (kick the bucket)
Why are collocations interesting? • Several applications need to “know” about collocations: • terminology extraction: technical or domain-specific phrases crop up frequently in text (oil prices) • document classification: specialist phrases are good indicators of the topic of a text • named entity recognition: names such as New York tend to occur together frequently; phrases like new toy don’t
Example application: Parsing • She spotted the man with a pair of binoculars • (1) [VP spotted [NP the man [PP with a pair of binoculars]]] • (2) [VP spotted [NP the man] [PP with a pair of binoculars]] • A parser might prefer (2) if spot/binoculars are frequent co-occurrences in a window of a certain width.
Example application: Generation • NLG systems often need to map a semantic representation to a lexical/syntactic one. • Shouldn’t use the wrong adjective-noun combinations: clean face vs. ?immaculate face • Lapata et al. (1999): • experiment asking people to rate different adjective-noun combinations • frequency of the combination was a strong predictor of people’s preferences • argue that NLG systems need to be able to make contextually-informed decisions in lexical choice
Frequency-based approach • Motivation: • if two (or three, or…) words occur together a lot within some window, they’re a collocation • Problems: • frequent “collocations” under this definition include with the, onto a, etc. • not very interesting…
Improving the frequency-based approach • Justeson & Katz (1995): • part of speech filter • only look at word combinations of the “right” category: • N + N: regression coefficient • N + PREP + N: jack in (the) box • … • dramatically improves the results • content-word combinations more likely to be phrases
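A minimal sketch of such a filter, assuming the text has already been POS-tagged; the tiny tagged list and the chosen tag patterns are illustrative, not Justeson & Katz’s exact pattern set.

```python
from collections import Counter

# invented toy POS-tagged text: (word, tag) pairs, Penn-style tags
tagged = [("strong", "JJ"), ("tea", "NN"), ("by", "IN"), ("the", "DT"),
          ("regression", "NN"), ("coefficient", "NN"), ("of", "IN"),
          ("a", "DT"), ("mass", "NN"), ("demonstration", "NN")]

# keep only content-word bigrams (A+N, N+N), in the spirit of a Justeson & Katz filter
GOOD_PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NN"), ("NN", "NNS")}

candidates = Counter(
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if (t1, t2) in GOOD_PATTERNS
)
print(candidates.most_common())
# keeps (strong, tea), (regression, coefficient), (mass, demonstration);
# drops (by, the), (of, a), etc.
```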
Case study: strong vs. powerful • See: Manning & Schütze ’99, Sec 5.2 • Motivation: • try to distinguish the meanings of two quasi-synonyms • data from New York Times corpus • Basic strategy: • find all bigrams <w1, w2> where w1 = strong or powerful • apply POS filter to remove strong on [crime], powerful in [industry] etc.
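A sketch of this basic strategy, again over pre-tagged text; the tagged snippet below is invented, and the counts on the next slide come from the New York Times data, not from this toy input.

```python
from collections import Counter, defaultdict

def target_bigrams(tagged, targets=("strong", "powerful")):
    """Count bigrams <w1, w2> where w1 is a target adjective and
    w2 is tagged as a noun (the POS filter from the previous slide)."""
    counts = defaultdict(Counter)
    for (w1, _), (w2, t2) in zip(tagged, tagged[1:]):
        if w1 in targets and t2.startswith("NN"):
            counts[w1][w2] += 1
    return counts

tagged = [("strong", "JJ"), ("support", "NN"), (",", ","),
          ("powerful", "JJ"), ("computers", "NNS"), (",", ","),
          ("strong", "JJ"), ("on", "IN"), ("crime", "NN")]
counts = target_bigrams(tagged)
print(counts["strong"].most_common())    # [('support', 1)] -- "strong on crime" is filtered out
print(counts["powerful"].most_common())  # [('computers', 1)]
```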
Case study (cont’d) • Sample results from Manning & Schütze ’99: • f(strong support) = 50 • f(strong supporter) = 10 • f(powerful force) = 13 • f(powerful computers) = 10 • Teaser: • would you also expect powerful supporter? • what’s the difference between strong supporter and powerful supporter?
Limitations of frequency-based search • Only works for fixed phrases • But collocations can be “looser”, allowing interpolation of other words. • knock on [the, X’s, a] door • pull [a] punch • Simple frequency won’t do for these: different interpolated words dilute the frequency.
Using mean and variance • General idea: include bigrams even at a distance: w1 X w2 (e.g. pull a punch) • Strategy: • find co-occurrences of the two words in windows of varying length • compute mean offset between w1 and w2 • compute variance of offset between w1 and w2 • if offsets are randomly distributed, then we have high variance and conclude that <w1,w2> is not a collocation
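A minimal sketch of the mean/variance idea; the window size, tokenised input and word pair are all illustrative.

```python
from statistics import mean, stdev

def offsets(tokens, w1, w2, window=5):
    """Signed distances from each occurrence of w1 to each occurrence
    of w2 that falls within +/- window positions."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return [j - i for i in pos1 for j in pos2 if 0 < abs(j - i) <= window]

tokens = "she went to knock on the door then knock twice on his door".split()
d = offsets(tokens, "knock", "door")
print(mean(d), stdev(d))
# offsets that cluster tightly around one value (low variance) suggest a
# collocation; offsets spread across the window suggest chance co-occurrence
```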
Example outcomes (M&S ’99) • position of strong wrt opposition • mean = -1.15, standard dev = 0.67 • i.e. most occurrences are strong […] opposition • position of strong wrt for • mean = -1.12, standard dev = 2.15 • i.e. for occurs anywhere around strong; the SD is larger than the magnitude of the mean • can get strong support for, for the strong support, etc.
More limitations of frequency • If we use simple frequency or mean & variance, we have a good way of ranking likely collocations. • But how do we know if a frequent pattern is frequent enough? Is it above what would be predicted by chance? • We need to think in terms of hypothesis-testing. • Given <w1,w2>, we want to compare: • The hypothesis that they are non-independent. • The hypothesis that they are independent.
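To preview the comparison with a sketch: under the independence hypothesis the expected bigram count follows directly from the individual word frequencies, and hypothesis testing asks whether the observed count exceeds that expectation by more than chance variation would allow. All counts below are invented for illustration.

```python
# invented counts for a hypothetical corpus
N = 1_000_000     # total bigrams in the corpus
c_w1 = 2_000      # f(w1)
c_w2 = 3_000      # f(w2)
observed = 50     # f(<w1, w2>), the observed bigram count

# Under H0 (independence): P(w1 w2) = P(w1) * P(w2)
p_w1, p_w2 = c_w1 / N, c_w2 / N
expected = N * p_w1 * p_w2   # = 6 co-occurrences expected by chance

print(f"expected under independence: {expected:.1f}, observed: {observed}")
# a hypothesis test quantifies whether this gap is statistically significant
```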