
Corpora and Statistical Methods




Presentation Transcript


  1. Corpora and Statistical Methods Albert Gatt

  2. In this lecture • We have considered distributions of words and lexical variation in corpora. • Today we consider collocations: • definition and characteristics • measures of collocational strength • experiments on corpora • hypothesis testing Corpora and Statistical Methods

  3. Part 1 Collocations: Definition and characteristics

  4. A motivating example • Consider phrases such as: • strong tea vs. ?powerful tea • strong support vs. ?powerful support • powerful drug vs. ?strong drug • Traditional semantic theories have difficulty accounting for these patterns. • strong and powerful seem near-synonyms • do we claim they have different senses? • what is the crucial difference? Corpora and Statistical Methods

  5. The empiricist view of meaning • Firth’s view (1957): • “You shall know a word by the company it keeps” • This is a contextual view of meaning, akin to that espoused by Wittgenstein (1953). • In the Firthian tradition, attention is paid to patterns that crop up with regularity in language. • Contrast symbolic/rationalist approaches, emphasising polysemy, componential analysis, etc. • Statistical work on collocations tends to follow this tradition. Corpora and Statistical Methods

  6. Defining collocations • “Collocations … are statements of the habitual or customary places of [a] word.” (Firth 1957) • Characteristics/Expectations: • regular/frequently attested; • occur within a narrow window (a span of a few words); • not fully compositional; • non-substitutable; • non-modifiable; • display category restrictions Corpora and Statistical Methods

  7. Frequency and regularity • We know that language is regular (non-random) and rule-based. • this aspect is emphasised by rationalist approaches to grammar • We also need to acknowledge that frequency of usage is an important factor in language development. • why do big and large collocate differently with different nouns? Corpora and Statistical Methods

  8. Regularity/frequency • f(strong tea) > f(powerful tea) • f(credit card) > f(credit bankruptcy) • f(white wine) > f(yellow wine) • (even though white wine is actually yellowish) Corpora and Statistical Methods

  9. Narrow window (textual proximity) • Usually, we specify an n-gram window within which to analyse collocations: • bigram: credit card, credit crunch • trigram: credit card fraud, credit card expiry • … • The idea is to look at co-occurrence of words within a specific n-gram window • We can also count n-grams with intervening words: • federal (.*) subsidy • matches: federal subsidy, federal farm subsidy, federal manufacturing subsidy… Corpora and Statistical Methods
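As a rough illustration of both ideas on this slide (contiguous n-gram windows and a pattern with intervening words), here is a minimal Python sketch; the toy sentence and the limit of at most two intervening words are invented for the example:

```python
import re

# Toy text, invented for illustration
text = ("the federal subsidy was cut but the federal farm subsidy "
        "and the federal manufacturing subsidy survived")
tokens = text.split()

# Contiguous n-gram windows
bigrams = list(zip(tokens, tokens[1:]))               # e.g. ('federal', 'subsidy')
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))  # e.g. ('federal', 'farm', 'subsidy')
print(bigrams[:3])

# Pattern in the spirit of "federal (.*) subsidy": allow 0-2 intervening words
pattern = re.compile(r"\bfederal(?: \w+){0,2} subsidy\b")
print(pattern.findall(text))
# -> ['federal subsidy', 'federal farm subsidy', 'federal manufacturing subsidy']
```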

  10. Textual proximity (continued) • Usually collocates of a word occur close to that word. • may still occur across a span • Examples: • bigram: white wine, strong tea • > bigram: knock on the door; knock on X’s door Corpora and Statistical Methods

  11. Non-compositionality • white wine • not really “white”, meaning not fully predictable from component words + syntax • signal interpretation • a term used in Intelligent Signal Processing: connotations go beyond compositional meaning • Similarly: • regression coefficient • good practice guidelines • Extreme cases: • idioms such as kick the bucket • meaning is completely frozen Corpora and Statistical Methods

  12. Non-substitutability • If a phrase is a collocation, we can’t substitute a word in the phrase for a near-synonym, and still have the same overall meaning. • E.g.: • white wine vs. yellow wine • powerful tea vs. strong tea • … Corpora and Statistical Methods

  13. Non-modifiability • Often, there are restrictions on inserting additional lexical items into the collocation, especially in the case of idioms. • Example: • kick the bucket vs. ?kick the large bucket • NB: • this is a matter of degree! • non-idiomatic collocations are more flexible Corpora and Statistical Methods

  14. Category restrictions • Frequency alone doesn’t indicate collocational strength: • by the is a very frequent phrase in English • not a collocation • Collocations tend to be formed from content words: • A+N: strong tea • N+N: regression coefficient, mass demonstration • N+PREP+N: degrees of freedom Corpora and Statistical Methods

  15. Collocations “in a broad sense” • In many statistical NLP applications, the term collocation is quite broadly understood: • any phrase which is frequent/regular enough… • proper names (New York) • compound nouns (elevator operator) • set phrases (part of speech) • idioms (kick the bucket) Corpora and Statistical Methods

  16. Why are collocations interesting? • Several applications need to “know” about collocations: • terminology extraction: technical or domain-specific phrases crop up frequently in text (oil prices) • document classification: specialist phrases are good indicators of the topic of a text • named entity recognition: names such as New York tend to occur together frequently; phrases like new toy don’t Corpora and Statistical Methods

  17. Example application: Parsing • She spotted the man with a pair of binoculars • (1) [VP spotted [NP the man [PP with a pair of binoculars]]] • (2) [VP spotted [NP the man] [PP with a pair of binoculars]] • Parser might prefer (2), i.e. attaching the PP to the verb, if spot/binoculars are frequent co-occurrences in a window of a certain width. Corpora and Statistical Methods

  18. Example application: Generation • NLG systems often need to map a semantic representation to a lexical/syntactic one. • Shouldn’t use the wrong adjective-noun combinations: clean face vs. ?immaculate face • Lapata et al. (1999): • experiment asking people to rate different adjective-noun combinations • frequency of the combination was a strong predictor of people’s preferences • argue that NLG systems need to be able to make contextually-informed decisions in lexical choice Corpora and Statistical Methods

  19. Finding collocations in corpora: basic methods

  20. Frequency-based approach • Motivation: • if two (or three, or…) words occur together a lot within some window, they’re a collocation • Problems: • frequent “collocations” under this definition include with the, onto a, etc. • not very interesting… Corpora and Statistical Methods
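A minimal sketch of this naive frequency approach; the toy token list is invented for illustration, and in practice `tokens` would come from a real corpus:

```python
from collections import Counter

# Toy token list standing in for a real corpus
tokens = ("he sat with the team and spoke with the press about the strong "
          "support of the fans and the strong support of the board").split()

# Rank all adjacent word pairs by raw frequency
bigram_counts = Counter(zip(tokens, tokens[1:]))
for (w1, w2), freq in bigram_counts.most_common(6):
    print(w1, w2, freq)
# Function-word pairs like "with the" and "of the" rank just as high as
# "strong support" -- exactly the problem noted on this slide.
```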

  21. Improving the frequency-based approach • Justeson & Katz (1995): • part of speech filter • only look at word combinations of the “right” category: • N + N: regression coefficient • N + PREP + N: jack in (the) box • … • dramatically improves the results • content-word combinations more likely to be phrases Corpora and Statistical Methods
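A sketch of this kind of POS filter, assuming tagged input. The hand-tagged toy data stands in for the output of a real tagger (e.g. nltk.pos_tag), and only the A+N and N+N patterns are shown:

```python
from collections import Counter

# (word, Penn Treebank tag) pairs; hand-tagged toy data standing in for
# the output of a POS tagger
tagged = [("the", "DT"), ("regression", "NN"), ("coefficient", "NN"),
          ("was", "VBD"), ("computed", "VBN"), ("for", "IN"),
          ("the", "DT"), ("regression", "NN"), ("coefficient", "NN"),
          ("of", "IN"), ("degrees", "NNS"), ("of", "IN"), ("freedom", "NN")]

NOUN = {"NN", "NNS", "NNP", "NNPS"}   # noun tags
ADJ = {"JJ", "JJR", "JJS"}            # adjective tags

def keep(t1, t2):
    """Justeson & Katz style filter: keep only A+N or N+N tag patterns."""
    return (t1 in ADJ or t1 in NOUN) and t2 in NOUN

filtered = Counter((w1, w2)
                   for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                   if keep(t1, t2))
print(filtered.most_common())
# [(('regression', 'coefficient'), 2)] -- "the regression", "was computed",
# "coefficient of" etc. are all filtered out
```

The N + PREP + N pattern (jack in the box, degrees of freedom) would be added in the same way over tagged trigrams, and the same count-then-filter step underlies the strong/powerful case study on the next slide.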

  22. Case study: strong vs. powerful • See: Manning & Schütze ’99, Sec 5.2 • Motivation: • try to distinguish the meanings of two quasi-synonyms • data from New York Times corpus • Basic strategy: • find all bigrams <w1, w2> where w1 = strong or powerful • apply POS filter to remove strong on [crime], powerful in [industry] etc. Corpora and Statistical Methods

  23. Case study (cont/d) • Sample results from Manning & Schütze ’99: • f(strong support) = 50 • f(strong supporter) = 10 • f(powerful force) = 13 • f(powerful computers) = 10 • Teaser: • would you also expect powerful supporter? • what’s the difference between strong supporter and powerful supporter? Corpora and Statistical Methods

  24. Limitations of frequency-based search • Only works for fixed phrases • But collocations can be “looser”, allowing interpolation of other words. • knock on [the, X’s, a] door • pull [a] punch • Simple frequency won’t do for these: different interpolated words dilute the frequency. Corpora and Statistical Methods

  25. Using mean and variance • General idea: include bigrams even at a distance: w1 X w2, where X is intervening material (e.g. pull [a] punch) • Strategy: • find co-occurrences of the two words in windows of varying length • compute mean offset between w1 and w2 • compute variance of offset between w1 and w2 • if offsets are randomly distributed, then we have high variance and conclude that <w1,w2> is not a collocation Corpora and Statistical Methods
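A sketch of these offset statistics; the ±4 window, the toy token list and the function name are assumptions made for illustration:

```python
from statistics import mean, stdev

def offsets(tokens, w1, w2, window=4):
    """Signed offsets of w2 relative to each occurrence of w1, looking
    at most `window` positions to either side."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] == w2:
                out.append(j - i)
    return out

# Toy data: "knock" and "door" co-occur at small, fairly stable positive offsets
tokens = ("she will knock on the door today . later they knock on his door "
          "again . i heard a knock at the old door").split()

d = offsets(tokens, "knock", "door")
print(d, round(mean(d), 2), round(stdev(d), 2))   # [3, 3, 4] 3.33 0.58
# Low variance of the offsets -> likely a (loose) collocation even though the
# words are not always adjacent; offsets scattered over the window -> chance.
```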

  26. Example outcomes (M&S ’99) • position of strong wrt opposition • mean = -1.15, standard dev = 0.67 • i.e. most occurrences are strong […] opposition • position of strong wrt for • mean = -1.12, standard dev = 2.15 • i.e. for occurs anywhere around strong; the SD is larger than the mean offset. • can get strong support for, for the strong support, etc. Corpora and Statistical Methods

  27. More limitations of frequency • If we use simple frequency or mean & variance, we have a good way of ranking likely collocations. • But how do we know if a frequent pattern is frequent enough? Is it above what would be predicted by chance? • We need to think in terms of hypothesis-testing. • Given <w1,w2>, we want to compare: • The hypothesis that they are non-independent. • The hypothesis that they are independent. Corpora and Statistical Methods
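Under the independence hypothesis, P(w1 w2) = P(w1) P(w2), so the expected bigram count is N · P(w1) · P(w2). A small sketch of that comparison with invented counts (all numbers here are hypothetical, chosen only to show the arithmetic):

```python
# Hypothetical corpus statistics (invented for illustration)
N = 1_000_000          # total bigram positions in the corpus
c_w1 = 2_000           # occurrences of w1 (e.g. "strong")
c_w2 = 3_000           # occurrences of w2 (e.g. "support")
c_w1w2 = 150           # observed occurrences of the bigram <w1, w2>

# Under H0 (independence): P(w1 w2) = P(w1) * P(w2)
p_independent = (c_w1 / N) * (c_w2 / N)
expected = p_independent * N   # expected bigram count if w1 and w2 were independent
observed = c_w1w2

print(expected, observed)      # 6.0 vs 150
# The observed count is far above the chance expectation; whether that gap is
# statistically significant is what the hypothesis tests in the next part decide.
```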
