
Collocations


Presentation Transcript


  1. Collocations

  2. Definition of Collocation (wrt Corpus Literature) • A collocation is defined as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. [Choueka, 1988]

  3. Word Collocations • Collocation • Firth: “a word is characterized by the company it keeps”; collocations of a given word are statements of the habitual or customary places of that word. • non-compositionality of meaning • the meaning cannot be derived directly from the parts (heavy rain) • non-substitutability in context • near-synonyms cannot be substituted for the parts (make a decision) • non-modifiability (and non-transformability) • *kick the yellow bucket; *take exceptions to

  4. Collocations • Collocations are not necessarily adjacent • Collocations cannot be directly translated into other languages.

  5. Example Classes • Names • Technical Terms • “Light” Verb Constructions • Phrasal verbs • Noun Phrases

  6. Linguistic Subclasses of Collocations • Light verbs: verbs with little semantic content like make, take, do • Terminological Expressions: concepts and objects in technical domains (e.g., hard drive) • Idioms: fixed phrases • kick the bucket, birds-of-a-feather, run for office • Proper names: difficult to recognize even with lists • Tuesday (person’s name), May, Winston Churchill, IBM, Inc. • Numerical expressions • containing “ordinary” words • Monday Oct 04 1999, two thousand seven hundred fifty • Verb particle constructions or Phrasal Verbs • Separable parts: • look up, take off, tell off

  7. Collocation Detection Techniques • Selection of Collocations by Frequency • Selection of Collocation based on Mean and Variance of the distance between focal word and collocating word. • Hypothesis Testing • Pointwise Mutual Information

  8. Frequency • Technique: • Count how often each bigram occurs in the corpus (see the sketch below) • Extract the top counts and report them as candidates • Results: • Corpus: New York Times, August – November 1990 • Top bigrams are dominated by function-word pairs (of the, in the, ...): extremely uninteresting
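A minimal sketch of this counting step (plain Python; `tokens` is assumed to be a pre-tokenized list of words from the corpus):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count how often each pair of adjacent words occurs."""
    return Counter(zip(tokens, tokens[1:]))

# Report the most frequent bigrams as collocation candidates.
tokens = "the quick brown fox jumped over the lazy dog".split()
for (w1, w2), count in bigram_counts(tokens).most_common(5):
    print(w1, w2, count)
```

On real text, raw frequency alone mostly surfaces function-word pairs, which motivates the tag filter on the next slide.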

  9. Frequency with Tag Filters • Technique: • Count how often each bigram occurs • Tag the candidates with parts of speech • Pass all candidates through a POS-pattern filter, keeping only those that match (e.g., adjective–noun, noun–noun); see the sketch below • Extract the top counts and report them as candidates
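A sketch of the filtering step, assuming the corpus has already been POS-tagged into (word, tag) pairs with Penn Treebank-style tags; the adjective/noun patterns below are the usual Justeson & Katz-style choice, not something fixed by the slides:

```python
from collections import Counter

def coarse(tag):
    """Map a Penn Treebank tag to A (adjective), N (noun), or O (other)."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    return "O"

# Bigram POS patterns kept by the filter: adjective-noun and noun-noun.
KEEP = {("A", "N"), ("N", "N")}

def filtered_bigram_counts(tagged_tokens):
    """Count adjacent pairs whose POS pattern passes the filter."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (coarse(t1), coarse(t2)) in KEEP:
            counts[(w1, w2)] += 1
    return counts
```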

  10. Frequency with Tag Filters Results

  11. Mean and Variance (Smadja et al., 1993) • Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible (although regular) relationships. For example, • Knock and door may not occur at a fixed distance from each other • One method of detecting these flexible relationships uses the mean and variance of the offset (signed distance) between the two words in the corpus.

  12. Mean, Sample Variance, and Standard Deviation
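For reference, the standard definitions these quantities refer to, for observed offsets d_1, ..., d_n:

```latex
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2,
\qquad
s = \sqrt{s^2}
```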

  13. Example: Knock and Door • She knocked on his door. • They knocked at the door. • 100 women knocked on the big red door. • A man knocked on the metal front door. • Offsets between knock and door: 3, 3, 5, 5 • Mean offset: (3 + 3 + 5 + 5)/4 = 4 • Sample variance: ((3−4)² + (3−4)² + (5−4)² + (5−4)²)/(4−1) = 4/3 ≈ 1.33 • Standard deviation: √(4/3) ≈ 1.15

  14. Mean and Variance • Technique (bigram at a distance; a small sketch follows this slide): • Produce all possible pairs within a window • Consider all pairs in the window as candidates • Record the offset of one word from the other for each occurrence • Count the number of times each candidate occurs • Measures: • Mean: average offset (possibly negative); indicates the typical positional relationship between the two words • Standard deviation s of the offset: variability in the relative position of the two words (a low s suggests the pair occurs at a fairly fixed distance, i.e., a good collocation candidate)
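A sketch of the candidate generation and offset statistics (plain Python; the window of 3 words on each side matches the illustration on the next slide, everything else is illustrative):

```python
from collections import defaultdict
from statistics import mean, stdev

def offset_stats(tokens, window=3):
    """For every word pair within the window, collect signed offsets and
    report (mean offset, standard deviation, number of co-occurrences)."""
    offsets = defaultdict(list)
    for i, w1 in enumerate(tokens):
        for d in range(-window, window + 1):
            j = i + d
            if d == 0 or not (0 <= j < len(tokens)):
                continue
            offsets[(w1, tokens[j])].append(d)
    return {pair: (mean(ds), stdev(ds) if len(ds) > 1 else 0.0, len(ds))
            for pair, ds in offsets.items()}

# Pairs with many co-occurrences and a low standard deviation are good
# candidates for flexible (non-adjacent) collocations.
```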

  15. Mean and Variance Illustration • Candidate Generation example: • Window: 3 • Used to find collocations with long-distance relationships

  16. Mean and Variance Collocations

  17. Hypothesis Testing: Overview • Two (or more) words co-occur a lot • Is the candidate a true collocation, or just a (not-at-all-interesting) chance co-occurrence of two frequent words?

  18. The t test Intuition • Intuition: • Compute the frequency expected by chance and check whether the observed frequency is significantly higher • Under H0, the corpus behaves like a random permutation of its words • Ask how much more frequent the observed bigram is than such permutations would predict • Assumptions: • H0 is the null hypothesis (the words occur independently) • P(w1, w2) = P(w1) P(w2) • The counts are (approximately) normally distributed

  19. The t test Formula • Measures (sketch below): • x̄ = observed bigram probability = C(w1 w2)/N • μ = mean under H0 = P(w1) P(w2) • s² ≈ x̄, since p(1 − p) ≈ p for small p • N = total number of bigrams • Result: • A number to look up in a t table • Degree of confidence that the collocation is not created by chance • α = the significance level at which one can reject H0
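A sketch of the t statistic for a single bigram, built from the quantities listed above (c1 = C(w1), c2 = C(w2), c12 = C(w1 w2), N = number of bigrams); the counts in the example call are hypothetical, chosen only for illustration:

```python
import math

def t_score(c1, c2, c12, N):
    """t = (x_bar - mu) / sqrt(s^2 / N), with x_bar = c12/N,
    mu = (c1/N) * (c2/N) under H0, and s^2 approximated by x_bar."""
    x_bar = c12 / N            # observed bigram probability
    mu = (c1 / N) * (c2 / N)   # expected probability under independence
    s2 = x_bar                 # since p(1 - p) ~ p for small p
    return (x_bar - mu) / math.sqrt(s2 / N)

# Hypothetical counts; compare the result against a t table
# (e.g., 2.576 for alpha = 0.005).
print(t_score(c1=15000, c2=5000, c12=8, N=14_000_000))
```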

  20. The t test Sample Findings

  21. The t test Criticism • Words are not normally distributed • Can reject valid collocation • Not good on sparse data

  22. χ² Intuition • Pearson’s chi-square test • Intuition: • Compare the observed frequencies with the frequencies expected under independence • Assumptions: • Does not assume normally distributed probabilities; appropriate as long as the sample (the expected cell counts) is not too small

  23. χ² General Formula • χ² = Σi,j (Oij − Eij)² / Eij • Measures: • Eij = expected count for cell (i, j) under independence • Oij = observed count for cell (i, j) • Result: • A number to look up in a χ² table (like the t test) • Degree of confidence (α) with which H0 can be rejected

  24. χ² Bigram Method and Formula • Technique for bigrams: • Arrange the bigram counts in a 2×2 contingency table: w1 vs. ¬w1 against w2 vs. ¬w2 • Formula: the 2×2 shortcut shown below • Oij: i = column; j = row
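For such a 2×2 table the general sum over cells reduces to the usual shortcut, with N the total number of bigrams:

```latex
\chi^2 = \frac{N\,(O_{11}O_{22} - O_{12}O_{21})^2}
              {(O_{11}+O_{12})\,(O_{11}+O_{21})\,(O_{12}+O_{22})\,(O_{21}+O_{22})}
```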

  25. χ² Sample Findings • Comparing corpora • Machine Translation • Comparison of (English) “cow” and (French) “vache” gives χ² = 456400 • Similarity of two corpora

  26. χ² Criticism • Not good for small datasets (unreliable when the expected counts are small)

  27. Likelihood Ratios Within a Single Corpus (Dunning, 1993) • Likelihood ratios are more appropriate for sparse data than the chi-square test. In addition, they are easier to interpret than the chi-square statistic. • In applying the likelihood ratio test to collocation discovery, we consider two alternative explanations for the occurrence frequency of a bigram w1 w2: • H1: The occurrence of w2 is independent of the previous occurrence of w1: P(w2 | w1) = P(w2 | ¬w1) = p • H2: The occurrence of w2 is dependent on the previous occurrence of w1: p1 = P(w2 | w1) ≠ P(w2 | ¬w1) = p2

  28. Likelihood Ratios Within a Single Corpus • Use the MLEs for p, p1, and p2 and assume a binomial distribution (c1 = C(w1), c2 = C(w2), c12 = C(w1 w2), N = corpus size): • Under H1: p = P(w2 | w1) = P(w2 | ¬w1) = c2/N • Under H2: p1 = P(w2 | w1) = c12/c1, p2 = P(w2 | ¬w1) = (c2 − c12)/(N − c1) • Under H1: b(c12; c1, p) is the likelihood that c12 of the c1 occurrences of w1 are followed by w2, and b(c2 − c12; N − c1, p) is the likelihood that c2 − c12 of the N − c1 words other than w1 are followed by w2 • Under H2: the same two binomial terms, but with p1 and p2 in place of p

  29. Likelihood Ratios Within a Single Corpus • The likelihood of H1 (independence): • L(H1) = b(c12; c1, p) b(c2 − c12; N − c1, p) • The likelihood of H2 (dependence): • L(H2) = b(c12; c1, p1) b(c2 − c12; N − c1, p2) • The log of the likelihood ratio: • log λ = log [L(H1) / L(H2)] = log b(c12; c1, p) + log b(c2 − c12; N − c1, p) − log b(c12; c1, p1) − log b(c2 − c12; N − c1, p2) • The quantity −2 log λ is asymptotically χ² distributed, so we can test for significance.
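A sketch of the ratio for one bigram, using the counts from the previous slide; the binomial coefficients cancel in the ratio, so only the k·log p + (n − k)·log(1 − p) part is kept:

```python
import math

def log_b(k, n, p):
    """Log binomial likelihood, dropping the binomial coefficient
    (it cancels in the likelihood ratio)."""
    if p == 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p == 1.0:
        return 0.0 if k == n else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def log_likelihood_ratio(c1, c2, c12, N):
    """-2 log lambda for the bigram w1 w2 (asymptotically chi-square)."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    log_lambda = (log_b(c12, c1, p) + log_b(c2 - c12, N - c1, p)
                  - log_b(c12, c1, p1) - log_b(c2 - c12, N - c1, p2))
    return -2.0 * log_lambda
```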

  30. [Pointwise] Mutual Information (I) • Intuition: • Given a candidate collocation (w1, w2) and an observation of w1 • I(w1; w2) indicates how much more likely it is that we will see w2 • The same measure also works in reverse (observe w2) • Assumptions: • Data is not sparse

  31. Mutual Information Formula • Measures: • P(w1) = unigram prob. • P(w1w2) = bigram prob. • P (w2|w1) = probability of w2 given we see w1 • Result: • Number indicating increased confidence that we will see w2 after w1
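The pointwise mutual information these measures feed into is the standard definition:

```latex
I(w_1, w_2) = \log_2 \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)}
            = \log_2 \frac{P(w_2 \mid w_1)}{P(w_2)}
```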

  32. Mutual Information Criticism • A better measure of the independence of two words than of the dependence of one word on another • Performs very poorly on sparse data: rare word pairs receive inflated scores

  33. Applications • Collocations are useful in: • Comparison of Corpora • Parsing • New Topic Detection • Computational Lexicography • Natural Language Generation • Machine Translation

  34. Comparison of Corpora • Compare corpora for: • document clustering (for information retrieval) • plagiarism detection • Comparison technique: • Competing hypotheses: • the documents are dependent • the documents are independent • Compare the hypotheses using the likelihood ratio λ, etc.

  35. Parsing • When parsing, we may get more accurate data by treating a collocation as a unit (rather than as individual words) • Example: [hand to hand] is a unit in “They engaged in hand to hand combat”; a parser that does not treat it as a unit may produce: (S (NP They) (VP engaged (PP in hand) (PP to (NP hand combat))))

  36. New Topic Detection • When new topics are reported, the count of collocations associated with those topics increases • When topics become old, the count drops

  37. Computational Lexicography • As new multi-word expressions become part of the language, they can be detected • Existing collocations can be acquired • Can also be used for cultural identification • Examples: • My friend got an A in his class • My friend took an A in his class • My friend made an A in his class • My friend earned an A in his class

  38. Natural Language Generation • Problem: • Given two (or more) possible productions, which is more feasible? • Productions usually involve synonyms or near-synonyms • Languages generally favour one production

  39. Machine Translation • Collocation-complete problem? • Must find all used collocations • Must parse collocation as a unit • Must translate collocation as a unit • In target language production, must select among many plausible alternatives

  40. Thanks! • Questions?

  41. Statistical Inference: n-gram Model over Sparse Data

  42. Statistical inference • Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about its distribution.

  43. Language Models • Predict the next word, given the previous words (this sort of task is often referred to as a Shannon game) • A language model can take the context into account. • Determine the probability of different sequences by examining a training corpus • Applications: • OCR / speech recognition – resolve ambiguity • Spelling correction • Machine translation, etc.

  44. Statistical Estimators • Example: • Corpus: five Jane Austen novels, N = 617,091 words, V = 14,585 unique words • Task: predict the next word of the trigram “inferior to ___” from the test data (Persuasion): “[In person, she was] inferior to both [sisters.]” • Given the observed training data … • How do you develop a model (probability distribution) to predict future events?

  45. The Perfect Language Model • Sequence of word forms • Notation: W = (w1,w2,w3,...,wn) • The big (modeling) question: what is p(W)? • Well, we know (Bayes/chain rule): p(W) = p(w1,w2,w3,...,wn) = p(w1) p(w2|w1) p(w3|w1,w2) ... p(wn|w1,w2,...,wn−1) • Not practical (even for short W → too many parameters)

  46. Markov Chain • Unlimited memory (cf. previous foil): • for wi, we know all its predecessors w1,w2,w3,...,wi−1 • Limited memory: • we disregard predecessors that are “too old” • remember only k previous words: wi−k,wi−k+1,...,wi−1 • called a “kth order Markov approximation” • Stationary character (no change over time): p(W) ≅ ∏i=1..n p(wi|wi−k,wi−k+1,...,wi−1), n = |W|

  47. N-gram Language Models • (n−1)th order Markov approximation → n-gram LM: p(W) = ∏i=1..n p(wi|wi−n+1,wi−n+2,...,wi−1) • In particular (assume vocabulary size |V| = 20k): • 0-gram LM: uniform model p(w) = 1/|V|, 1 parameter • 1-gram LM: unigram model p(w), 2×10⁴ parameters • 2-gram LM: bigram model p(wi|wi−1), 4×10⁸ parameters • 3-gram LM: trigram model p(wi|wi−2,wi−1), 8×10¹² parameters • 4-gram LM: tetragram model p(wi|wi−3,wi−2,wi−1), 1.6×10¹⁷ parameters

  48. Reliability vs. Discrimination “large green ___________” tree? mountain? frog? car? “swallowed the large green ________” pill? tidbit? • larger n: more information about the context of the specific instance (greater discrimination) • smaller n: more instances in training data, better statistical estimates (more reliability)

  49. LM Observations • How large should n be? • in theory, the larger the better (no fixed n is truly enough) • so in practice: as much as possible (as close to the “perfect” model as possible) • empirically: 3 • limited by parameter estimation (reliability, data availability, storage space, ...) • 4 is already too much: |V| = 60k → 1.296×10¹⁹ parameters • but: 6-7 would be (almost) ideal (given enough data) • For now, word forms only (no “linguistic” processing)

  50. Parameter Estimation • Parameter: numerical value needed to compute p(w|h) • From data (how else?) • Data preparation: • get rid of formatting etc. (“text cleaning”) • define words (separate but include punctuation, call it “word”, unless speech) • define sentence boundaries (insert “words” <s> and </s>) • letter case: keep, discard, or be smart: • name recognition • number type identification
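A sketch of the data-preparation and counting steps described above (plain Python; the boundary tokens <s> and </s> follow the slide, while the whitespace tokenization and MLE bigram estimates are simplifications for illustration):

```python
from collections import Counter

def prepare(sentences, lowercase=True):
    """Turn raw sentences into token lists with sentence-boundary markers.
    Real text cleaning would also separate punctuation into its own tokens."""
    out = []
    for s in sentences:
        tokens = s.lower().split() if lowercase else s.split()
        out.append(["<s>"] + tokens + ["</s>"])
    return out

def bigram_mle(prepared):
    """Maximum-likelihood bigram estimates p(w_i | w_{i-1}) from raw counts."""
    uni, bi = Counter(), Counter()
    for sent in prepared:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    return {(w1, w2): c / uni[w1] for (w1, w2), c in bi.items()}

# Example:
probs = bigram_mle(prepare(["she was inferior to both sisters"]))
print(probs[("inferior", "to")])
```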
