Collocations. Definition Of Collocation (wrt Corpus Literature).
Definition Of Collocation (wrt Corpus Literature) • A collocation is defined as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. [Choueka, 1988]
Word Collocations • Collocation • Firth: “word is characterized by the company it keeps”; collocations of a given word are statements of the habitual or customary places of that word. • non-compositionality of meaning • cannot be derived directly from its parts (heavy rain) • non-substitutability in context • for parts (make a decision) • non-modifiability (& non-transformability) • not acceptable: *kick the yellow bucket; *take exceptions to
Collocations • Collocations are not necessarily adjacent • Collocations cannot be directly translated into other languages.
Example Classes • Names • Technical Terms • “Light” Verb Constructions • Phrasal verbs • Noun Phrases
Linguistic Subclasses of Collocations • Light verbs: verbs with little semantic content like make, take, do • Terminological Expressions: concepts and objects in technical domains (e.g., hard drive) • Idioms: fixed phrases • kick the bucket, birds-of-a-feather, run for office • Proper names: difficult to recognize even with lists • Tuesday (person’s name), May, Winston Churchill, IBM, Inc. • Numerical expressions • containing “ordinary” words • Monday Oct 04 1999, two thousand seven hundred fifty • Verb particle constructions or Phrasal Verbs • Separable parts: • look up, take off, tell off
Collocation Detection Techniques • Selection of Collocations by Frequency • Selection of Collocation based on Mean and Variance of the distance between focal word and collocating word. • Hypothesis Testing • Pointwise Mutual Information
Frequency • Technique: • Count the number of times a bigram co-occurs • Extract top counts and report them as candidates • Results: • Corpus: New York Times • August – November, 1990 • Extremely uninteresting: the top-ranked bigrams are mostly function-word pairs (e.g., “of the”, “in the”)
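The frequency technique above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the New York Times corpus: the toy sentence, the whitespace tokenizer, and the function name `top_bigrams` are all assumptions.

```python
from collections import Counter

def top_bigrams(tokens, k=5):
    """Return the k most frequent adjacent word pairs with their counts."""
    bigrams = zip(tokens, tokens[1:])
    return Counter(bigrams).most_common(k)

# Toy corpus (hypothetical; the slides use New York Times text).
text = "the cat sat on the mat and the cat slept on the mat"
tokens = text.split()

result = top_bigrams(tokens, k=3)
for pair, count in result:
    print(pair, count)
```

Even on this toy input, function-word pairs such as ("on", "the") rank at the top, which is why raw frequency alone gives poor candidates.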
Frequency with Tag Filters Technique • Technique: • Count the number of times a bigram co-occurs • Tag candidates for POS • Pass all candidates through POS filter, considering only ones matching filter • Extract top counts and report them as candidates
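A sketch of the tag-filter step, assuming the tokens are already POS-tagged. The slides do not list the filter patterns, so the adjective-noun and noun-noun patterns below, the tag names, and the hand-tagged example are assumptions made for illustration.

```python
from collections import Counter

# Assumed filter: keep only adjective-noun and noun-noun bigrams.
ALLOWED_PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}

def filtered_bigrams(tagged_tokens, k=5):
    """Count adjacent (word, tag) pairs whose tag pattern passes the filter."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in ALLOWED_PATTERNS:
            counts[(w1, w2)] += 1
    return counts.most_common(k)

# Hand-tagged toy input (hypothetical); a real pipeline would run a tagger first.
tagged = [
    ("the", "DET"), ("hard", "ADJ"), ("drive", "NOUN"), ("of", "ADP"),
    ("the", "DET"), ("hard", "ADJ"), ("drive", "NOUN"), ("failed", "VERB"),
]

candidates = filtered_bigrams(tagged)
print(candidates)
```

Function-word pairs like ("of", "the") never reach the counter, so the surviving candidates are far more likely to be phrases.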
Mean and Variance (Smadja et al., 1993) • Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible (although regular) relationships. For example, • Knock and door may not occur at a fixed distance from each other • One method of detecting these flexible relationships uses the mean and variance of the offset (signed distance) between the two words in the corpus.
Example: Knock and Door • She knocked on his door. • They knocked at the door. • 100 women knocked on the big red door. • A man knocked on the metal front door. • Average offset between knock and door: $(3 + 3 + 5 + 5)/4 = 4$ • Sample variance: $((3-4)^2 + (3-4)^2 + (5-4)^2 + (5-4)^2)/(4-1) = 4/3 \approx 1.33$ • Sample standard deviation: $\sqrt{4/3} \approx 1.15$
Mean and Variance • Technique (bigram at distance) • Produce all possible pairs in a window • Consider all pairs in window as candidates • Keep data about distance of one word from another • Count the number of times each candidate occurs • Measures: • Mean: average offset (possibly negative) • Indicates whether two words are related to each other • Variance: σ(offset), reported as the standard deviation of the offset • Variability in the relative position of the two words
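The offset statistics can be checked on the knock/door sentences with a short script. This is a sketch under stated assumptions: prefix matching stands in for stemming ("knock" matches "knocked"), the window of 5 is chosen so that every example pair is captured, and the helper name `offsets` is hypothetical.

```python
from statistics import mean, stdev, variance

def offsets(sentences, w1, w2, window=5):
    """Collect signed offsets (position of w2 minus position of w1) within a window."""
    found = []
    for sent in sentences:
        toks = sent.lower().rstrip(".").split()
        for i, a in enumerate(toks):
            if not a.startswith(w1):  # crude stand-in for stemming
                continue
            for j, b in enumerate(toks):
                if b.startswith(w2) and 0 < abs(j - i) <= window:
                    found.append(j - i)
    return found

sentences = [
    "She knocked on his door.",
    "They knocked at the door.",
    "100 women knocked on the big red door.",
    "A man knocked on the metal front door.",
]

d = offsets(sentences, "knock", "door")
print(d, mean(d), variance(d), stdev(d))
```

`statistics.variance` and `statistics.stdev` use the n-1 denominator, matching the sample variance on the example slide. A low standard deviation with a non-zero mean suggests a pair that co-occurs at a fairly fixed distance.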
Mean and Variance Illustration • Candidate Generation example: • Window: 3 • Used to find collocations with long-distance relationships
Hypothesis Testing: Overview • Two (or more) words co-occur a lot • Is a candidate a true collocation, or a (not-at-all-interesting) phantom?
The t test Intuition • Intuition: • Compute chance occurrence and ensure observed is significantly higher • Take several permutations of the words in the corpus • How much more frequent is the observed bigram than it is across those permutations? • Assumptions: • H0 is the null hypothesis (words occur independently) • $P(w_1, w_2) = P(w_1) P(w_2)$ • Distribution is “normal”
The t test Formula • Formula: $t = (\bar{x} - \mu) / \sqrt{s^2 / N}$ • Measures: • $\bar{x}$ = observed relative frequency of the bigram (bigram count / N) • $\mu$ = expected bigram probability under H0 = $P(w_1) P(w_2)$ • $s^2 \approx \bar{x}$ (since $p(1 - p) \approx p$ for small p) • N = total number of bigrams • Result: • Number to look up in a table • Degree of confidence that collocation is not created by chance • $\alpha$ = the significance level at which one can reject H0
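A worked instance of the t statistic, as a sketch. The counts below are illustrative values assumed for this example (they do not come from the slides), and the function name `t_score` is hypothetical. The variance uses $\bar{x}(1 - \bar{x})$, which is practically equal to $\bar{x}$ for rare bigrams.

```python
import math

def t_score(c1, c2, c12, n):
    """t statistic for a bigram w1 w2 under the independence hypothesis H0.

    c1, c2: unigram counts of w1 and w2; c12: bigram count; n: corpus size.
    """
    x_bar = c12 / n              # observed bigram relative frequency
    mu = (c1 / n) * (c2 / n)     # expected probability under H0
    s2 = x_bar * (1 - x_bar)     # Bernoulli variance, about x_bar when small
    return (x_bar - mu) / math.sqrt(s2 / n)

# Illustrative counts (assumed): a common word paired with a mid-frequency noun.
t = t_score(c1=15828, c2=4675, c12=8, n=14307668)
print(round(t, 4))
```

With these counts t is close to 1, well below the 2.576 critical value for α = 0.005, so H0 (independence) would not be rejected. Note the function divides by zero when `c12` is 0; unseen bigrams need separate handling.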
The t test Criticism • Words are not normally distributed • Can reject valid collocation • Not good on sparse data
χ² Intuition • Pearson’s chi-square test • Intuition • Compare observed frequencies to expected frequencies for independence • Assumptions • Does not assume that the probabilities are normally distributed • The sample must not be small for the χ² approximation to hold
χ² General Formula • Formula: $\chi^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$ • Measures: • $E_{ij}$ = Expected count of the bigram • $O_{ij}$ = Observed count of the bigram • Result • A number to look up in a table (like the t test) • Degree of confidence ($\alpha$) with which H0 can be rejected
χ² Bigram Method and Formula • Technique for Bigrams: • Arrange the bigrams in a 2x2 table with counts for each • Formula • $O_{ij}$: i = column; j = row
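For a 2x2 table, Pearson's statistic reduces to a closed form in the four observed cells, which the sketch below implements. The counts are illustrative assumptions, the helper names are hypothetical, and because the closed form is symmetric under transposing the table, the row/column convention does not affect the result.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 contingency table (closed form)."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

def bigram_table(c1, c2, c12, n):
    """Build the 2x2 cells from unigram counts c1, c2, bigram count c12, size n."""
    o11 = c12                  # w1 followed by w2
    o12 = c2 - c12             # not-w1 followed by w2
    o21 = c1 - c12             # w1 followed by not-w2
    o22 = n - c1 - c2 + c12    # neither
    return o11, o12, o21, o22

# Illustrative counts (assumed for this sketch).
chi2 = chi_square_2x2(*bigram_table(c1=15828, c2=4675, c12=8, n=14307668))
print(round(chi2, 2))
```

The result is compared against the χ² critical value for one degree of freedom (3.841 at α = 0.05); with these counts the statistic falls below it, so independence is not rejected.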
χ² Sample Findings • Comparing corpora • Machine Translation • Comparison of (English) “cow” and (French) “vache” gives χ² = 456400 • Similarity of two corpora
χ² Criticism • Not good for small datasets
Likelihood Ratios Within a Single Corpus (Dunning, 1993) • Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic. • In applying the likelihood ratio test to collocation discovery, use the following two alternative explanations for the occurrence frequency of a bigram w1 w2: • H1: The occurrence of w2 is independent of the previous occurrence of w1: $P(w_2 \mid w_1) = p = P(w_2 \mid \neg w_1)$ • H2: The occurrence of w2 is dependent on the previous occurrence of w1: $P(w_2 \mid w_1) = p_1 \neq p_2 = P(w_2 \mid \neg w_1)$
Likelihood Ratios Within a Single Corpus • Use the MLE for the probabilities $p$, $p_1$, and $p_2$ and assume the binomial distribution ($c_1$, $c_2$, $c_{12}$ are the counts of w1, w2, and w1 w2; N is the corpus size): • Under H1: $P(w_2 \mid w_1) = c_2/N$, $P(w_2 \mid \neg w_1) = c_2/N$ • Under H2: $P(w_2 \mid w_1) = c_{12}/c_1 = p_1$, $P(w_2 \mid \neg w_1) = (c_2 - c_{12})/(N - c_1) = p_2$ • Under H1: $b(c_{12}; c_1, p)$ is the probability that $c_{12}$ out of $c_1$ bigrams are w1 w2, and $b(c_2 - c_{12}; N - c_1, p)$ the probability that $c_2 - c_{12}$ out of $N - c_1$ bigrams are ¬w1 w2 • Under H2: $b(c_{12}; c_1, p_1)$ is the probability that $c_{12}$ out of $c_1$ bigrams are w1 w2, and $b(c_2 - c_{12}; N - c_1, p_2)$ the probability that $c_2 - c_{12}$ out of $N - c_1$ bigrams are ¬w1 w2
Likelihood Ratios Within a Single Corpus • The likelihood of H1 • $L(H_1) = b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p)$ (likelihood of independence) • The likelihood of H2 • $L(H_2) = b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2)$ (likelihood of dependence) • The log of likelihood ratio • $\log \lambda = \log [L(H_1)/L(H_2)] = \log b(c_{12}; c_1, p) + \log b(c_2 - c_{12}; N - c_1, p) - \log b(c_{12}; c_1, p_1) - \log b(c_2 - c_{12}; N - c_1, p_2)$ • The quantity $-2 \log \lambda$ is asymptotically $\chi^2$ distributed, so we can test for significance.
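The log likelihood ratio can be computed directly from the four counts. The sketch below works in log space and drops the binomial coefficients, which are identical under both hypotheses and cancel in the ratio. The counts in the usage lines are made up for illustration, and the function names and the clamping constant are assumptions.

```python
import math

def log_binom_lik(k, n, x):
    """log of x^k * (1-x)^(n-k); the binomial coefficient cancels in the ratio."""
    eps = 1e-12                      # clamp to avoid log(0) at x = 0 or x = 1
    x = min(max(x, eps), 1 - eps)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def neg2_log_lambda(c1, c2, c12, n):
    """-2 log(lambda) for bigram w1 w2; larger values mean stronger association."""
    p = c2 / n                       # H1: one shared probability
    p1 = c12 / c1                    # H2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)       # H2: P(w2 | not w1)
    log_lambda = (
        log_binom_lik(c12, c1, p) + log_binom_lik(c2 - c12, n - c1, p)
        - log_binom_lik(c12, c1, p1) - log_binom_lik(c2 - c12, n - c1, p2)
    )
    return -2 * log_lambda

independent = neg2_log_lambda(c1=100, c2=200, c12=20, n=1000)   # p = p1 = p2
associated = neg2_log_lambda(c1=100, c2=200, c12=90, n=1000)    # p1 far above p2
print(independent, associated)
```

When the observed bigram count matches what independence predicts, the statistic is essentially zero; it grows as $p_1$ and $p_2$ diverge, and can be compared against a χ² table as the slide describes.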
[Pointwise] Mutual Information (I) • Intuition: • Given a collocation (w1, w2) and an observation of w1 • I(w1; w2) indicates how much more likely it is to see w2 • The same measure also works in reverse (observe w2) • Assumptions: • Data is not sparse
Mutual Information Formula • Formula: $I(w_1; w_2) = \log_2 \frac{P(w_1 w_2)}{P(w_1) P(w_2)} = \log_2 \frac{P(w_2 \mid w_1)}{P(w_2)}$ • Measures: • $P(w_1)$ = unigram prob. • $P(w_1 w_2)$ = bigram prob. • $P(w_2 \mid w_1)$ = probability of w2 given we see w1 • Result: • Number indicating increased confidence that we will see w2 after w1
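A small sketch of pointwise mutual information computed from raw counts, using maximum likelihood estimates for the probabilities. The counts are invented for illustration and the function name `pmi` is hypothetical.

```python
import math

def pmi(c1, c2, c12, n):
    """Pointwise mutual information (in bits) of bigram w1 w2 from counts."""
    p12 = c12 / n
    p1 = c1 / n
    p2 = c2 / n
    return math.log2(p12 / (p1 * p2))

# Bigram seen exactly as often as independence predicts: PMI is about 0.
independent_pmi = pmi(c1=100, c2=200, c12=20, n=1000)

# Two words that each occur once, together, in a 1024-token corpus.
rare_pmi = pmi(c1=1, c2=1, c12=1, n=1024)
print(independent_pmi, rare_pmi)
```

The second case shows the sparse-data problem: a pair observed a single time receives a score of 10 bits, higher than most well-attested collocations would get, even though one observation is weak evidence.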
Mutual Information Criticism • A better measure of the independence of two words rather than the dependence of one word on another • Horrible on [read: misidentifies] sparse data
Applications • Collocations are useful in: • Comparison of Corpora • Parsing • New Topic Detection • Computational Lexicography • Natural Language Generation • Machine Translation
Comparison of Corpora • Compare corpora to determine: • Document clustering (for information retrieval) • Plagiarism • Comparison techniques: • Competing hypotheses: • Documents are dependent • Documents are independent • Compare hypotheses using λ, etc.
Parsing • When parsing, we may get more accurate data by treating a collocation as a unit (rather than individual words) • Example: [ hand to hand ] is a unit in: (S (NP They) (VP engaged (PP in hand) (PP to (NP hand combat))))
New Topic Detection • When new topics are reported, the count of collocations associated with those topics increases • When topics become old, the count drops
Computational Lexicography • As new multi-word expressions become part of the language, they can be detected • Existing collocations can be acquired • Can also be used for cultural identification • Examples: • My friend got an A in his class • My friend took an A in his class • My friend made an A in his class • My friend earned an A in his class
Natural Language Generation • Problem: • Given two (or more) possible productions, which is more feasible? • Productions usually involve synonyms or near-synonyms • Languages generally favour one production
Machine Translation • Collocation-complete problem? • Must find all used collocations • Must parse collocation as a unit • Must translate collocation as a unit • In target language production, must select among many plausible alternatives
Thanks! • Questions?
Statistical inference • Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about its distribution.
Language Models • Predict the next word, given the previous words (this sort of task is often referred to as a Shannon game) • A language model can take the context into account. • Determine probability of different sequences by examining training corpus • Applications: • OCR / Speech recognition – resolve ambiguity • Spelling correction • Machine translation, etc.
Statistical Estimators • Example: Corpus: five Jane Austen novels N = 617,091 words, V = 14,585 unique words Task: predict the next word of the trigram “inferior to ___” from test data, Persuasion: “[In person, she was] inferior to both [sisters.]” • Given the observed training data … • How do you develop a model (probability distribution) to predict future events?
The Perfect Language Model • Sequence of word forms • Notation: $W = (w_1, w_2, w_3, \ldots, w_n)$ • The big (modeling) question is what is $p(W)$? • Well, we know (Bayes/chain rule): $p(W) = p(w_1, w_2, w_3, \ldots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1})$ • Not practical (even for short W → too many parameters)
Markov Chain • Unlimited memory (cf. previous foil): • for $w_i$, we know its predecessors $w_1, w_2, w_3, \ldots, w_{i-1}$ • Limited memory: • we disregard predecessors that are “too old” • remember only k previous words: $w_{i-k}, w_{i-k+1}, \ldots, w_{i-1}$ • called “kth order Markov approximation” • Stationary character (no change over time): $p(W) \approx \prod_{i=1}^{n} p(w_i \mid w_{i-k}, w_{i-k+1}, \ldots, w_{i-1})$, $n = |W|$
N-gram Language Models • (n-1)th order Markov approximation → n-gram LM: $p(W) = \prod_{i=1}^{n} p(w_i \mid w_{i-n+1}, w_{i-n+2}, \ldots, w_{i-1})$ • In particular (assume vocabulary |V| = 20k): • 0-gram LM: uniform model, $p(w) = 1/|V|$, 1 parameter • 1-gram LM: unigram model, $p(w)$, $2 \times 10^{4}$ parameters • 2-gram LM: bigram model, $p(w_i \mid w_{i-1})$, $4 \times 10^{8}$ parameters • 3-gram LM: trigram model, $p(w_i \mid w_{i-2}, w_{i-1})$, $8 \times 10^{12}$ parameters • 4-gram LM: tetragram model, $p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})$, $1.6 \times 10^{17}$ parameters
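The parameter counts above follow from $|V|^n$, and a bigram model can be estimated from counts in a few lines. The sketch below assumes unsmoothed maximum likelihood estimation and a two-sentence toy corpus; the function names are hypothetical, and the `<s>` / `</s>` boundary markers follow the usual convention.

```python
from collections import Counter

def parameter_count(vocab_size, n):
    """Number of parameters of an n-gram model over a vocabulary: |V|^n."""
    return vocab_size ** n

def train_bigram(sentences):
    """Return an unsmoothed MLE estimator p(w | prev) built from raw counts."""
    history, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        history.update(toks[:-1])            # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))

    def prob(prev, w):
        return bigrams[(prev, w)] / history[prev] if history[prev] else 0.0

    return prob

def sentence_prob(prob, sentence):
    """p(W) as a product of bigram probabilities (first-order Markov)."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, w in zip(toks, toks[1:]):
        p *= prob(prev, w)
    return p

p = train_bigram(["the cat sat", "the cat slept"])   # toy corpus (assumed)
print(parameter_count(20_000, 3), p("cat", "sat"), sentence_prob(p, "the cat sat"))
```

Any bigram absent from training gets probability zero under this estimator, which is the reliability problem that motivates smoothing.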
Reliability vs. Discrimination “large green ___________” tree? mountain? frog? car? “swallowed the large green ________” pill? tidbit? • larger n: more information about the context of the specific instance (greater discrimination) • smaller n: more instances in training data, better statistical estimates (more reliability)
LM Observations • How large n? • no n is enough (theoretically) • but anyway: as much as possible (as close to “perfect” model as possible) • empirically: 3 • parameter estimation? (reliability, data availability, storage space, ...) • 4 is too much: |V| = 60k → $1.296 \times 10^{19}$ parameters • but: 6-7 would be (almost) ideal (having enough data) • For now, word forms only (no “linguistic” processing)
Parameter Estimation • Parameter: numerical value needed to compute p(w|h) • From data (how else?) • Data preparation: • get rid of formatting etc. (“text cleaning”) • define words (separate but include punctuation, call it “word”, unless speech) • define sentence boundaries (insert “words” <s> and </s>) • letter case: keep, discard, or be smart: • name recognition • number type identification
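The data preparation steps can be sketched as follows. This is a deliberately naive illustration: the regular expressions, the choice to lowercase everything, and the function name `prepare` are assumptions, and the sentence splitter will break on abbreviations such as "Inc." or "Dr.".

```python
import re

def prepare(text, lowercase=True):
    """Split text into sentences, tokenize, and add <s> / </s> boundary words."""
    # Naive sentence boundary: whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    prepared = []
    for s in sentences:
        if lowercase:
            s = s.lower()            # "discard" option for letter case
        # Words stay whole; each punctuation mark becomes its own "word".
        tokens = re.findall(r"\w+|[^\w\s]", s)
        prepared.append(["<s>"] + tokens + ["</s>"])
    return prepared

corpus = prepare("She knocked on his door. They left!")
print(corpus)
```

The "be smart" option for letter case (name recognition, number type identification) would replace the blanket lowercasing with per-token decisions and is not attempted here.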