Measures of Coincidence Vasileios Hatzivassiloglou University of Texas at Dallas
A study of different measures • Smadja, McKeown, and Hatzivassiloglou (1996): Translating Collocations for Bilingual Lexicons: A Statistical Approach • Use aligned parallel corpora (Hansards) • Task: Find translation for a word group across languages
Sketch of algorithm • Start with set of collocations in French • Find candidate single word translations according to association between original collocation and translation • Measure association between source collocation and pairs of candidate words • Expand iteratively to triplets, etc. by recalculating association
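The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of the 1996 paper: the function names, the Dice threshold of 0.5, the limit of three words, and the rule of preferring a larger group on ties are all hypothetical choices made for the example, and the toy corpus is invented.

```python
from itertools import combinations


def dice(source_hits, target_hits):
    """Dice coefficient between two sets of sentence indices."""
    if not source_hits or not target_hits:
        return 0.0
    overlap = len(source_hits & target_hits)
    return 2 * overlap / (len(source_hits) + len(target_hits))


def translate_collocation(source_colloc, aligned_pairs, threshold=0.5, max_size=3):
    """Return (word_group, score) for the best-scoring target word group.

    source_colloc: set of source-language words.
    aligned_pairs: list of (source_tokens, target_tokens) sentence pairs.
    threshold and max_size are illustrative values, not taken from the paper.
    """
    # Sentences whose source side contains every word of the collocation.
    source_hits = {i for i, (src, _) in enumerate(aligned_pairs)
                   if source_colloc <= set(src)}

    # Index: target word -> set of sentences that contain it.
    occurrences = {}
    for i, (_, tgt) in enumerate(aligned_pairs):
        for word in set(tgt):
            occurrences.setdefault(word, set()).add(i)

    # Round 1: keep single words associated with the source collocation.
    singles = {}
    for word, hits in occurrences.items():
        score = dice(source_hits, hits)
        if score >= threshold:
            singles[frozenset([word])] = score
    candidate_words = sorted(w for group in singles for w in group)
    best = max(singles.items(), key=lambda kv: kv[1], default=(frozenset(), 0.0))

    # Later rounds: recompute the association for pairs, triplets, and so on.
    for size in range(2, max_size + 1):
        scored = {}
        for group in combinations(candidate_words, size):
            hits = set.intersection(*(occurrences[w] for w in group))
            score = dice(source_hits, hits)
            if score >= threshold:
                scored[frozenset(group)] = score
        if not scored:
            break
        top = max(scored.items(), key=lambda kv: kv[1])
        if top[1] >= best[1]:  # assumption: prefer the larger group on ties
            best = top
    return best


# Invented toy corpus of aligned (French, English) sentences.
pairs = [
    (["premier", "ministre", "parle"], ["prime", "minister", "speaks"]),
    (["premier", "ministre"], ["prime", "minister"]),
    (["le", "premier", "jour"], ["the", "first", "day"]),
    (["ministre", "des", "finances"], ["minister", "of", "finance"]),
    (["premier", "ministre", "canadien"], ["canadian", "prime", "minister"]),
]
group, score = translate_collocation({"premier", "ministre"}, pairs)
print(sorted(group), score)
```

On this toy corpus the pair "prime" + "minister" co-occurs with the source collocation in exactly the same sentences, so it reaches a Dice score of 1.0, while the triplets fall back to 0.5.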
Dice vs. SI • Dice depends on conditional probabilities only • SI (specific mutual information) depends on the marginals: log P(X|Y) − log P(X) • SI depends on how rare X is • Limit behavior
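Written out with the standard definitions of the two measures (assumed here to be the ones the slides intend, with X=1 and Y=1 marking presence), the contrast is:

```latex
\begin{align*}
\mathrm{Dice}(X,Y) &= \frac{2\,P(X{=}1,Y{=}1)}{P(X{=}1)+P(Y{=}1)}
  = \frac{2}{\dfrac{1}{P(X{=}1\mid Y{=}1)}+\dfrac{1}{P(Y{=}1\mid X{=}1)}} \\[6pt]
\mathrm{SI}(X,Y) &= \log_2\frac{P(X{=}1,Y{=}1)}{P(X{=}1)\,P(Y{=}1)}
  = \log_2 P(X{=}1\mid Y{=}1) - \log_2 P(X{=}1)
\end{align*}
```

Dice is the harmonic mean of the two conditional probabilities, so no marginal appears on its own. In SI, the term −log₂ P(X=1) grows without bound as P(X=1) → 0 with P(X=1|Y=1) held fixed.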
Asymmetry • Many kinds of asymmetry • Between X and Y • Between X=1 and X=0 • 1-1 matches versus 0-0 matches • Adding 0-0 matches does not change Dice • Adding 0-0 matches always increases SI
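The last two bullets follow directly from the count form of the measures. Assuming a 2×2 table with a sentences containing both items (1-1), b and c sentences containing only one of them, and d sentences containing neither (0-0), the standard definitions become:

```latex
\begin{align*}
\mathrm{Dice} &= \frac{2a}{(a+b)+(a+c)} = \frac{2a}{2a+b+c} \\[6pt]
\mathrm{SI} &= \log_2\frac{a/N}{\dfrac{a+b}{N}\cdot\dfrac{a+c}{N}}
  = \log_2\frac{a\,N}{(a+b)(a+c)}, \qquad N = a+b+c+d
\end{align*}
```

The count d does not appear in Dice at all, whereas SI grows with N, and therefore with d, whenever a > 0.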
Effect of asymmetry • Hypothetical scenario on 100 sentences • A and B appear together twice, and each appears alone three times • Dice: 2×2 / (5+5) = 0.4 • SI: log₂(0.02 / (0.05×0.05)) = 3 bits • MI: 0.0457 bits
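The three figures on this slide can be reproduced from the four cell counts of the 2×2 table. The sketch below assumes the standard definitions of Dice, SI (base-2 log of the joint over the product of marginals) and MI (the average of that log ratio over all four cells); the function name `measures` is made up for the example.

```python
from math import log2


def measures(n11, n10, n01, n00):
    """Return (Dice, SI in bits, MI in bits) for a 2x2 table of counts.

    n11: both present, n10 / n01: only one present, n00: neither present.
    Assumes n11 > 0 and non-degenerate marginals.
    """
    n = n11 + n10 + n01 + n00
    px = (n11 + n10) / n  # P(X = 1)
    py = (n11 + n01) / n  # P(Y = 1)

    dice = 2 * n11 / ((n11 + n10) + (n11 + n01))
    si = log2((n11 / n) / (px * py))

    # MI sums the weighted log ratio over all four cells, not just the 1-1 cell.
    cells = (
        (n11, px, py),
        (n10, px, 1 - py),
        (n01, 1 - px, py),
        (n00, 1 - px, 1 - py),
    )
    mi = 0.0
    for count, pa, pb in cells:
        if count:  # a zero cell contributes nothing
            p = count / n
            mi += p * log2(p / (pa * pb))
    return dice, si, mi


# 100 sentences: together twice, each alone three times, 92 with neither.
dice, si, mi = measures(2, 3, 3, 92)
print(f"Dice = {dice:.4f}, SI = {si:.4f} bits, MI = {mi:.4f} bits")
# prints: Dice = 0.4000, SI = 3.0000 bits, MI = 0.0457 bits
```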
Reversing ones and zeros • Now replace every 1 with 0 and vice versa • New variables A′, B′ occur together 92 times, and each occurs alone three times • Dice: 2×92 / (95 + 95) = 0.9684 • MI: Unchanged (0.0457 bits) • SI: log₂(0.92 / (0.95×0.95)) = 0.0277 bits
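Why MI is unchanged while Dice and SI move can be seen from the definition of MI as a sum over all four cells (standard definition, assumed to be the one used here):

```latex
\begin{align*}
I(A;B) &= \sum_{a\in\{0,1\}}\sum_{b\in\{0,1\}} P(a,b)\,\log_2\frac{P(a,b)}{P(a)\,P(b)} \\[6pt]
A' = 1-A,\; B' = 1-B \;&\Longrightarrow\; P(A'{=}a',B'{=}b') = P(A{=}1{-}a',\,B{=}1{-}b') \\[6pt]
I(A';B') &= \sum_{a'}\sum_{b'} P(1{-}a',1{-}b')\,\log_2\frac{P(1{-}a',1{-}b')}{P(1{-}a')\,P(1{-}b')} = I(A;B)
\end{align*}
```

Relabeling only reorders the four terms of the sum. Dice and SI, in contrast, read off the 1-1 cell and the marginals of the "present" outcome, so they change when the 1-1 and 0-0 cells trade places.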
Explaining the behavior • Limit effect as P(X) decreases with P(X|Y) constant • P(X) eventually dominates SI • Makes SI (and MI) more sensitive to estimation errors
Bounds and testing purpose • No upper bound for SI and MI • Dice is always between 0 and 1 • Easy to test SI/MI for independence • Easy to test Dice for correlation
Empirical comparison • How to compare without redoing the entire experiment? • Solution: Use competing measure in the last round • Test cases where the correct solution is available • Provide lower bound on competitor error
Empirical results • 45 French collocations • 2 did not produce any candidate translation • Dice resulted in 36 correct, 7 incorrect translations • SI resulted in 26 correct, 17 incorrect translations
Re-examining contingency tables • Ted Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence”, Computational Linguistics, 1993. • Problem: Asymptotic normality assumptions • How much data is enough? • Are researchers aware of the need for statistical validity analysis?
Rarity of words • Empirical counts show that 20–30% of the words in a text have a frequency below 1 in 50,000 • Estimating binomial as normal: Good as long as np(1−p) > 5 • Significance overestimated by 20% for np=1, 40 for np=0.1, 1020 for np=0.01
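The effect of a small np can be checked directly by comparing an exact binomial tail with its normal approximation. The sketch below is an illustration only: the sample size n = 1000, the observed count k = 2, and the omission of a continuity correction are assumptions made for the example, so the ratios it prints are not meant to reproduce the figures on the slide.

```python
from math import comb, erfc, sqrt


def binomial_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), computed via the complement."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


def normal_tail(k, n, p):
    """Normal approximation to P(X >= k), without continuity correction."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return 0.5 * erfc((k - mean) / (sd * sqrt(2)))


n, k = 1000, 2  # illustrative choices, not taken from the slides
ratios = {}
for expected in (1.0, 0.1, 0.01):  # values of np
    p = expected / n
    exact = binomial_tail(k, n, p)
    approx = normal_tail(k, n, p)
    # A ratio above 1 means the normal tail is too small, i.e. the event
    # looks more significant than it really is.
    ratios[expected] = exact / approx
    print(f"np = {expected:<5} exact = {exact:.3e}  normal = {approx:.3e}  "
          f"ratio = {ratios[expected]:.3e}")
```

The ratio of the exact tail to the approximate tail grows quickly as np shrinks, which is the overestimation of significance the slide refers to.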
Likelihood in parameter spaces • Parametric model (known except for parameter values) • Likelihood function H(ω;k) • Hypothesis represented by a point ω0
Likelihood ratio • Test statistic: −2 log λ • Rapidly approaches χ² distribution for binomial H
Comparing to chi-square • Reduces to the same formula as Pearson’s chi-square statistic when the binomial is approximated by a normal distribution • Diverges significantly from Pearson’s statistic for low np • Still closely follows the chi-square distribution in that regime
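For a 2×2 contingency table the two statistics can be computed side by side. The sketch below uses the common observed-versus-expected form of the log-likelihood ratio, 2 Σ O ln(O/E), which is assumed here to be equivalent to −2 log λ for this table; the function names and the example counts are invented for illustration.

```python
from math import log


def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    # Assumes every row and column total is non-zero.
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]


def log_likelihood_ratio(table):
    """Log-likelihood ratio statistic: 2 * sum of O * ln(O / E) over the cells."""
    exp = expected_counts(table)
    return 2.0 * sum(
        o * log(o / e)
        for row_o, row_e in zip(table, exp)
        for o, e in zip(row_o, row_e)
        if o > 0  # an empty cell contributes nothing
    )


def pearson_chi_square(table):
    """Pearson's statistic: sum of (O - E)^2 / E over the cells."""
    exp = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(table, exp)
        for o, e in zip(row_o, row_e)
    )


# Well-populated table: the two statistics are close to each other.
balanced = ((30, 20), (20, 30))
# Rare pair: two words that each occur once, together, in 32,000 bigrams.
rare = ((1, 0), (0, 31999))

for name, table in (("balanced", balanced), ("rare", rare)):
    g2 = log_likelihood_ratio(table)
    x2 = pearson_chi_square(table)
    print(f"{name:<9} log-likelihood = {g2:10.3f}   chi-square = {x2:10.3f}")
```

With ample counts the two values nearly coincide. For the rare pair Pearson's statistic is on the order of the corpus size, while the log-likelihood ratio stays moderate.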
Experimental results • 32,000 words of financial text from Switzerland • Find highly correlated word pairs • Observe top-ranked entries for log-likelihood and chi-square • Chi-square leads to huge scores for rare pairs • 2,682 of 2,693 bigrams violate assumptions