Collocation
Presenter: 이도관
Contents
• Introduction
• Frequency
• Mean & Variance
• Hypothesis Testing
• Mutual Information
Collocation
1. Introduction
• Definition
A sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.
• Characteristics
- Non-compositionality Ex) white wine, white hair, white woman
- Non-substitutability Ex) white wine vs. yellow wine
- Non-modifiability Ex) as poor as church mice vs. as poor as a church mouse
Frequency(1)
2. Frequency
• simplest method for finding collocations: counting word frequencies
• relying on raw frequency alone mostly surfaces function-word pairs:

C(w1, w2)  w1  w2
80871      of  the
58841      in  the
26430      to  the
...
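A minimal sketch of this counting step in Python (the toy token list is illustrative; in practice the tokens would come from a large corpus):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count occurrences of each adjacent word pair (bigram)."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "they knocked at the door of the house near the door".split()
for (w1, w2), c in bigram_counts(tokens).most_common(3):
    print(c, w1, w2)
```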
Frequency(2)
2. Frequency
• combining frequency with part-of-speech tag patterns (see the pattern table below)
patterns
2. Frequency

Tag Pattern  Example
A N          linear function
N N          regression coefficient
A A N        Gaussian random variable
A N N        cumulative distribution function
N A N        mean squared error
N N N        class probability function
N P N        degrees of freedom
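A minimal sketch of such a tag-pattern filter (the (word, tag) input format is an assumption; any POS tagger output can be adapted to it):

```python
# Allowed tag patterns (A = adjective, N = noun, P = preposition).
TAG_PATTERNS = {
    ("A", "N"), ("N", "N"),
    ("A", "A", "N"), ("A", "N", "N"), ("N", "A", "N"),
    ("N", "N", "N"), ("N", "P", "N"),
}

def pattern_candidates(tagged, n):
    """Yield n-grams whose tag sequence matches an allowed pattern."""
    for i in range(len(tagged) - n + 1):
        gram = tagged[i:i + n]
        if tuple(tag for _, tag in gram) in TAG_PATTERNS:
            yield " ".join(word for word, _ in gram)

tagged = [("linear", "A"), ("function", "N"), ("of", "P"), ("degrees", "N")]
print(list(pattern_candidates(tagged, 2)))  # ['linear function']
```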
property
2. Frequency
• Advantages
- simple, and gives relatively good results
- works especially well for fixed phrases
• Disadvantages
- results are not always accurate Ex) a web search finds 'powerful tea' 17 times
- hard to apply when the phrase is not fixed Ex) 'knock' and 'door'
Mean & Variance
3. Mean & Variance
• finding collocations consisting of two words that stand in a more flexible relationship to one another
• They knocked at the door
• A man knocked on the metal front door
• compute the mean distance and the variance between the two words
• low deviation: good candidate for a collocation
Tools
3. Mean & Variance
• relative position
• mean: average offset
• variance
• collocation window: collocations are a local phenomenon
[figure: a window sliding over text, pairing 'knock' with 'door']
example
3. Mean & Variance
• position of 'strong' with respect to 'for': mean d = -1.12, standard deviation s = 2.15
[figure: histogram of offsets from -4 to 4]
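A minimal sketch of these offset statistics (the window size and the toy sentence are assumptions, not the textbook's exact setup):

```python
import statistics

def offsets(tokens, w1, w2, window=4):
    """Collect signed offsets (position of w2 minus position of w1)
    whenever w2 falls inside a fixed window around w1."""
    out = []
    for i, tok in enumerate(tokens):
        if tok == w1:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            out.extend(j - i for j in range(lo, hi) if j != i and tokens[j] == w2)
    return out

tokens = ("she knocked on the door and waited a while "
          "then knocked at the front door").split()
d = offsets(tokens, "knocked", "door")
print(statistics.mean(d), statistics.stdev(d))  # low spread -> good candidate
```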
property
3. Mean & Variance
• Advantages
good for finding collocations that have
- a looser relationship between the words
- intervening material and variable relative position
• Disadvantages
compositional phrases like 'new company' can still be selected as collocation candidates
Hypothesis Testing
4. Hypothesis Test
• goal: avoid selecting many word pairs that co-occur just by chance
• 'new company': just a composition
• H0 (null hypothesis): no association between the words
• p(w1 w2) = p(w1)p(w2)
• methods: t test, test of differences, chi-square test, likelihood ratios
t test
4. Hypothesis Test
• t statistic: t = (x̄ − μ) / √(s²/N)
• tells us how likely it is to get a sample with that mean and variance under H0
• also used in probabilistic parsing and word sense disambiguation
t test example
4. Hypothesis Test
• t test applied to 10 bigrams that each occur 20 times
• significance level 0.005, critical value 2.576
• H0 can be rejected for the top two candidates
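A minimal sketch of the bigram t score (counts follow the 'new companies' worked example in the textbook, ch. 5; the Bernoulli approximation s² ≈ x̄ is the one the book uses):

```python
import math

def t_score(c1, c2, c12, N):
    """t statistic for a bigram under H0: p(w1 w2) = p(w1) p(w2).
    Bigram occurrence is treated as Bernoulli, so s^2 is approximated by x_bar."""
    x_bar = c12 / N            # observed bigram probability
    mu = (c1 / N) * (c2 / N)   # expected probability under independence
    return (x_bar - mu) / math.sqrt(x_bar / N)

# 'new companies': c(new)=15828, c(companies)=4675, c(new companies)=8
print(t_score(15828, 4675, 8, 14307668))  # ~1.0 < 2.576: H0 not rejected
```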
Hypo. test of diff.
4. Hypothesis Test
• goal: find words whose co-occurrence patterns best distinguish between two words
• 'strong' & 'powerful'
• t score: t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
• H0: the average difference is 0 (μ1 − μ2 = 0)
difference test example
4. Hypothesis Test
• powerful & strong
• strong: suggests an intrinsic quality
• powerful: suggests the power to move things
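A minimal sketch of the difference test. With the same Bernoulli approximations as above, the statistic reduces to (c1 − c2) / √(c1 + c2), where c1 and c2 are the counts of the two competing bigrams; the counts below are hypothetical:

```python
import math

def diff_t(c1, c2):
    """t score for the difference test; c1 = count of e.g. 'strong w',
    c2 = count of 'powerful w' for the same word w."""
    return (c1 - c2) / math.sqrt(c1 + c2)

# Hypothetical counts for 'strong support' vs. 'powerful support':
print(diff_t(50, 10))  # large positive t -> w prefers 'strong'
```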
chi-square test
4. Hypothesis Test
• does not assume a normal distribution (the t test does)
• compares expected and observed frequencies
• if the difference is large: H0 (independence) can be rejected
• also used to identify translation pairs in aligned corpora
• chi-square statistic: χ² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ
chi-square example
4. Hypothesis Test
• 'new companies'
• significance level 0.05, critical value 3.841 (1 degree of freedom)
• χ² = 1.55 < 3.841: cannot reject H0
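A minimal sketch of the computation, using the closed form for a 2x2 contingency table (cell counts follow the textbook's 'new companies' table):

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Chi-square statistic for a 2x2 contingency table:
    N * (O11*O22 - O12*O21)^2 / (product of the four marginals)."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# rows: w1 = 'new' / other; columns: w2 = 'companies' / other
print(chi_square_2x2(8, 4667, 15820, 14287181))  # ~1.55 < 3.841
```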
Likelihood ratios
4. Hypothesis Test
• works better on sparse data than the chi-square test
• more interpretable than the chi-square statistic
• Hypotheses
• H1 (independence): p(w2|w1) = p = p(w2|¬w1)
• H2 (dependence): p(w2|w1) = p1 ≠ p2 = p(w2|¬w1)
• p = c2/N, p1 = c12/c1, p2 = (c2 − c12)/(N − c1)
• log likelihood ratio: log λ = log L(H1) − log L(H2) (pp. 173)
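A minimal sketch of the log likelihood ratio, assuming binomial likelihoods as in the textbook's formulation (the binomial coefficients cancel in the ratio, so they are omitted):

```python
import math

def log_binom(k, n, p):
    """log p^k (1-p)^(n-k): binomial log likelihood without the C(n, k) term."""
    out = 0.0
    if k > 0:
        out += k * math.log(p)
    if n - k > 0:
        out += (n - k) * math.log(1 - p)
    return out

def log_likelihood_ratio(c1, c2, c12, N):
    """log lambda = log L(H1) - log L(H2); -2 log lambda is asymptotically chi-square."""
    p, p1, p2 = c2 / N, c12 / c1, (c2 - c12) / (N - c1)
    log_h1 = log_binom(c12, c1, p) + log_binom(c2 - c12, N - c1, p)
    log_h2 = log_binom(c12, c1, p1) + log_binom(c2 - c12, N - c1, p2)
    return log_h1 - log_h2

# Hypothetical counts; a large -2 log lambda means strong evidence for H2.
print(-2 * log_likelihood_ratio(c1=2000, c2=1000, c12=150, N=1_000_000))
```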
Likelihood ratios (2)
4. Hypothesis Test
• table 5.12 (pp. 174)
• 'powerful computers' is 1.3e18 times more likely than its base rate of occurrence would suggest
• relative frequency ratio
• compares relative frequencies between two or more different corpora
• useful for finding subject-specific collocations
• table 5.13 (pp. 176)
• 'Karim Obeid' (1990 vs. 1989): 0.0241
Mutual Information
5. Mutual Info.
• tells us how much information one word gives us about the other:
I(w1, w2) = log₂ [ p(w1 w2) / (p(w1) p(w2)) ]
• ex) table 5.14 (pp. 178)
• I(Ayatollah, Ruhollah) = 18.38
• the information we have about 'Ayatollah' at position i increases by 18.38 bits if we are told that 'Ruhollah' occurs at position i+1
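A minimal sketch of this pointwise mutual information score (counts follow the 'Ayatollah Ruhollah' row of table 5.14):

```python
import math

def pmi(c1, c2, c12, N):
    """Pointwise mutual information in bits:
    log2( p(w1 w2) / (p(w1) p(w2)) )."""
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

# c(Ayatollah)=42, c(Ruhollah)=20, c(Ayatollah Ruhollah)=20
print(pmi(c1=42, c2=20, c12=20, N=14307668))  # ~18.38 bits
```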
Mutual Information(2)
5. Mutual Info.
• good measure of independence
• bad measure of dependence
• perfect dependence: p(w1 w2) = p(w1) = p(w2), so I = log₂ 1/p(w2), which grows as the words get rarer
• perfect independence: p(w1 w2) = p(w1)p(w2), so I = log₂ 1 = 0
Mutual Information(3)
5. Mutual Info.
• Advantages
- gives a rough measure of how much information one word conveys about the other
- simple, while conveying a more intuitive notion of collocation
• Disadvantages
- for sparse data with low frequencies, the scores can come out badly wrong