
Collocation



Presentation Transcript


  1. Collocation Presenter: 이도관

  2. Contents • Introduction • Frequency • Mean & Variance • Hypothesis Testing • Mutual Information

  3. Collocation 1. Introduction • Definition A sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. • Characteristics - Non-compositionality Ex) white wine, white hair, white woman - Non-substitutability Ex) white wine vs. yellow wine - Non-modifiability Ex) as poor as church mice vs. as poor as a church mouse

  4. Frequency(1) 2. Frequency • simplest method for finding collocations • counting word (bigram) frequency • relying on frequency alone mostly surfaces function-word pairs, e.g. (see the counting sketch below):
     C(w1, w2)   w1   w2
     80871       of   the
     58841       in   the
     26430       to   the
     ...         ...  ...
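A minimal sketch of the pure-frequency method, assuming the corpus is already tokenized into a list named tokens; the variable names and the toy sentence are illustrative, not from the slides:

    from collections import Counter

    def bigram_counts(tokens):
        # Count adjacent word pairs; under the pure-frequency method the
        # most frequent pairs are the collocation candidates.
        return Counter(zip(tokens, tokens[1:]))

    # toy usage
    tokens = "he knocked at the door of the old house".split()
    for (w1, w2), c in bigram_counts(tokens).most_common(3):
        print(c, w1, w2)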

  5. Frequency(2) 2. Frequency • combining frequency counts with part-of-speech tag patterns

  6. patterns 2. Frequency
     Tag Pattern   Example
     A N           linear function
     N N           regression coefficient
     A A N         Gaussian random variable
     A N N         cumulative distribution function
     N A N         mean squared error
     N N N         class probability function
     N P N         degrees of freedom
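A sketch of combining frequency with the tag patterns above, assuming the corpus is given as (word, tag) pairs with simplified tags A (adjective), N (noun), P (preposition); only the two-word patterns are shown, and the names are illustrative:

    from collections import Counter

    TWO_WORD_PATTERNS = {("A", "N"), ("N", "N")}  # from the table above

    def pattern_bigrams(tagged):
        # tagged: list of (word, tag) pairs, e.g. ("linear", "A"), ("function", "N")
        counts = Counter()
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if (t1, t2) in TWO_WORD_PATTERNS:
                counts[(w1, w2)] += 1
        return counts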

  7. property 2. Frequency • Advantages - simple, yet gives fairly good results - works especially well for fixed phrases • Disadvantages - results are not always accurate Ex) 'powerful tea' was found 17 times in web pages - hard to apply when the phrase is not fixed Ex) knock and door

  8. Mean & Variance 3. Mean & Variance • finding collocations made of two words that stand in a more flexible relationship to one another • They knocked at the door • A man knocked on the metal front door • mean distance & variance of the offset between the two words • low deviation : good candidate for collocation

  9. Tools 3. Mean & Variance • relative position of the two words (e.g. knock ... door) • mean : average offset • variance / standard deviation of the offsets • collocation window : collocation is a local phenomenon, so only pairs within a small window are counted (see the sketch below)
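A minimal sketch of collecting the offsets within a collocation window and computing their mean and spread; the window size of 5 and the variable names are illustrative assumptions:

    import statistics

    def offsets(tokens, w1, w2, window=5):
        # Signed distances pos(w2) - pos(w1) whenever the two words
        # co-occur within the collocation window.
        ds = []
        for i, a in enumerate(tokens):
            if a != w1:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] == w2:
                    ds.append(j - i)
        return ds

    ds = offsets("a man knocked on the metal front door".split(), "knocked", "door")
    d_mean = statistics.mean(ds)    # average offset (the mean d)
    d_sd = statistics.pstdev(ds)    # spread of the offsets (the deviation s)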

  10. example 3. Mean & Variance • histogram of the position of strong with respect to for (offsets -4 to +4) • d = -1.12, s = 2.15

  11. property 3. Mean & Variance • Advantages - good for finding collocations with a looser relationship between the words - handles intervening material and variable relative position • Disadvantages - compositional phrases like 'new company' can still be selected as collocation candidates

  12. Hypothesis Testing 4. Hypothesis Test • to avoid selecting many word pairs that co-occur just by chance • 'new company' : just a composition • H0 (null hypothesis) : no association between the words, i.e. p(w1 w2) = p(w1) p(w2) • methods : t test, test of differences, chi-square test, likelihood ratios

  13. t test 4. Hypothesis Test • t statistic : tells us how likely it is to get a sample with that mean and variance if H0 holds (see the sketch below) • also used in probabilistic parsing and word sense disambiguation
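The t-statistic formula on the original slide did not survive the transcript; the statistic used here is the standard t = (x_bar - mu) / sqrt(s^2 / N), with the bigram treated as a Bernoulli trial so that s^2 is approximately x_bar. A sketch under that assumption, where c1 = C(w1), c2 = C(w2), c12 = C(w1 w2) and N is the number of tokens (argument names are illustrative):

    import math

    def t_score(c1, c2, c12, N):
        x_bar = c12 / N                # observed bigram probability
        mu = (c1 / N) * (c2 / N)       # expected probability under H0
        # s^2 ~ x_bar because x_bar is small
        return (x_bar - mu) / math.sqrt(x_bar / N)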

  14. t test example 4. Hypothesis Test • t test applied to 10 bigrams that each occur 20 times • significance level : 0.005 (critical value 2.576) • H0 can be rejected only for the 2 candidates above the critical value

  15. Hypo. test of diff. 4. Hypothesis Test • to find words whose co-occurrence patterns best distinguish between two words • 'strong' & 'powerful' • t score over the difference in counts (see the sketch below) • H0 : the average difference is 0
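A sketch of the difference test applied to two competing words v1 and v2 (e.g. strong vs. powerful) and a common collocate w; under the usual approximations the t score reduces to the count-only form below (argument names are illustrative):

    import math

    def diff_t_score(c_v1_w, c_v2_w):
        # t ~ (C(v1 w) - C(v2 w)) / sqrt(C(v1 w) + C(v2 w))
        return (c_v1_w - c_v2_w) / math.sqrt(c_v1_w + c_v2_w)

    # e.g. compare the count of 'strong <w>' with the count of 'powerful <w>'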

  16. difference test example 4. Hypothesis Test • powerful vs. strong • strong : tends to describe an intrinsic quality • powerful : tends to describe the power to move or affect things

  17. chi-square test 4. Hypothesis Test • does not assume a normal distribution (the t test does) • compares expected & observed frequencies • if the difference is large : H0 (independence) can be rejected • also used to identify translation pairs in aligned corpora • chi-square statistic (see the sketch below)
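A sketch of the chi-square statistic computed from the usual 2-by-2 contingency table of bigram counts; the shortcut formula below is equivalent to summing (O - E)^2 / E over the four cells (cell names are illustrative):

    def chi_square_2x2(o11, o12, o21, o22):
        # o11: w1 followed by w2         o12: w1 followed by another word
        # o21: another word then w2      o22: neither w1 nor w2
        N = o11 + o12 + o21 + o22
        num = N * (o11 * o22 - o12 * o21) ** 2
        den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        return num / den

With the textbook's counts for the 'new companies' example (which are not on the slide) this comes out to roughly 1.55, the value quoted on the next slide.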

  18. chi-square example 4. Hypothesis Test • 'new companies' • significance level : 0.05 (critical value 3.841, 1 degree of freedom) • chi-square = 1.55 : cannot reject H0

  19. Likelihood ratios 4. Hypothesis Test • better suited to sparse data than the chi-square test • more interpretable than the chi-square statistic • Hypotheses • H1 : p(w2|w1) = p = p(w2|~w1) (independence) • H2 : p(w2|w1) = p1 != p2 = p(w2|~w1) (dependence) • p = c2/N, p1 = c12/c1, p2 = (c2-c12)/(N-c1) • likelihood ratio (pp. 173, see the sketch below)
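A sketch of the log likelihood ratio built from the two hypotheses above with the binomial likelihood L(k; n, x) = x^k (1-x)^(n-k); it returns -2 log(lambda), the quantity tabulated in the book. Nondegenerate counts (0 < p1, p2 < 1) are assumed, and the argument names are illustrative:

    import math

    def log_L(k, n, x):
        # log of the binomial likelihood x**k * (1 - x)**(n - k)
        return k * math.log(x) + (n - k) * math.log(1 - x)

    def likelihood_ratio(c1, c2, c12, N):
        p, p1, p2 = c2 / N, c12 / c1, (c2 - c12) / (N - c1)
        log_lambda = (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
                      - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))
        return -2 * log_lambda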

  20. Likelihood ratios (2) 4. Hypothesis Test • table 5.12 (pp. 174) • 'powerful computers' is 1.3e18 times more likely under the collocation hypothesis than its base rate of occurrence would suggest • relative frequency ratio : compares relative frequencies between two or more different corpora • useful for subject-specific collocations • Table 5.13 (pp. 176) • Karim Obeid (1990 vs. 1989) : ratio 0.0241
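The relative frequency ratio itself is just a ratio of normalized counts across the two corpora; a one-line sketch (argument names are illustrative):

    def relative_frequency_ratio(count_a, size_a, count_b, size_b):
        # (frequency in corpus A) / (frequency in corpus B)
        return (count_a / size_a) / (count_b / size_b)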

  21. Mutual Information 5. Mutual Info. • tells us how much information one word gives us about the other • ex) table 5.14 (pp. 178) • I(Ayatollah, Ruhollah) = 18.38 • the information we have about Ayatollah occurring at position i increases by 18.38 bits if we are told that Ruhollah occurs at position i+1
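A sketch of the pointwise mutual information used on this slide, I(w1, w2) = log2( P(w1 w2) / (P(w1) P(w2)) ), estimated from unigram and bigram counts (argument names are illustrative):

    import math

    def pmi(c1, c2, c12, N):
        # log2( P(w1 w2) / (P(w1) * P(w2)) ), probabilities estimated as counts / N
        p1, p2, p12 = c1 / N, c2 / N, c12 / N
        return math.log2(p12 / (p1 * p2))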

  22. Mutual Information(2) 5. Mutual Info. • a good measure of independence • a bad measure of dependence • behavior under perfect dependence vs. perfect independence (see the derivation below)
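The two limiting cases on this slide can be spelled out; a short derivation in the chapter's notation:

    % perfect dependence: the words only occur together, so P(w_1 w_2) = P(w_1) = P(w_2)
    I(w_1, w_2) = \log_2 \frac{P(w_1 w_2)}{P(w_1) P(w_2)} = \log_2 \frac{P(w_1)}{P(w_1) P(w_2)} = \log_2 \frac{1}{P(w_2)}

    % perfect independence: P(w_1 w_2) = P(w_1) P(w_2)
    I(w_1, w_2) = \log_2 \frac{P(w_1) P(w_2)}{P(w_1) P(w_2)} = \log_2 1 = 0

So at perfect independence the score is exactly 0, while at perfect dependence the score depends on how rare the words are, which is why it measures independence well but dependence poorly.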

  23. Mutual Information(3) 5. Mutual Info. • Advantages - gives a rough measure of how much information one word conveys about the other - simple, and conveys the underlying notion more clearly • Disadvantages - for sparse data with low frequencies the results can be misleading
