1 / 30

Lecture 22 Word Similarity

Learn about different methods for measuring word similarity, including thesaurus-based and distributional approaches. Understand the limitations and advantages of each method.

Download Presentation

Lecture 22 Word Similarity

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 22Word Similarity CSCE 771 Natural Language Processing • Topics • word similarity • Thesaurus based word similarity • Intro. Distributional based word similarity • Readings: • NLTK book Chapter 2 (wordnet) • Text Chapter 20 April 8, 2013

  2. Overview • Last Time (Programming) • Features in NLTK • NL queries  SQL • NLTK support for Interpretations and Models • Propositional and predicate logic support • Prover9 • Today • Last Lectures slides 25-29 • Features in NLTK • Computational Lexical Semantics • Readings: • Text 19,20 • NLTK Book: Chapter 10 • Next Time: Computational Lexical Semantics II

  3. ACL Anthology - http://aclweb.org/anthology-new/ • .

  4. Figure 20.8 Summary of Thesaurus Similarity measures

  5. Wordnet similarity functions • path_similarity()? • lch_similarity()? • wup_similarity()? • res_similarity()? • jcn_similarity()? • lin_similarity()?

  6. Examples: but first a Pop Quiz • How do you get hypernyms from wordnet?

  7. Example: P(c) values entity abstraction thing (not specified) physical thing idea pacifier#2 living thing non-living thing mammals amphibians reptiles novel pacifier#1 cat dog whale frog snake Color code Blue: wordnet Red: Inspired right minke

  8. Example: counts (made-up) entity abstraction thing (not specified) physical thing idea pacifier#2 living thing non-living thing mammals amphibians reptiles novel pacifier#1 cat dog whale frog snake Color code Blue: wordnet Red: Inspired right minke

  9. Example: P(c) values entity abstraction thing (not specified) physical thing idea pacifier#2 living thing non-living thing mammals amphibians reptiles novel pacifier#1 cat dog whale frog snake Color code Blue: wordnet Red: Inspired right minke

  10. Example: entity abstraction thing (not specified) physical thing idea pacifier#2 living thing non-living thing mammals amphibians reptiles novel pacifier#1 cat dog whale frog snake Color code Blue: wordnet Red: Inspired right minke

  11. simLesk(cat, dog) ??? • (42)S: (n) dog#1 (dog%1:05:00::), domestic dog#1 (domestic_dog%1:05:00::), Canis familiaris#1 (canis_familiaris%1:05:00::) (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds) "the dog barked all night“ • (18)S: (n) cat#1 (cat%1:05:00::), true cat#1 (true_cat%1:05:00::) (feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats) • (1)S: (n) wolf#1 (wolf%1:05:00::) (any of various predatory carnivorous canine mammals of North America and Eurasia that usually hunt in packs)

  12. Problems with thesaurus-based • don’t always have a thesaurus • Even so problems with recall • missing words • phrases missing • thesauri work less well for verbs and adjectives • less hyponymy structure Distributional Word Similarity D. Jurafsky

  13. Distributional models of meaning • vector-space models of meaning • offer higher recall than hand-built thesauri • less precision probably • intuition Distributional Word Similarity D. Jurafsky

  14. Word Similarity Distributional Methods • 20.31 tezguino example (Nida) • A bottle of tezguino is on the table. • Everybody likes tezguino. • tezguino makes you drunk. • We make tezguino out of corn. • What do you know about tezguino?

  15. Term-document matrix • Collection of documents • Identify collection of important terms, discriminatory terms(words) • Matrix: terms X documents – • term frequency tfw,d = • each document a vector in ZV: • Z= integers; N=natural numbers more accurate but perhaps misleading • Example Distributional Word Similarity D. Jurafsky

  16. Example Term-document matrix • Subset of terms = {battle, soldier, fool, clown} Distributional Word Similarity D. Jurafsky

  17. Figure 20.9 Term in context matrix for word similarity (Co-occurrence vectors) • window of 20 words – 10 before 10 after from Brown corpus – words that occur together • non Brown example • The Graduate School requires that all PhD students to be admitted to candidacy at least one year prior to graduation. Passing … • Small table from the Brown 10 before 10 after

  18. Pointwise Mutual Information • td-idf (inverse document frequency) rating instead of raw counts • idf intuition again – • pointwise mutual information (PMI) • Do events x and y occur more than if they were independent? • PMI(X,Y)= log2 P(X,Y) / P(X)P(Y) • PMI between words • Positive PMI between two words (PPMI)

  19. Computing PPMI • Matrix F with W (words) rows and C (contexts) columns • fij is frequency of wi in cj,

  20. Example computing PPMI • Need counts so lets make up some • we need to edit this table to have counts

  21. Associations • PMI-assoc • assocPMI(w, f) = log2 P(w,f) / P(w) P(f) • Lin- assoc - f composed of r (relation) and w’ • assocLIN(w, f) = log2 P(w,f) / P(r|w) P(w’|w) • t-test_assoc (20.41)

  22. Figure 20.10 Co-occurrence vectors • Dependency based parser – special case of shallow parsing • identify from “I discovered dried tangerines.” (20.32) • discover(subject I) I(subject-of discover) • tangerine(obj-of discover) tangerine(adj-mod dried)

  23. Figure 20.11 Objects of the verb drink Hindle 1990

  24. vectors review • dot-product • length • sim-cosine

  25. Figure 20.12 Similarity of Vectors

  26. Fig 20.13 Vector Similarity Summary

  27. Figure 20.14 Hand-built patterns for hypernyms Hearst 1992

  28. Figure 20.15

  29. Figure 20.16

  30. http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf how to do in nltk • NLTK 3.0a1 released : February 2013 • This version adds support for NLTK’s graphical user interfaces. http://nltk.org/nltk3-alpha/ • which similarity function in nltk.corpus.wordnet is Appropriate for find similarity of two words? • I want use a function for word clustering and yarowskyalgorightm for find similar collocation in a large text. • http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Linguistics • http://en.wikipedia.org/wiki/Portal:Linguistics • http://en.wikipedia.org/wiki/Yarowsky_algorithm • http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html

More Related