Learn about different methods for measuring word similarity, including thesaurus-based and distributional approaches. Understand the limitations and advantages of each method.
Lecture 22: Word Similarity (CSCE 771 Natural Language Processing)
• Topics
• Word similarity
• Thesaurus-based word similarity
• Introduction to distributional word similarity
• Readings:
• NLTK book, Chapter 2 (WordNet)
• Text, Chapter 20
April 8, 2013
Overview
• Last time (programming)
• Features in NLTK
• Natural-language queries to SQL
• NLTK support for interpretations and models
• Propositional and predicate logic support
• Prover9
• Today
• Last lecture's slides 25-29
• Features in NLTK
• Computational lexical semantics
• Readings:
• Text, Chapters 19 and 20
• NLTK Book: Chapter 10
• Next time: Computational Lexical Semantics II
WordNet similarity functions
• path_similarity()
• lch_similarity()
• wup_similarity()
• res_similarity()
• jcn_similarity()
• lin_similarity()
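A quick illustration of these calls in NLTK (a minimal sketch; note that the information-content measures res, jcn, and lin additionally require an IC file such as the Brown-corpus one shipped with the wordnet_ic data):

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# Path-based measures: need only the hypernym hierarchy
print(dog.path_similarity(cat))   # inverse of shortest path length
print(dog.lch_similarity(cat))    # Leacock-Chodorow: -log(path / (2 * depth))
print(dog.wup_similarity(cat))    # Wu-Palmer: based on depth of the lowest common subsumer

# Information-content measures: need corpus counts (here, Brown)
brown_ic = wordnet_ic.ic('ic-brown.dat')
print(dog.res_similarity(cat, brown_ic))  # Resnik: IC of the lowest common subsumer
print(dog.jcn_similarity(cat, brown_ic))  # Jiang-Conrath
print(dog.lin_similarity(cat, brown_ic))  # Lin
```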
Examples: but first, a pop quiz
• How do you get hypernyms from WordNet?
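One answer, as a minimal sketch: every Synset object exposes a hypernyms() method, and closure() walks the transitive closure up the hierarchy:

```python
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print(dog.hypernyms())
# [Synset('canine.n.02'), Synset('domestic_animal.n.01')]

# Walk all the way up the hierarchy
for s in dog.closure(lambda s: s.hypernyms()):
    print(s.name())
```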
Examples: made-up counts and the resulting P(c) values
[Figure, repeated across four slides: a WordNet-style hypernym tree with entity at the root; below it, abstraction and physical thing; then idea, living thing, and non-living thing; then mammals, amphibians, reptiles, novel, pacifier#1, and pacifier#2; leaves cat, dog, whale, minke, frog, and snake. Color code: blue nodes are from WordNet, red nodes are invented for the example. One slide annotates the nodes with made-up counts; the next shows the probabilities P(c) derived from them.]
simLesk(cat, dog) = ???
• (42) S: (n) dog#1 (dog%1:05:00::), domestic dog#1 (domestic_dog%1:05:00::), Canis familiaris#1 (canis_familiaris%1:05:00::) (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds) "the dog barked all night"
• (18) S: (n) cat#1 (cat%1:05:00::), true cat#1 (true_cat%1:05:00::) (feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats)
• (1) S: (n) wolf#1 (wolf%1:05:00::) (any of various predatory carnivorous canine mammals of North America and Eurasia that usually hunt in packs)
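The Lesk intuition is to score a sense pair by the overlap between their dictionary glosses. Below is a minimal sketch of that idea in NLTK: a plain bag-of-words overlap between glosses, not the full Extended Lesk phrase-weighted overlap from the text:

```python
from nltk.corpus import stopwords, wordnet as wn

STOP = set(stopwords.words('english'))

def gloss_words(synset):
    """Content words from a synset's definition plus its example sentences."""
    text = synset.definition() + ' ' + ' '.join(synset.examples())
    return {w for w in text.lower().split() if w.isalpha() and w not in STOP}

def lesk_overlap(s1, s2):
    """Count of shared content words between the two glosses."""
    return len(gloss_words(s1) & gloss_words(s2))

print(lesk_overlap(wn.synset('cat.n.01'), wn.synset('dog.n.01')))
```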
Problems with the thesaurus-based approach
• We don't always have a thesaurus for the language or domain
• Even when we do, recall suffers: words are missing, phrases are missing
• Thesauri work less well for verbs and adjectives, which have less hyponymy structure
[Following slides adapted from D. Jurafsky, "Distributional Word Similarity"]
Distributional models of meaning
• Vector-space models of meaning
• Offer higher recall than hand-built thesauri, though probably lower precision
• The intuition: words that occur in similar contexts tend to have similar meanings
Word similarity: distributional methods
• The tezguino example (Nida; text example 20.31):
• A bottle of tezguino is on the table.
• Everybody likes tezguino.
• Tezguino makes you drunk.
• We make tezguino out of corn.
• What do you know about tezguino? From these contexts alone, you can infer it is something like an alcoholic drink made from corn.
Term-document matrix
• Collection of documents
• Identify a collection of important, discriminative terms (words)
• Matrix: terms × documents
• Term frequency tf(w, d) = number of times word w occurs in document d
• Each document becomes a vector in Z^|V| (Z = integers; N = natural numbers would be more accurate, but perhaps misleading)
• Example on the next slide
Example term-document matrix
• Subset of terms = {battle, soldier, fool, clown}
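A small runnable version of the same idea, counting those four terms across the Shakespeare plays that ship with NLTK's Gutenberg corpus (the choice of plays here is mine, not the slide's):

```python
from collections import Counter
from nltk.corpus import gutenberg

terms = ['battle', 'soldier', 'fool', 'clown']
docs = ['shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
        'shakespeare-macbeth.txt']

# One Counter of lowercased word frequencies per document
counts = {d: Counter(w.lower() for w in gutenberg.words(d)) for d in docs}

# Each row is a term's vector over the documents (the matrix, by rows)
for t in terms:
    print(t, [counts[d][t] for d in docs])
```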
Figure 20.9: Term-in-context matrix for word similarity (co-occurrence vectors)
• Window of 20 words (10 before, 10 after) from the Brown corpus: count which words occur together
• A non-Brown example: "The Graduate School requires that all PhD students be admitted to candidacy at least one year prior to graduation. Passing …"
• The figure shows a small table of such ±10-word co-occurrence counts from Brown
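A minimal sketch of building such co-occurrence vectors from Brown with a ±10-word window (restricting to alphabetic tokens is my simplification; the query word 'dog' is just an example):

```python
from collections import Counter, defaultdict
from nltk.corpus import brown

WINDOW = 10
words = [w.lower() for w in brown.words() if w.isalpha()]

# co_occ[w] counts every word seen within WINDOW positions of w
co_occ = defaultdict(Counter)
for i, w in enumerate(words):
    lo, hi = max(0, i - WINDOW), min(len(words), i + WINDOW + 1)
    for j in range(lo, hi):
        if j != i:
            co_occ[w][words[j]] += 1

print(co_occ['dog'].most_common(10))
```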
Pointwise mutual information
• Use tf-idf (term frequency × inverse document frequency) weighting instead of raw counts
• Recall the idf intuition: terms that appear in every document are uninformative
• Pointwise mutual information (PMI): do events x and y co-occur more often than they would if they were independent?
• PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
• Apply PMI between words
• Positive PMI (PPMI) between two words: clip negative values to 0
Computing PPMI
• Matrix F with W rows (words) and C columns (contexts); fij is the frequency of word wi in context cj
• pij = fij / ΣiΣj fij (joint); pi* = Σj fij / ΣiΣj fij (word marginal); p*j = Σi fij / ΣiΣj fij (context marginal)
• PPMIij = max(0, log2 [ pij / (pi* p*j) ])
Example: computing PPMI
• We need counts, so let's make some up
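A minimal sketch of the PPMI computation over a made-up count matrix (the counts below are invented purely for illustration, in the spirit of the slide):

```python
import numpy as np

# Made-up word-by-context counts: 3 words x 4 contexts
F = np.array([[2., 1., 0., 1.],
              [1., 3., 1., 0.],
              [0., 1., 2., 2.]])

P = F / F.sum()                      # joint probabilities p_ij
pw = P.sum(axis=1, keepdims=True)    # word marginals p_i*
pc = P.sum(axis=0, keepdims=True)    # context marginals p_*j

# Take PMI only where p_ij > 0; empty cells map to ratio 1, hence PPMI 0
ratio = np.where(P > 0, P / (pw * pc), 1.0)
ppmi = np.maximum(np.log2(ratio), 0.0)
print(ppmi.round(2))
```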
Associations
• PMI assoc: assocPMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
• Lin assoc, where a feature f is composed of a relation r and a word w':
assocLIN(w, f) = log2 [ P(w, f) / (P(r|w) P(w'|w)) ]
• t-test assoc (Eq. 20.41): assoct-test(w, f) = (P(w, f) − P(w) P(f)) / sqrt(P(f) P(w))
Figure 20.10: Co-occurrence vectors from dependency relations
• Dependency-based parser (a special case of shallow parsing)
• From "I discovered dried tangerines." (example 20.32), identify the features:
discover(subject, I); I(subject-of, discover); tangerine(obj-of, discover); tangerine(adj-mod, dried)
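A minimal sketch of turning such dependency triples into feature vectors (the triples below are written by hand for the example sentence, standing in for real parser output):

```python
from collections import Counter, defaultdict

# Hand-written (word, relation, other-word) triples for the example sentence
triples = [('discover', 'subject', 'I'),
           ('I', 'subject-of', 'discover'),
           ('tangerine', 'obj-of', 'discover'),
           ('tangerine', 'adj-mod', 'dried')]

# Each word's vector counts its (relation, other-word) features
vectors = defaultdict(Counter)
for word, rel, other in triples:
    vectors[word][(rel, other)] += 1

print(dict(vectors['tangerine']))
# {('obj-of', 'discover'): 1, ('adj-mod', 'dried'): 1}
```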
Vectors review
• Dot product: v · w = Σi vi wi
• Length: |v| = sqrt(Σi vi^2)
• Cosine similarity: simcosine(v, w) = (v · w) / (|v| |w|)
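The same three definitions as a minimal sketch in plain Python:

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def length(v):
    return math.sqrt(dot(v, v))

def cosine(v, w):
    """Cosine similarity of two equal-length count vectors."""
    return dot(v, w) / (length(v) * length(w))

print(cosine([1, 2, 0], [2, 3, 1]))  # ~0.96
```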
How to do it in NLTK: http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
• NLTK 3.0a1 released February 2013; this version adds support for Python 3. http://nltk.org/nltk3-alpha/
• Forum question: which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words? "I want to use a function for word clustering and the Yarowsky algorithm to find similar collocations in a large text."
• http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Linguistics
• http://en.wikipedia.org/wiki/Portal:Linguistics
• http://en.wikipedia.org/wiki/Yarowsky_algorithm
• http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html