
Lecture 22 Word Similarity

Lecture 22 Word Similarity. CSCE 771 Natural Language Processing. Topics: word similarity; thesaurus-based word similarity; intro to distributional word similarity. Readings: NLTK book Chapter 2 (WordNet), Text Chapter 20. April 8, 2013.


Presentation Transcript


  1. Lecture 22: Word Similarity • CSCE 771 Natural Language Processing • Topics • word similarity • thesaurus-based word similarity • intro to distributional word similarity • Readings: • NLTK book Chapter 2 (WordNet) • Text Chapter 20 • April 8, 2013

  2. Overview • Last Time (Programming) • Features in NLTK • NL queries → SQL • NLTK support for Interpretations and Models • Propositional and predicate logic support • Prover9 • Today • Last lecture's slides 25-29 • Features in NLTK • Computational Lexical Semantics • Readings: • Text 19, 20 • NLTK Book: Chapter 10 • Next Time: Computational Lexical Semantics II

  3. Figure 20.1 Possible sense tags for bass Chapter 20 – Word Sense disambiguation (WSD) Machine translation Supervised vs unsupervised learning Semantic concordance – corpus with words tagged with sense tags

  4. Feature Extraction for WSD • Feature vectors • Collocation • [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_i, POS_i, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}] • Bag-of-words – unordered set of neighboring words • Represent sets of most frequent content words with a membership vector • [0,0,1,0,0,0,1] – the 3rd and 7th most frequent content words are present • Window of nearby words/features
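
The two feature types on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function names, the padding token, the tagged sentence, and the seven-word vocabulary are all made up for the example.

```python
# Hypothetical helpers; the tagged sentence and vocabulary are toy data.
def collocation_features(tagged, i, window=2):
    """Return [w_{i-2}, POS_{i-2}, ..., w_{i+2}, POS_{i+2}] around position i."""
    feats = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(tagged):
            word, pos = tagged[j]
        else:
            word, pos = "<pad>", "<pad>"  # position falls outside the sentence
        feats.extend([word, pos])
    return feats


def bag_of_words_vector(tagged, i, vocab, window=3):
    """Binary membership vector over vocab for words within +/- window of i."""
    lo, hi = max(0, i - window), min(len(tagged), i + window + 1)
    context = {w.lower()
               for k, (w, _) in enumerate(tagged[lo:hi], start=lo)
               if k != i}  # skip the target word itself
    return [1 if v in context else 0 for v in vocab]


tagged = [("An", "DT"), ("electric", "JJ"), ("guitar", "NN"), ("and", "CC"),
          ("bass", "NN"), ("player", "NN"), ("stand", "VB"), ("off", "RB")]
vocab = ["fishing", "big", "guitar", "sound", "rod", "pound", "player"]

colloc = collocation_features(tagged, 4)   # features for "bass"
bow = bag_of_words_vector(tagged, 4, vocab)
print(colloc)
print(bow)
```

With this toy vocabulary the membership vector comes out as [0, 0, 1, 0, 0, 0, 1], the same shape as the example on the slide: only the 3rd and 7th vocabulary words occur near the target.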

  5. Naïve Bayes Classifier • w – word vector • s – sense tag vector • f – feature vector [w_i, POS_i] for i = 1, …, n • Approximate by frequency counts • But how practical?

  6. Looking for Practical formula • Still not practical

  7. Naïve == Assume Independence • Now practical, but realistic?

  8. Training = count frequencies • Maximum likelihood estimator (20.8)
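
The formulas on slides 5-8 did not survive in this transcript, so the sketch below assumes the usual naive Bayes decision rule: pick the sense s that maximizes P(s) times the product of P(f_j | s), with both probabilities estimated from counts. The add-one smoothing, the function names, and the toy training data are assumptions added for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (feature_list, sense). Training is just counting."""
    sense_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        sense_counts[sense] += 1
        for f in feats:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab


def classify(feats, sense_counts, feat_counts, vocab):
    """Return argmax_s [log P(s) + sum_j log P(f_j | s)]."""
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior P(s)
        # Add-one smoothing (an assumption, not on the slides) avoids log(0).
        denom = sum(feat_counts[sense].values()) + len(vocab)
        for f in feats:
            if f in vocab:  # features never seen in training are ignored
                score += math.log((feat_counts[sense][f] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Toy training data for two senses of "bass"
train = [
    (["fishing", "rod", "caught"], "fish"),
    (["sea", "caught", "pound"], "fish"),
    (["guitar", "player", "band"], "music"),
    (["play", "guitar", "sound"], "music"),
]
model = train_naive_bayes(train)
print(classify(["guitar", "band"], *model))
print(classify(["caught", "sea"], *model))
```

Working in log space keeps the product of many small probabilities from underflowing; the independence assumption is what lets each P(f_j | s) be estimated separately from a small table of counts.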

  9. Decision List Classifiers • Naïve Bayes: hard for humans to examine the decisions and understand them • Decision list classifiers - like a “case” statement • sequence of (test, returned-sense-tag) pairs
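
A decision list can be sketched directly as an ordered list of (test, sense) pairs that is scanned until a test fires. The rules below are simplified, hypothetical stand-ins for real learned rules, and the function name is made up for this example.

```python
def classify_with_decision_list(context_words, rules, default):
    """Return the sense of the first rule whose test succeeds (like a case statement)."""
    for test, sense in rules:
        if test(context_words):
            return sense
    return default  # no rule fired


# Hypothetical rules for "bass", ordered from most to least reliable.
rules = [
    (lambda ctx: "fish" in ctx, "bass-fish"),
    (lambda ctx: "striped" in ctx, "bass-fish"),
    (lambda ctx: "guitar" in ctx, "bass-music"),
    (lambda ctx: "player" in ctx, "bass-music"),
]

print(classify_with_decision_list({"guitar", "player"}, rules, "bass-fish"))
print(classify_with_decision_list({"striped", "guitar"}, rules, "bass-fish"))
```

Because the first matching rule wins, the order of the list carries the model's preferences, and a human can read off exactly which test produced a given decision.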

  10. Figure 20.2 Decision List Classifier Rules

  11. WSD Evaluation, baselines, ceilings • Extrinsic evaluation - evaluating embedded NLP in end-to-end applications (in vivo) • Intrinsic evaluation – evaluating WSD by itself (in vitro) • Sense accuracy • Corpora – SemCor, SENSEVAL, SEMEVAL • Baseline - most frequent sense (WordNet sense 1) • Ceiling – gold standard – human experts with discussion and agreement

  12. Similarity of Words or Senses • generally we will be saying words but giving similarity of word senses • similarity vs relatedness • ex similarity • ex relatedness • Similarity of words • Similarity of phrases/sentence (not usually done)

  13. Figure 20.3 Simplified Lesk Algorithm gloss/sentence overlap

  14. Simplified Lesk example • The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable rate mortgage securities.
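
A minimal sketch of the simplified Lesk idea applied to this sentence: choose the sense whose gloss and examples share the most non-stopwords with the context. The two sense signatures are short paraphrases written for this example (treat the exact wording as an assumption), and the stop list and function names are likewise illustrative.

```python
import re

# Small illustrative stop list; a real system would use a fuller one.
STOP = {"a", "an", "and", "the", "of", "on", "in", "at", "it", "that",
        "he", "they", "my", "up", "into", "can", "will", "because"}


def content_words(text):
    """Lowercased word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP}


def simplified_lesk(target, context, signatures, default):
    """Return (best_sense, overlap): the sense whose signature overlaps the context most."""
    ctx = content_words(context)
    ctx.discard(target)  # the target word itself should not count as overlap
    best, best_overlap = default, 0
    for sense, sig_text in signatures.items():
        overlap = len(ctx & content_words(sig_text))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best, best_overlap


# Assumed gloss-plus-example signatures for two senses of "bank"
signatures = {
    "bank1": "a financial institution that accepts deposits and channels the "
             "money into lending activities. he cashed a check at the bank. "
             "that bank holds the mortgage on my home.",
    "bank2": "sloping land, especially the slope beside a body of water. "
             "they pulled the canoe up on the bank. he sat on the bank of "
             "the river and watched the currents.",
}
sentence = ("The bank can guarantee deposits will eventually cover future "
            "tuition costs because it invests in adjustable rate mortgage "
            "securities.")
result = simplified_lesk("bank", sentence, signatures, default="bank1")
print(result)
```

Here the financial sense wins because "deposits" and "mortgage" appear in both the sentence and that sense's signature, while the river-bank signature shares no content words with the sentence.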

  15. Corpus Lesk • Using equal weights on all words does not seem right • weights applied to overlap words • inverse document frequency • idf_i = log(N_docs / number of docs containing w_i)
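
The idf weighting above is easy to compute from a tokenized collection. The sketch below uses a made-up four-document corpus and hypothetical function names; the natural logarithm is an arbitrary choice, since the slide does not fix the base.

```python
import math


def idf(docs):
    """docs: list of token lists. Returns {word: log(N_docs / docs containing word)}."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for w in set(doc):  # count each word once per document
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def weighted_overlap(context, signature, weights):
    """Sum of idf weights of the words shared by context and signature."""
    return sum(weights.get(w, 0.0) for w in set(context) & set(signature))


docs = [["the", "bank", "holds", "the", "mortgage"],
        ["the", "river", "bank"],
        ["the", "deposits", "grew"],
        ["the", "canoe"]]
weights = idf(docs)
score = weighted_overlap(["the", "bank", "mortgage"],
                         ["the", "mortgage", "home"], weights)
print(weights["the"], weights["bank"], weights["mortgage"])
print(score)
```

A word such as "the" that occurs in every document gets weight log(1) = 0, so it contributes nothing to the overlap, while a rarer word such as "mortgage" counts for more.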

  16. SENSEVAL competitions • http://www.senseval.org/ • Check the Senseval-3 website.

  17. SemEval-2 -Evaluation Exercises on Semantic Evaluation - ACL SigLex event

  18. Task | Name | Area
      #1 | Coreference Resolution in Multiple Languages | Coref
      #2 | Cross-Lingual Lexical Substitution | Cross-Lingual, Lexical Substitu
      #3 | Cross-Lingual Word Sense Disambiguation | Cross-Lingual, Word Senses
      #4 | VP Ellipsis - Detection and Resolution | Ellipsis
      #5 | Automatic Keyphrase Extraction from Scientific Articles |
      #6 | Classification of Semantic Relations between MeSH Entities in Swedish Medical Texts |
      #7 | Argument Selection and Coercion | Metonymy
      #8 | Multi-Way Classification of Semantic Relations Between Pairs of Nominals |
      #9 | Noun Compound Interpretation Using Paraphrasing Verbs | Noun compounds
      #10 | Linking Events and their Participants in Discourse | Semantic Role Labeling, Information Extraction
      #11 | Event Detection in Chinese News Sentences | Semantic Role Labeling, Word Senses
      #12 | Parser Training and Evaluation using Textual Entailment |
      #13 | TempEval 2 | Time Expressions
      #14 | Word Sense Induction |
      #15 | Infrequent Sense Identification for Mandarin Text to Speech Systems |
      #16 | Japanese WSD | Word Senses
      #17 | All-words Word Sense Disambiguation on a Specific Domain (WSD-domain) |
      #18 | Disambiguating Sentiment Ambiguous Adjectives | Word Senses, Sentim

  19. 20.4.2 Selectional Restrictions and Preferences • verb eat → theme = object has feature Food+ • Katz and Fodor (1963) used this idea to rule out senses that were not consistent • WSD of dish • (20.12) “In our house, everybody has a career and none of them includes washing dishes,” he says. • (20.13) In her tiny kitchen, Ms. Chen works efficiently, stir-frying several simple dishes, including … • Verbs: wash, stir-fry • wash → washable+ • stir-fry → edible+

  20. Resnik’s model of Selectional Association • How much does a predicate tell you about the semantic class of its arguments? • eat → a lot (its direct objects tend to be edible) • was, is, to be … → very little • The selectional preference strength of a verb is indicated by two distributions: • P(c): how likely the direct object is to be in class c • P(c|v): the distribution of expected semantic classes for the particular verb v • The greater the difference between these two distributions, the more information the verb provides about its arguments

  21. Relative entropy – Kullback-Leibler divergence • Given two distributions P and Q • D(P || Q) = ∑_x P(x) log (P(x)/Q(x)) (eq 20.16) • Selectional preference strength • S_R(v) = D(P(c|v) || P(c)) = ∑_c P(c|v) log (P(c|v)/P(c))
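
A small numeric sketch of the selectional preference strength as a KL divergence. The class distributions are invented for illustration, and log base 2 is an arbitrary choice. The selectional association function uses the usual form from Resnik's model (each class's term of the sum divided by the total strength); that formula is not preserved in this transcript, so check it against the text.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log2(P(x)/Q(x)); terms with P(x) = 0 contribute 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


def selectional_association(p_c_given_v, p_c, c):
    """Share of the verb's preference strength contributed by class c (assumed form)."""
    strength = kl_divergence(p_c_given_v, p_c)
    return p_c_given_v[c] * math.log2(p_c_given_v[c] / p_c[c]) / strength


# Made-up distributions over three semantic classes of direct objects
p_c = {"food": 0.2, "artifact": 0.5, "person": 0.3}
p_c_given_eat = {"food": 0.9, "artifact": 0.05, "person": 0.05}
p_c_given_be = {"food": 0.2, "artifact": 0.5, "person": 0.3}

sr_eat = kl_divergence(p_c_given_eat, p_c)
sr_be = kl_divergence(p_c_given_be, p_c)
assoc_food = selectional_association(p_c_given_eat, p_c, "food")
print(sr_eat, sr_be, assoc_food)
```

In this toy setup "eat" has a large preference strength because its objects are concentrated in one class, while "be" has strength 0 because its object distribution equals the prior and so tells us nothing about the class of its argument.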

  22. Resnik’s model of Selectional Association

  23. High and Low Selectional Associations – Resnik 1996 • Selectional Associations

  24. 20.5 Minimally Supervised WSD: Bootstrapping • “supervised and dictionary methods require large hand-built resources” • bootstrapping or semi-supervised learning or minimally supervised learning to address the no-data problem • Start with seed set and grow it.

  25. Yarowsky algorithm preliminaries • Idea of bootstrapping: “create a larger training set from a small set of seeds” • Heuristics: senses of “bass” • one sense per collocation • within a sentence, both senses of bass are not used • one sense per discourse • Yarowsky showed that, of 37,232 examples of bass occurring in a discourse, there was only one sense per discourse

  26. Yarowsky algorithm • Goal: learn a word-sense classifier for a word • Input: Λ_0, a small seed set of labeled instances of each sense • Train a classifier on the seed set Λ_0 • Label the unlabeled corpus V_0 with the classifier • Select the examples delta in V_0 that you are “most confident in” • Λ_1 = Λ_0 + delta • Repeat
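
The loop above can be sketched with a deliberately crude classifier. Everything here is an assumption made for illustration: the "classifier" just counts context words seen with each sense, "confidence" is the margin between the best and second-best score, and the seed and unlabeled contexts for "plant" are made up. The slides' version would train a decision list at each round instead.

```python
from collections import Counter


def train(labeled):
    """labeled: list of (context_words, sense). Returns per-sense word counts."""
    counts = {}
    for words, sense in labeled:
        counts.setdefault(sense, Counter()).update(words)
    return counts


def predict(words, counts):
    """Return (best_sense, margin) where margin = best score - runner-up score."""
    scored = sorted(((sum(c[w] for w in words), s) for s, c in counts.items()),
                    reverse=True)
    best_score, best_sense = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0
    return best_sense, best_score - runner_up


def bootstrap(seeds, unlabeled, threshold=1, max_rounds=10):
    """Grow the labeled set from the seeds; return (labeled, still_unlabeled)."""
    labeled, pool = list(seeds), list(unlabeled)
    for _ in range(max_rounds):
        counts = train(labeled)               # train on the current labeled set
        confident, rest = [], []
        for words in pool:                    # label the unlabeled pool
            sense, margin = predict(words, counts)
            (confident if margin >= threshold else rest).append((words, sense))
        if not confident:                     # nothing new to add: stop
            break
        labeled.extend(confident)             # new labeled set = old set + delta
        pool = [words for words, _ in rest]
    return labeled, pool


seeds = [(["plant", "life"], "flora"), (["manufacturing", "plant"], "factory")]
unlabeled = [["animal", "life", "plant", "species"],
             ["manufacturing", "plant", "equipment", "automated"],
             ["species", "plant", "growth"],
             ["equipment", "plant", "workers"],
             ["bass", "guitar"]]
labeled, remaining = bootstrap(seeds, unlabeled)
for words, sense in labeled:
    print(sense, words)
print("unlabeled:", remaining)
```

In this run the first round labels only the two contexts that contain a seed word; the words they bring in ("species", "equipment") then let the second round label two more, which is the growth effect the slide describes. The context with no evidence stays unlabeled.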

  27. Figure 20.4 Two senses of plant • Plant 1 – manufacturing plant … • plant 2 – flora, plant life

  28. 2009 Survey of WSD by Navigli • iroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf

  29. Figure 20.5 Samples of bass-sentences from WSJ (Wall Street Journal)

  30. Word Similarity: Thesaurus Based Methods Figure 20.6 Path Distances in hierarchy • Wordnet of course (pruned)

  31. Figure 20.6 Path Based Similarity • sim_path(c1, c2) = 1 / pathlen(c1, c2) • pathlen(c1, c2) = number of edges on the shortest path between c1 and c2, plus 1
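
Path-based similarity can be tried without WordNet on a small hand-built tree. The hierarchy below is a toy one loosely modeled on the money example of Figure 20.6; its exact shape and the function names are assumptions for this sketch.

```python
# Toy is-a hierarchy: child -> parent (assumed, loosely after Figure 20.6)
PARENT = {
    "nickel": "coin", "dime": "coin", "coin": "coinage",
    "coinage": "currency", "currency": "medium of exchange",
    "money": "medium of exchange", "fund": "money", "budget": "fund",
    "medium of exchange": "standard", "scale": "standard",
    "Richter scale": "scale",
}


def ancestors(c):
    """Return [c, parent(c), ..., root]."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lcs(c1, c2):
    """Lowest common subsumer: the first ancestor of c1 that is also above c2."""
    above_c2 = set(ancestors(c2))
    for node in ancestors(c1):
        if node in above_c2:
            return node
    return None  # not reached when the hierarchy has a single root


def pathlen(c1, c2):
    """Number of edges on the path through the LCS, plus 1."""
    common = lcs(c1, c2)
    edges = ancestors(c1).index(common) + ancestors(c2).index(common)
    return edges + 1


def sim_path(c1, c2):
    return 1.0 / pathlen(c1, c2)


print(sim_path("nickel", "dime"))    # siblings under "coin"
print(sim_path("nickel", "money"))   # meet at "medium of exchange"
print(sim_path("nickel", "nickel"))  # identical concepts
```

The "plus 1" in pathlen keeps the similarity of a concept with itself at 1 rather than dividing by zero, and it is consistent with the NLTK output on the later slide, where 0.25 corresponds to a shortest path of three edges.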

  32. WN hierarchy
      # WordNet examples from the NLTK book (Python 3 print calls)
      import nltk
      from nltk.corpus import wordnet as wn
      right = wn.synset('right_whale.n.01')
      orca = wn.synset('orca.n.01')
      minke = wn.synset('minke_whale.n.01')
      tortoise = wn.synset('tortoise.n.01')
      novel = wn.synset('novel.n.01')
      # Lowest common subsumer (LCS) of right whale with each other synset
      print("LCS(right, minke)=", right.lowest_common_hypernyms(minke))
      print("LCS(right, orca)=", right.lowest_common_hypernyms(orca))
      print("LCS(right, tortoise)=", right.lowest_common_hypernyms(tortoise))
      print("LCS(right, novel)=", right.lowest_common_hypernyms(novel))

  33. # path similarity
      print("Path similarities")
      print(right.path_similarity(minke))
      print(right.path_similarity(orca))
      print(right.path_similarity(tortoise))
      print(right.path_similarity(novel))
      Output:
      Path similarities
      0.25
      0.166666666667
      0.0769230769231
      0.0434782608696

  34. WordNet in NLTK • http://nltk.org/_modules/nltk/corpus/reader/wordnet.html • http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html (partially in Chap 02 NLTK book; but different version) • http://grey.colorado.edu/mingus/index.php/Objrec_Wordnet.py • code for similarity – runs for a while; lots of results

  35. https://groups.google.com/forum Hi, I was wondering if it is possible for me to use NLTK + wordnet to group (nouns) words together via similar meanings? Assuming I have 2000 words or topics. Is it possible for me to group them together according to similar meanings using NLTK? So that at the end of the day I would have different groups of words that are similar in meaning? Can that be done in NLTK? and possibly be able to detect salient patterns emerging? (trend in topics etc...). Is there a further need for a word classifier based on the CMU BOW toolkit to classify words to get it into categories? or the above group would be good enough? Is there a need to classify words further? How would one classify words in NLTK effectively? Really hope you can enlighten me? FM

  36. Response from Steven Bird
      2010/3/5 Republic <ngfo...@gmail.com>:
      > Assuming I have 2000 words or topics. Is it possible for me to group
      > them together according to similar meanings using NLTK?
      You could compute WordNet similarity (pairwise), so that each word/topic is represented as a vector of distances, which could then be discretized, so each vector would have a form like this: [0,2,3,1,0,0,2,1,3,...]. These vectors could then be clustered using one of the methods in the NLTK cluster package.
      > So that at the end of the day I would have different groups of words
      > that are similar in meaning? Can that be done in NLTK? and possibly be
      > able to detect salient patterns emerging? (trend in topics etc...).
      This suggests a temporal dimension, which might mean recomputing the clusters as more words or topics come in.
      It might help to read the NLTK book sections on WordNet and on text classification, and also some of the other cited material.
      -Steven Bird

  37. More general? Stack Overflow
      import nltk
      from nltk.corpus import wordnet as wn
      waiter = wn.synset('waiter.n.01')
      employee = wn.synset('employee.n.01')
      # Collect the lemma names of every synset below waiter in the hyponym hierarchy.
      # lemma_names is a method in NLTK 3 (it was an attribute in NLTK 2).
      all_hyponyms_of_waiter = list(set([w.replace("_", " ")
                                         for s in waiter.closure(lambda s: s.hyponyms())
                                         for w in s.lemma_names()]))
      all_hyponyms_of_employee = …
      if 'waiter' in all_hyponyms_of_employee:
          print('employee more general than waiter')
      elif 'employee' in all_hyponyms_of_waiter:
          print('waiter more general than employee')
      else:
          …
      http://stackoverflow.com/questions/...-semantic-hierarchies-relations-in--nltk

  38. help(wn) • … • | res_similarity(self, synset1, synset2, ic, verbose=False) • | Resnik Similarity: • | Return a score denoting how similar two word senses are, based on the • | Information Content (IC) of the Least Common Subsumer (most specific • | ancestor node). • http://grey.colorado.edu/mingus/index.php/Objrec_Wordnet.py

  39. Similarity based on a hierarchy (=ontology)

  40. Information Content word similarity

  41. Resnik Similarity / WordNet • sim_resnik(c1, c2) = −log P(LCS(c1, c2)) • WordNet • res_similarity(self, synset1, synset2, ic, verbose=False) • | Resnik Similarity: • | Return a score denoting how similar two word senses are, based on the • | Information Content (IC) of the Least Common Subsumer (most specific • | ancestor node).
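
The Resnik formula can be illustrated on a toy fragment of a hierarchy in which each concept carries a probability P(c). The hierarchy shape and the probability values below are illustrative assumptions in the spirit of the hill/coast figure, not data read from WordNet, and the natural logarithm is an arbitrary choice of base.

```python
import math

# Toy hierarchy (child -> parent) with assumed concept probabilities P(c)
PARENT = {
    "hill": "natural elevation", "coast": "shore",
    "natural elevation": "geological formation",
    "shore": "geological formation",
    "geological formation": "natural object",
    "natural object": "inanimate object",
    "inanimate object": "entity",
}
P = {
    "entity": 0.395, "inanimate object": 0.167, "natural object": 0.0163,
    "geological formation": 0.00176, "natural elevation": 0.000113,
    "shore": 0.0000836, "hill": 0.0000189, "coast": 0.0000216,
}


def ancestors(c):
    """Return [c, parent(c), ..., root]."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lcs(c1, c2):
    """Lowest common subsumer: most specific concept above both c1 and c2."""
    above_c2 = set(ancestors(c2))
    return next(node for node in ancestors(c1) if node in above_c2)


def information_content(c):
    """IC(c) = -log P(c): rarer (more specific) concepts carry more information."""
    return -math.log(P[c])


def sim_resnik(c1, c2):
    """Information content of the lowest common subsumer."""
    return information_content(lcs(c1, c2))


print(lcs("hill", "coast"))
print(sim_resnik("hill", "coast"))
print(sim_resnik("hill", "entity"))
```

Two concepts that only meet near the root share a very probable, low-information ancestor and so score low; concepts that meet at a specific, rare ancestor score high.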

  42. Fig 20.7 WordNet with Lin P(c) values • Change for Resnik!!

  43. Lin variation 1998 • Commonality: IC(common(A,B)) • Difference: IC(description(A,B)) − IC(common(A,B)) • sim_Lin(A,B) = common(A,B) / description(A,B)
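
For concepts in a hierarchy, Lin's ratio is usually instantiated as 2 log P(LCS(c1, c2)) / (log P(c1) + log P(c2)); that specific form is not on this slide, so treat it as an assumption to check against the text. The probabilities passed in below are illustrative values for hill, coast, and their common subsumer.

```python
import math


def sim_lin(p_lcs, p_c1, p_c2):
    """Lin similarity from P(LCS), P(c1), P(c2): shared information / total information."""
    return 2 * math.log(p_lcs) / (math.log(p_c1) + math.log(p_c2))


# Illustrative probabilities: P(geological formation), P(hill), P(coast)
hill_coast = sim_lin(0.00176, 0.0000189, 0.0000216)
same = sim_lin(0.0000189, 0.0000189, 0.0000189)  # a concept compared with itself
print(round(hill_coast, 2), same)
```

Unlike the Resnik score, which is unbounded, this ratio is normalized: it equals 1 when the two concepts coincide and shrinks toward 0 as their common subsumer becomes more general.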

  44. Fig 20.7 Wordnet with Lin P(c) values

  45. Extended Lesk • based on • glosses • glosses of hypernyms, hyponyms • Example • drawing paper: paper that is specially prepared for use in drafting • decal: the art of transferring designs from specially prepared paper to a wood, glass or metal surface. • Lesk score = sum of squares of lengths of common phrases • Example: 1 + 2^2 = 5 (“paper” has length 1, “specially prepared” has length 2)
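
The scoring rule in the example can be sketched as: repeatedly find the longest phrase the two glosses share, add its squared length, and block it from being matched again. This is a simplified sketch with hypothetical function names; in particular it does not filter out overlaps made only of function words, which a fuller implementation would likely do.

```python
import re


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def longest_common_phrase(a, b):
    """Return (length, start_in_a, start_in_b) of the longest shared word sequence."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # None marks an already-used phrase and never matches anything
            if a[i - 1] is not None and a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def overlap_score(gloss1, gloss2):
    """Sum of squared lengths of the common phrases, longest first."""
    a, b = tokens(gloss1), tokens(gloss2)
    score = 0
    while True:
        n, i, j = longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        a[i:i + n] = [None]  # replace the matched phrase with a barrier
        b[j:j + n] = [None]


drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood, glass or metal surface")
score = overlap_score(drawing_paper, decal)
print(score)
```

Squaring the phrase length rewards a single two-word match ("specially prepared", 4 points) more than two unrelated one-word matches (2 points), on the view that longer shared phrases are rarer and more informative.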

  46. Figure 20.8 Summary of Thesaurus Similarity measures

  47. Wordnet similarity functions • path_similarity()? • lch_similarity()? • wup_similarity()? • res_similarity()? • jcn_similarity()? • lin_similarity()?

  48. Problems with thesaurus-based methods • We don’t always have a thesaurus • Even when we do, there are problems with recall • missing words • missing phrases • thesauri work less well for verbs and adjectives • less hyponymy structure • (Distributional Word Similarity, D. Jurafsky)

  49. Distributional models of meaning • vector-space models of meaning • offer higher recall than hand-built thesauri • probably lower precision

  50. Word Similarity Distributional Methods • 20.31 tezguino example • A bottle of tezguino is on the table. • Everybody likes tezguino. • tezguino makes you drunk. • We make tezguino out of corn. • What do you know about tezguino?
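
The intuition behind the example is that the words around an unknown word already say a lot about it. As a minimal sketch (the function name is hypothetical and no stopword filtering is done), one can simply count the words that co-occur with the target in the same sentence:

```python
import re
from collections import Counter


def context_counts(sentences, target):
    """Count the words that appear in the same sentence as the target word."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        if target in words:
            counts.update(w for w in words if w != target)
    return counts


sentences = [
    "A bottle of tezguino is on the table.",
    "Everybody likes tezguino.",
    "tezguino makes you drunk.",
    "We make tezguino out of corn.",
]
counts = context_counts(sentences, "tezguino")
print(counts.most_common(5))
```

Context words such as "bottle", "drunk", and "corn" are the kind of evidence a reader uses to guess what the word means, and counts like these are the raw material for the vector-space models mentioned on the previous slide.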
