Lecture 24: Distributional Word Similarity II CSCE 771 Natural Language Processing • Topics • Distributional based word similarity • example PMI • context = syntactic dependencies • Readings: • NLTK book Chapter 2 (wordnet) • Text Chapter 20 • April 15, 2013
Overview • Last Time • Finished up thesaurus-based similarity • … • Distributional based word similarity • Today • Last lecture's slides 21- • Distributional based word similarity II • syntax-based contexts • Readings: • Text 19, 20 • NLTK Book: Chapter 10 • Next Time: Computational Lexical Semantics II
Pointwise Mutual Information (PMI) • Mutual information (Church and Hanks 1989) (eq 20.36): I(X; Y) = Σx Σy P(x, y) log2 [ P(x, y) / (P(x) P(y)) ] • Pointwise mutual information (Fano 1961) (eq 20.37): PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ] • assoc-PMI (eq 20.38): assoc-PMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
Computing PPMI • Matrix F with W (words) rows and C (contexts) columns • fij is the frequency of wi in cj • pij = fij / Σi Σj fij, pi* = Σj pij, p*j = Σi pij • PPMIij = max(0, log2 [ pij / (pi* p*j) ])
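The PPMI definition above can be sketched directly in code. This is a minimal pure-Python version (Python 3, unlike the slides' Python 2 snippets); the toy count matrix is invented for illustration and is not the textbook's table.

```python
from math import log2

def ppmi(F):
    """PPMI from a word-by-context count matrix F (list of rows).

    p_ij = f_ij / N;  PPMI_ij = max(0, log2(p_ij / (p_i* * p_*j)))
    """
    N = float(sum(sum(row) for row in F))
    pw = [sum(row) / N for row in F]                       # row marginals p(w_i)
    pc = [sum(F[i][j] for i in range(len(F))) / N          # column marginals p(c_j)
          for j in range(len(F[0]))]
    return [[max(0.0, log2((f / N) / (pw[i] * pc[j]))) if f > 0 else 0.0
             for j, f in enumerate(row)]
            for i, row in enumerate(F)]

# Toy counts (made up): 2 words x 2 contexts
M = ppmi([[4, 0],
          [1, 3]])
print(M)
```

Note that zero counts and negative PMI values are both clipped to 0, which is exactly the "positive" part of PPMI.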
Example computing PPMI • p(w = information, c = data) = • p(w = information) = • p(c = data) = • (slide credit: Word Similarity: Distributional Similarity I, Jurafsky & Manning, NLP)
PMI: More data trumps smarter algorithms • "More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis" • Indiana University, 2009 • http://www.indiana.edu/~clcl/Papers/BSC901.pdf • "we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models."
Figure 20.10 Co-occurrence vectors based on syntactic dependencies • Dependency-based parser – special case of shallow parsing • Relations identified from "I discovered dried tangerines." (20.32): • discover(subject I) • I(subject-of discover) • tangerine(obj-of discover) • tangerine(adj-mod dried)
Defining context using syntactic info • dependency parsing • chunking • discover(subject I) – S → NP VP • I(subject-of discover) • tangerine(obj-of discover) – VP → verb NP • tangerine(adj-mod dried) – NP → det ? ADJ N
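Once a dependency parser has produced (head, relation, dependent) triples, turning them into the syntactic context vectors of Fig. 20.10 is just counting in both directions. A small sketch (Python 3); the triples below are written by hand for the slide's example sentence, where a real system would get them from a parser:

```python
from collections import defaultdict

# Hand-written dependency triples for "I discovered dried tangerines."
# (a real pipeline would produce these with a dependency parser)
triples = [("discover", "subject", "I"),
           ("discover", "obj", "tangerine"),
           ("tangerine", "adj-mod", "dried")]

# Each word's context vector counts (relation, other-word) features,
# with the inverse "-of" relation for the dependent side, as in Fig. 20.10.
vectors = defaultdict(lambda: defaultdict(int))
for head, rel, dep in triples:
    vectors[head][(rel, dep)] += 1            # discover: (subject, I)
    vectors[dep][(rel + "-of", head)] += 1    # I: (subject-of, discover)

print(dict(vectors["tangerine"]))
# tangerine gets both (obj-of, discover) and (adj-mod, dried)
```

These sparse feature counts are exactly what the PPMI weighting from the earlier slide would then be applied to.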
Figure 20.11 Objects of the verb drink (Hindle 1990, ACL) • By raw frequency: it, much, and anything are more frequent objects than wine • By PMI-assoc: wine ranks higher ("more drinkable") • http://acl.ldc.upenn.edu/P/P90/P90-1034.pdf
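Hindle's point, that PMI-assoc re-ranks objects that raw frequency gets wrong, can be shown with a few lines of Python 3. All counts below are invented for illustration; they are not Hindle's 1990 figures:

```python
from math import log2

# Hypothetical counts: how often each noun is the object of "drink",
# and each noun's overall corpus frequency (all numbers made up).
obj_of_drink = {"it": 30, "wine": 9, "water": 8, "anything": 15}
noun_freq    = {"it": 50000, "wine": 200, "water": 600, "anything": 9000}
drink_freq, N = 400, 1000000          # hypothetical verb count / corpus size

def pmi_assoc(noun):
    # PMI(drink, noun) = log2( P(drink, noun) / (P(drink) P(noun)) )
    p_joint = obj_of_drink[noun] / N
    return log2(p_joint / ((drink_freq / N) * (noun_freq[noun] / N)))

ranked = sorted(obj_of_drink, key=pmi_assoc, reverse=True)
print(ranked)   # raw frequency favors "it"; PMI puts "wine" first
```

High-frequency pronouns like "it" co-occur with everything, so their joint probability is not much higher than chance; "wine" is rare overall but strongly attracted to "drink", which PMI rewards.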
Vectors review • dot product: v · w = Σi vi wi • length: |v| = sqrt(Σi vi^2) • sim-cosine: simcos(v, w) = (v · w) / (|v| |w|)
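The three definitions above compose directly. A minimal sketch in Python 3:

```python
from math import sqrt

def dot(a, b):
    # v . w = sum_i v_i * w_i
    return sum(x * y for x, y in zip(a, b))

def length(a):
    # |v| = sqrt(sum_i v_i^2)
    return sqrt(dot(a, a))

def sim_cosine(a, b):
    # cos(v, w) = (v . w) / (|v| |w|)
    return dot(a, b) / (length(a) * length(b))

print(sim_cosine([1.0, 2.0, 2.0], [2.0, 4.0, 4.0]))   # parallel -> 1.0
print(sim_cosine([1.0, 0.0], [0.0, 1.0]))             # orthogonal -> 0.0
```

Because cosine normalizes by length, it measures the angle between the context vectors rather than their raw magnitudes, so frequent and rare words are comparable.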
Figure 20.14 Hand-built patterns for hypernyms (Hearst 1992) • Finding hypernyms (IS-A links) • (20.58) One example of red algae is Gelidium. • one example of *** is a *** • 500,000 hits on Google • Semantic drift in bootstrapping
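A Hearst pattern is essentially a lexical template, so the example above can be sketched as a regular expression (Python 3). The crude `\w+`-based noun matching is an assumption for illustration; real systems match over chunked or parsed noun phrases:

```python
import re

# Regex sketch of the Hearst pattern "one example of X is Y",
# capturing hypernym (X) and hyponym (Y) candidates.
# Matching nouns with bare \w+ groups is deliberately crude.
pattern = re.compile(r"[Oo]ne example of (\w+(?: \w+)?) is (\w+)")

sent = "One example of red algae is Gelidium."
m = pattern.search(sent)
if m:
    hypernym, hyponym = m.group(1), m.group(2)
    print(hyponym, "IS-A", hypernym)   # Gelidium IS-A red algae
```

The "semantic drift" bullet refers to what happens when such patterns are bootstrapped: each noisy extraction seeds new patterns, and errors compound.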
Hyponym Learning Algorithm (Snow 2005) • Rely on WordNet to learn large numbers of weak hyponym patterns • Snow's algorithm: • Collect all pairs of WordNet noun concepts ci IS-A cj • For each pair, collect all sentences containing both • Parse the sentences and automatically extract every possible Hearst-style syntactic pattern from the parse trees • Use the large set of patterns as features in a logistic regression classifier • Given a new pair, extract features and use the classifier to decide whether the pair is a hypernym/hyponym pair • New patterns learned • NPH like NP • NP is a NPH • NPH called NP • NP, a NPH (appositive)
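The classification step of Snow's algorithm can be sketched in a few lines (Python 3): each noun pair becomes a vector of pattern-match counts, and a logistic regression classifier is trained on them. The training pairs, feature counts, and the hand-rolled gradient-ascent trainer below are all invented for illustration; Snow et al. used tens of thousands of dependency-path features:

```python
from math import exp

# Features per pair: counts of three illustrative Hearst-style patterns,
# e.g. ["NPH such as NP", "NP is a NPH", "NPH called NP"].
# Labels: 1 = hypernym pair, 0 = not. All data is made up.
train = [([3, 1, 0], 1),   # e.g. (fish, shark)
         ([0, 2, 1], 1),   # e.g. (tree, oak)
         ([1, 0, 0], 1),
         ([0, 0, 0], 0),   # unrelated pair, matched no patterns
         ([0, 0, 0], 0)]

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# Train tiny logistic regression by stochastic gradient ascent
w, b, lr = [0.0, 0.0, 0.0], 0.0, 0.5
for _ in range(1000):
    for x, y in train:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for i in range(len(w)):
            w[i] += lr * (y - p) * x[i]
        b += lr * (y - p)

def predict(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5

# A pair that matched a pattern is accepted; a zero-feature pair is not
print(predict([1, 0, 0]), predict([0, 0, 0]))
```

The key idea survives even at this toy scale: no single pattern is trusted, but the classifier learns to weight many weak patterns jointly.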
Vector Similarities from Lin 1998 • hope (N): • optimism 0.141, chance 0.137, expectation 0.137, prospect 0.126, dream 0.119, desire 0.118, fear 0.116, effort 0.111, confidence 0.109, promise 0.108 • hope (V): • would like 0.158, wish 0.140, … • brief (N): • legal brief 0.256, affidavit 0.191, … • brief (A): • lengthy 0.256, hour-long 0.191, short 0.174, extended 0.163, … • full lists on page 667
Supersenses • 26 broad-category "lexicographer class" WordNet labels
wn01.py • # Wordnet examples from nltk.googlecode.com • import nltk • from nltk.corpus import wordnet as wn • motorcar = wn.synset('car.n.01') • types_of_motorcar = motorcar.hyponyms() • print types_of_motorcar[26] • print wn.synset('ambulance.n.01') • print sorted([lemma.name for synset in types_of_motorcar for lemma in synset.lemmas]) • http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
wn01.py continued • print "wn.synsets('dog', pos=wn.VERB)= ", wn.synsets('dog', pos=wn.VERB) • print wn.synset('dog.n.01') • ### Synset('dog.n.01') • print wn.synset('dog.n.01').definition • ###'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds' • print wn.synset('dog.n.01').examples • ### ['the dog barked all night']
wn01.py continued • print wn.synset('dog.n.01').lemmas • ###[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')] • print [lemma.name for lemma in wn.synset('dog.n.01').lemmas] • ### ['dog', 'domestic_dog', 'Canis_familiaris'] • print wn.lemma('dog.n.01.dog').synset
Section 2 synsets, hypernyms, hyponyms • # Section 2 Synsets, hypernyms, hyponyms • import nltk • from nltk.corpus import wordnet as wn • dog = wn.synset('dog.n.01') • print "dog hypernyms=", dog.hypernyms() • ### dog hypernyms= [Synset('domestic_animal.n.01'), Synset('canine.n.02')] • print "dog hyponyms=", dog.hyponyms() • print "dog holonyms=", dog.member_holonyms() • print "dog.root_hypernyms=", dog.root_hypernyms() • good = wn.synset('good.a.01') • ### print "good.antonyms()=", good.antonyms() # error: antonyms live on Lemmas, not Synsets • print "good.lemmas[0].antonyms()=", good.lemmas[0].antonyms()
wn03-Lemmas.py • ### Section 3 Lemmas • eat = wn.lemma('eat.v.03.eat') • print eat • print eat.key • print eat.count() • print wn.lemma_from_key(eat.key) • print wn.lemma_from_key(eat.key).synset • print wn.lemma_from_key( 'feebleminded%5:00:00:retarded:00') • for lemma in wn.synset('eat.v.03').lemmas: • print lemma, lemma.count() • for lemma in wn.lemmas('eat', 'v'): • print lemma, lemma.count() • vocal = wn.lemma('vocal.a.01.vocal') • print vocal.derivationally_related_forms() • #[Lemma('vocalize.v.02.vocalize')] • print vocal.pertainyms() • #[Lemma('voice.n.02.voice')] • print vocal.antonyms()
wn04-VerbFrames.py • # Section 4 Verb Frames • print wn.synset('think.v.01').frame_ids • for lemma in wn.synset('think.v.01').lemmas: • print lemma, lemma.frame_ids • print lemma.frame_strings • print wn.synset('stretch.v.02').frame_ids • for lemma in wn.synset('stretch.v.02').lemmas: • print lemma, lemma.frame_ids • print lemma.frame_strings
wn05-Similarity.py • ### Section 5 Similarity • import nltk • from nltk.corpus import wordnet as wn • dog = wn.synset('dog.n.01') • cat = wn.synset('cat.n.01') • print dog.path_similarity(cat) • print dog.lch_similarity(cat) • print dog.wup_similarity(cat) • from nltk.corpus import wordnet_ic • brown_ic = wordnet_ic.ic('ic-brown.dat') • semcor_ic = wordnet_ic.ic('ic-semcor.dat')
wn05-Similarity.py continued • from nltk.corpus import genesis • genesis_ic = wn.ic(genesis, False, 0.0) • print dog.res_similarity(cat, brown_ic) • print dog.res_similarity(cat, genesis_ic) • print dog.jcn_similarity(cat, brown_ic) • print dog.jcn_similarity(cat, genesis_ic) • print dog.lin_similarity(cat, semcor_ic)
wn06-AccessToAllSynsets.py • ### Section 6 access to all synsets • import nltk • from nltk.corpus import wordnet as wn • for synset in list(wn.all_synsets('n'))[:10]: • print synset • wn.synsets('dog') • wn.synsets('dog', pos='v') • from itertools import islice • for synset in islice(wn.all_synsets('n'), 5): • print synset, synset.hypernyms()
wn07-Morphy.py • # Wordnet in NLTK • # http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html • import nltk • from nltk.corpus import wordnet as wn • ### Section 7 Morphy • print wn.morphy('denied', wn.NOUN) • print wn.synsets('denied', wn.NOUN) • print wn.synsets('denied', wn.VERB)
8 Regression Tests • Bug 85: morphy returns the base form of a word if its input is given as a base form for a POS for which that word is not defined: • >>> wn.synsets('book', wn.NOUN) • [Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'), Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'), Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11')] • >>> wn.synsets('book', wn.ADJ) • [] • >>> wn.morphy('book', wn.NOUN) • 'book' • >>> wn.morphy('book', wn.ADJ)
nltk.corpus.reader.wordnet.ic • ic(self, corpus, weight_senses_equally=False, smoothing=1.0) • Creates an information content lookup dictionary from a corpus. • http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#WordNetCorpusReader.ic • def demo(): • import nltk • print('loading wordnet') • wn = WordNetCorpusReader(nltk.data.find('corpora/wordnet')) • print('done loading') • S = wn.synset • L = wn.lemma
root_hypernyms • def root_hypernyms(self): • """Get the topmost hypernyms of this synset in WordNet.""" • result = [] • seen = set() • todo = [self] • while todo: • next_synset = todo.pop() • if next_synset not in seen: • seen.add(next_synset) • next_hypernyms = next_synset.hypernyms() + … • return result