WordNet, Raw Text, Pinker, Continuing Chapter 2
Today’s Class • WordNet • Raw Text • Pinker • Continuing Chapter 2 (Millar)
WordNet • NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets.
WordNet • We can explore these words with the help of WordNet: looking up motorcar with wn.synsets('motorcar') returns a single synset. • Thus, motorcar has just one possible meaning and it is identified as car.n.01, the first noun sense of car. • The entity car.n.01 is called a synset, or "synonym set", a collection of synonymous words (or "lemmas"). • Synsets also come with a prose definition and some example sentences.
WordNet • Unlike the words automobile and motorcar, which are unambiguous and have one synset, the word car is ambiguous, having five synsets:
The WordNet Hierarchy • WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. • These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event; these are called unique beginners or root synsets. • Others, such as gas guzzler and hatchback, are much more specific. A small portion of a concept hierarchy is illustrated in Figure 2.11.
The WordNet Hierarchy • It’s very easy to navigate between concepts. For example, given a concept like motorcar, we can look at the concepts that are more specific: the (immediate) hyponyms.
The WordNet Hierarchy • We can also navigate up the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between car.n.01 and entity.n.01 because wheeled_vehicle.n.01 can be classified as both a vehicle and a container. • Hypernyms and hyponyms are called lexical relations because they relate one synset to another. These two relations navigate up and down the "is-a" hierarchy.
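The two-path situation can be sketched without WordNet itself. The block below is a minimal Python 3 illustration over a hand-written, heavily abbreviated hypernym graph; the `HYPERNYMS` dict, its node names, and this standalone `hypernym_paths` helper are assumptions made for the example, not NLTK data.

```python
# Toy hypernym graph: each concept maps to its immediate hypernyms.
# Hand-written and heavily abbreviated; NOT the real WordNet hierarchy.
HYPERNYMS = {
    "car": ["motor_vehicle"],
    "motor_vehicle": ["wheeled_vehicle"],
    "wheeled_vehicle": ["vehicle", "container"],  # classified two ways
    "vehicle": ["artifact"],
    "container": ["artifact"],
    "artifact": ["entity"],
    "entity": [],  # root concept: no hypernyms
}


def hypernym_paths(concept):
    """Return every path from the root down to `concept`."""
    parents = HYPERNYMS[concept]
    if not parents:
        return [[concept]]
    paths = []
    for parent in parents:
        for path in hypernym_paths(parent):
            paths.append(path + [concept])
    return paths


paths = hypernym_paths("car")
for path in paths:
    print(" > ".join(path))
```

Because wheeled_vehicle has two hypernyms in the toy graph, the helper returns two root-to-car paths. On a real synset, NLTK offers a method of the same name, hypernym_paths().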
WordNet: More Lexical Relations • Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms). • For example, the parts of a tree are its trunk, crown, and so on; these are returned by part_meronyms(). • The substances a tree is made of include heartwood and sapwood; these are returned by substance_meronyms(). • A collection of trees forms a forest; this is returned by member_holonyms().
WordNet: More Lexical Relations • Some lexical relationships hold between lemmas, e.g., antonymy: • There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments:
WordNet: Semantic Similarity • Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term like vehicle will match documents containing specific terms like limousine. • Two synsets linked to the same root may have several hypernyms in common. If two synsets share a very specific hypernym — one that is low down in the hypernym hierarchy — they must be closely related.
WordNet: Semantic Similarity • Of course we know that whale is very specific (and baleen whale even more so), while vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:
WordNet: Semantic Similarity • Similarity measures that incorporate the above insight have been defined over the collection of WordNet synsets. For example, path_similarity assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy. • The numbers don’t mean much, but they decrease as we move away from the semantic space of sea creatures to inanimate objects.
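To make depth and path-based similarity concrete, here is a minimal Python 3 sketch over a toy taxonomy. It assumes the usual definition of path similarity as 1 / (1 + d), where d is the number of edges on the shortest path between two concepts. The `PARENT` table and all helper names are hypothetical and hand-written for the example; they are not WordNet data.

```python
# Toy taxonomy: each concept maps to its single parent (None for the root).
# Hand-written for illustration; NOT the real WordNet hierarchy.
PARENT = {
    "entity": None,
    "animal": "entity",
    "whale": "animal",
    "right_whale": "whale",
    "orca": "whale",
    "tortoise": "animal",
    "artifact": "entity",
    "novel": "artifact",
}


def ancestors(concept):
    """List the concept and its ancestors, from the concept up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def depth(concept):
    """Number of edges between the concept and the root."""
    return len(ancestors(concept)) - 1


def path_length(a, b):
    """Edges on the shortest path from a to b via their lowest common ancestor."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("no common ancestor")


def path_similarity(a, b):
    """Score in (0, 1]; 1.0 means the two concepts are identical."""
    return 1.0 / (1 + path_length(a, b))


for other in ["right_whale", "orca", "tortoise", "novel"]:
    print("%-12s %.2f" % (other, path_similarity("right_whale", other)))
```

In this toy tree the scores fall as the lowest shared ancestor moves up from whale to animal to entity, which is the same trend the slide describes for the real hierarchy.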
Computing Semantic Similarity

    __author__ = 'guinnc'
    import nltk
    from nltk.corpus import wordnet as wn

    words = ['right_whale', 'orca', 'minke_whale', 'tortoise', 'novel']
    listOfSynsets = []
    for word in words:
        firstSynset = wn.synsets(word, 'n')[0]
        print firstSynset
        listOfSynsets.append(firstSynset)

    # print header
    print '%15s' % ' ',
    for synset1 in listOfSynsets:
        firstLemma = synset1.lemma_names[0]
        print '%15s' % firstLemma,
    print

    # print one row of similarity scores per synset
    for synset1 in listOfSynsets:
        print '%15s' % synset1.lemma_names[0],
        for synset2 in listOfSynsets:
            print '%15.2f' % synset1.path_similarity(synset2),
        print
VerbNet: A Verb Lexicon • NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet. It can be accessed with nltk.corpus.verbnet. • *VerbNet is the largest on-line verb lexicon currently available for English. • It is a hierarchical, domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet and FrameNet. * Adapted from VerbNet website
VerbNet: A Verb Lexicon • Each VerbNet class contains a set of syntactic descriptions, depicting the possible surface realizations of the argument structure for constructions such as transitive, intransitive, prepositional phrases, etc. • Semantic restrictions (such as animate, human, organization) are used to constrain the types of thematic roles allowed by the arguments. • Syntactic frames may also be constrained in terms of which prepositions are allowed. • Each frame is associated with explicit semantic information. • [Figure: a complete entry for a frame in VerbNet class Hit-18.1] * Adapted from VerbNet website
VerbNet: A Verb Lexicon • Each verb argument is assigned one (usually unique) thematic role within the class.
NLTK and VerbNet

    __author__ = 'guinnc'
    import nltk
    from nltk.corpus import verbnet as vn

    theWord = raw_input("Type a verb: ")
    verb_uses = vn.classids(theWord)
    print verb_uses
    for verb in verb_uses:
        print vn.pprint(vn.vnclass(verb))
    # print vn.pprint_subclasses(vn.vnclass(vn.classids('spray')[0]))
Processing “raw” Text in NLTK • As mentioned before, Project Gutenberg has tens of thousands of free books • Only a small subset is included with the NLTK download • Suppose you want to access a text online. How do you do it?
URLs • If you know the URL of a text, you can read it!

    from __future__ import division
    __author__ = 'guinnc'
    import nltk, re, pprint
    from urllib import urlopen

    url = "http://www.gutenberg.org/files/2554/2554.txt"
    raw = urlopen(url).read()
    print type(raw)
    print len(raw)

    # break it into tokens
    tokens = nltk.word_tokenize(raw)
    print len(tokens)

    # to use some of nltk's functions, we need to run Text on this
    text = nltk.Text(tokens)
    text.collocations()  # prints its own output, so no print statement is needed
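The slide's code is written for Python 2. As a hedged sketch of the same read-then-tokenize steps in Python 3, where urlopen lives in urllib.request and read() returns bytes that must be decoded, the block below may help. The `fetch_text` and `simple_tokens` helpers are hypothetical names introduced for illustration, the whitespace tokenizer is only a crude stand-in for nltk.word_tokenize, and a data: URL is used so the demo runs offline.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download `url` and decode the returned bytes into a str."""
    with urlopen(url) as response:
        raw_bytes = response.read()
    return raw_bytes.decode(encoding)


def simple_tokens(text):
    """Whitespace tokenizer; a crude stand-in for nltk.word_tokenize."""
    return text.split()


# A data: URL keeps the demo offline; a real call would pass an http(s) URL.
sample = fetch_text("data:text/plain;charset=utf-8,Crime%20and%20Punishment")
tokens = simple_tokens(sample)
print(type(sample), len(sample), tokens)
```

The encoding argument is an assumption: a real page may declare a different encoding, in which case the matching codec name has to be passed instead.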
What about HTML files? • You can read them as “raw” files with all the HTML tags or … • Clean it up

    # imports carried over from the previous slide
    import nltk
    from urllib import urlopen

    url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    html = urlopen(url).read()
    raw = nltk.clean_html(html)   # strip the HTML markup
    tokens = nltk.word_tokenize(raw)
    text = nltk.Text(tokens)
    text.concordance('gene')
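The cleanup step relies on nltk.clean_html, which newer NLTK releases (3.x) reportedly no longer implement. The block below is a minimal Python 3 sketch of the same idea using only the standard library's html.parser. It is a swapped-in technique, not NLTK's implementation; `TextExtractor`, `strip_tags`, and the sample page are hypothetical names and data made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_tags(html):
    """Return the visible text of an HTML string, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>Blondes</h1><p>A <b>gene</b> story.</p></body></html>")
print(strip_tags(page))
```

Joining chunks with single spaces is a simplification; it can leave a stray space before punctuation that directly follows an inline tag. Real pages also carry navigation and other page furniture that still needs trimming by hand.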
Web searches • Google prohibits web searches from programs unless you get a developer’s licence. • I was able to do this with Bing (and also ask.com):

    from __future__ import division
    __author__ = 'guinnc'
    import nltk, re, pprint
    from urllib import urlopen

    url = "http://www.bing.com/search?q=NLTK"
    html = urlopen(url).read()
    raw = nltk.clean_html(html)
    tokens = nltk.word_tokenize(raw)
    text = nltk.Text(tokens)
    text.concordance("NLTK")  # prints its own output, so no print statement is needed
Text Files on your local computer • Just use:

    f = open('document.txt')
    raw = f.read()

• If you want to read a line at a time:

    f = open('document.txt', 'rU')
    for line in f:
        print line
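For comparison, here is a minimal Python 3 sketch of the same two reading patterns using a with block, which closes the file automatically. The temporary file and its contents are assumptions made so the example is self-contained; in practice the path would point at your own document.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

# Read the whole file at once.
with open(path, encoding="utf-8") as f:
    raw = f.read()

# Read one line at a time, stripping the trailing newline from each.
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

os.remove(path)  # clean up the throwaway file
print(repr(raw))
print(lines)
```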
Unicode • Not all files (on the web, for instance) use Unicode. • Sometimes we have to translate into Unicode (decoding) and sometimes we need to go from Unicode to some other encoding (encoding).
What happens if you use the wrong encoding?

    import nltk
    import codecs

    path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

    # read the file without saying which encoding it uses
    f = open(path, 'rU')
    for line in f:
        print line,
    print

    # read it again, decoding from Latin-2 into Unicode
    f = codecs.open(path, encoding='latin2')
    for line in f:
        print line,
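The decoding and encoding steps, and the effect of picking the wrong codec, can be tried without the NLTK sample file. The block below is a small Python 3 sketch; the sample word and variable names are assumptions chosen for the example.

```python
# Polish word for "yellow"; it contains three non-ASCII letters.
word = "żółty"

# Encoding: Unicode str -> bytes in a specific encoding.
latin2_bytes = word.encode("iso-8859-2")  # Latin-2: one byte per letter
utf8_bytes = word.encode("utf-8")         # UTF-8: two bytes per non-ASCII letter here

# Decoding: bytes -> Unicode str, using the encoding the bytes were written in.
decoded = latin2_bytes.decode("iso-8859-2")

# Decoding with the wrong codec raises no error in this case; it silently
# produces different characters.
garbled = latin2_bytes.decode("iso-8859-1")

print(len(latin2_bytes), len(utf8_bytes))
print(decoded == word, garbled == word)
```

The silent failure is the practical hazard: Latin-1 maps every byte value to some character, so a mismatch shows up only as wrong letters in the output, never as an exception.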
What’s next? • Continuing chapter 2 • Homework 3 is assigned • Regular expressions on Tuesday