Introduction to Natural Language Processing Source: Natural Language Processing with Python --- Analyzing Text with the Natural Language Toolkit
Status • We have progressed with Object-Oriented Programming in Python • Simple I/O, File I/O • Lists, Strings, Tuples, and their methods • Numeric types and operations • Control structures: if, for, while • Function definition and use • Parameters for defining the function, arguments for calling the function
Applying what we have • We have looked at some of the NLTK book. • Chapter 1 of the NLTK book repeats much of what we see in the other text. • Now in the context of an application domain: Natural Language Processing • Note: there are similar packages for other domains • Book examples in chapter 1 are all done in the interactive Python shell
Reasons • What can we achieve by combining simple programming techniques with large quantities of text? • How can we automatically extract key words and phrases that sum up the style and content of a text? • What tools and techniques does the Python programming language provide for such work? • What are some of the interesting challenges of natural language processing? Quote from nltk book Since text can cover any subject area, it is a general interest area to explore in some depth.
The NLTK • The Natural Language Toolkit • modules • datasets • tutorials • Contains: align, app (package), book, ccg (package), chat (package), chunk (package), classify (package), cluster (package), collocations, compat, containers, corpus (package), data, decorators, downloader, draw (package), etree (package), evaluate, examples (package), featstruct, grammar, help, inference (package), internals, lazyimport, metrics (package), misc (package), model (package), olac, parse (package), probability, sem (package), sourcedstring, stem (package), tag (package), text, tokenize (package), toolbox (package), tree, treetransforms, util, yamltags We will not have time to explore all of them, but this gives a full list for further exploration.
Recall - the NLTK >>> import nltk >>> nltk.download() Do it now, if you have not done so. This opens a window listing the corpora and packages available for download.
Getting data from the downloaded files • Previously, we used from math import pi • to get something specific from a module • Now, from the nltk.book, we will get the text files we will use • from nltk.book import *
Import the data files Do it now. Then type sent1 at a Python prompt to see the first sentence of Moby Dick. Repeat for sent2 .. sent9 to see the first sentence of each text. Take note of the collection of texts. Great variety. Different ones will be useful for different types of exploration. >>> import nltk >>> from nltk.book import * *** Introductory Examples for the NLTK Book *** Loading text1, ..., text9 and sent1, ..., sent9 Type the name of the text or sentence to view it. Type: 'texts()' or 'sents()' to list the materials. text1: Moby Dick by Herman Melville 1851 text2: Sense and Sensibility by Jane Austen 1811 text3: The Book of Genesis text4: Inaugural Address Corpus text5: Chat Corpus text6: Monty Python and the Holy Grail text7: Wall Street Journal text8: Personals Corpus text9: The Man Who Was Thursday by G . K . Chesterton 1908 What type of data is each first sentence?
Searching the texts >>> text9.concordance("sunset") Building index... Displaying 14 of 14 matches: E suburb of Saffron Park lay on the sunset side of London , as red and ragged n , as red and ragged as a cloud of sunset . It was built of a bright brick th bered in that place for its strange sunset . It looked like the end of the wor ival ; it was upon the night of the sunset that his solitude suddenly ended . he Embankment once under a dark red sunset . The red river reflected the red s st seemed of fiercer flame than the sunset it mirrored . It looked like a stre he passionate plumage of the cloudy sunset had been swept away , and a naked m der the sea . The sealed and sullen sunset behind the dark dome of St . Paul ' ming with the colour and quality of sunset . The Colonel suggested that , befo gold . Up this side street the last sunset light shone as sharp and narrow as of gas , which in the full flush of sunset seemed coloured like a sunset cloud sh of sunset seemed coloured like a sunset cloud . " After all ," he said , " y and quietly , like a long , low , sunset cloud , a long , low house , mellow house , mellow in the mild light of sunset . All the six friends compared note A concordance shows a word in context
Same word in different texts >>> text1.concordance("monstrous") Building index... Displaying 11 of 11 matches: ong the former , one was of a most monstrous size . ... This came towards us , ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r ll over with a heathenish array of monstrous clubs and spears . Some were thick d as you gazed , and wondered what monstrous cannibal and savage could ever hav that has survived the flood ; most monstrous and most mountainous ! That Himmal they might scout at Moby Dick as a monstrous fable , or still worse and more de th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l ing Scenes . In connexion with the monstrous pictures of whales , I am strongly ere to enter upon those still more monstrous stories of them which are to be fo ght have been rummaged out of this monstrous cabinet there is no telling . But of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u >>> text2.concordance("monstrous") Building index... Displaying 11 of 11 matches: . " Now , Palmer , you shall see a monstrous pretty girl ." He immediately went your sister is to marry him . I am monstrous glad of it , for then I shall have ou may tell your sister . She is a monstrous lucky girl to get him , upon my ho k how you will like them . Lucy is monstrous pretty , and so good humoured and Jennings , " I am sure I shall be monstrous glad of Miss Marianne ' s company usual noisy cheerfulness , " I am monstrous glad to see you -- sorry I could n t however , as it turns out , I am monstrous glad there was never any thing in so scornfully ! for they say he is monstrous fond of her , as well he may . I s possible that she should ." " I am monstrous glad of it . Good gracious ! I hav thing of the kind . So then he was monstrous happy , and talked on some time ab e very genteel people . He makes a monstrous deal of money , and they keep thei >>> Moby Dick Sense and Sensibility
>>> text1.similar("monstrous") abundant candid careful christian contemptible curious delightfully determined doleful domineering exasperate fearless few gamesome horrible impalpable imperial lamentable lazy loving >>> >>> text2.similar("monstrous") Building word-context index... very exceedingly heartily so a amazingly as extremely good great remarkably sweet vast >>> Note different sense of the word in the two texts.
Looking at vocabulary >>> len(text3) 44764 >>> Total number of tokens, includes non words and repeated words >>> len(set(text3)) 2789 >>> len(set(text2)) 6833 >>> What do these numbers mean?
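These counts are ordinary Python operations, so the arithmetic can be checked on a toy token list (invented here for illustration; an NLTK text behaves the same way):

```python
# Toy token list standing in for an NLTK text (invented for illustration).
tokens = ["in", "the", "beginning", "the", "word", "was", "the", "word", "."]

num_tokens = len(tokens)      # total tokens: repeats and punctuation included
num_types = len(set(tokens))  # distinct vocabulary items only

print(num_tokens)  # 9
print(num_types)   # 6
```

The gap between the two numbers is exactly what lexical diversity measures.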
>>> float(len(text2))/float(len(set(text2))) 20.719449729255086 >>> A rough measure of lexical richness What does this tell us? On average, a word is used > 20 times >>> from __future__ import division >>> 100*text2.count("money")/len(text2) 0.018364694581002431 >>> What does this tell us? Note two ways to get floating point results when dividing integers
Making life easier >>> def lexical_diversity(text): ... return len(text) / len(set(text)) ... >>> def percentage(count,total): ... return 100*count/total ... >>> lexical_diversity(text2) 20.719449729255086 >>> percentage(text2.count('money'),len(text2)) 0.018364694581002431 >>>
Spot check • Modify the function percentage so that you only have to pass it the name of the text and the word to count • the new call will look like this: • percentage(text2, “money”) • In which of the texts is “money” most dominant? • Where is it least dominant? • What are the percentages for each text?
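One possible solution, shown as a sketch (in Python 3, integer division is true division by default, so no __future__ import is needed; the toy token list is invented):

```python
def percentage(text, word):
    """Percent of the tokens in text that equal word (case-sensitive)."""
    return 100 * text.count(word) / len(text)

# Works on any token list; an NLTK Text supports count() and len() the same way.
toy = ["money", "makes", "the", "world", "go", "round", "for", "money"]
print(percentage(toy, "money"))  # 25.0
```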
Indexing the texts • Each of the texts is a list, and so all our list methods work, including slicing: The first 100 elements (indices 0 through 99) in the list for text2 (Sense and Sensibility) Note that the slice is itself a list; its first element is the string '[' from the title line. >>> text2[0:100] ['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']', 'CHAPTER', '1', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.', 'Their', 'estate', 'was', 'large', ',', 'and', 'their', 'residence', 'was', 'at', 'Norland', 'Park', ',', 'in', 'the', 'centre', 'of', 'their', 'property', ',', 'where', ',', 'for', 'many', 'generations', ',', 'they', 'had', 'lived', 'in', 'so', 'respectable', 'a', 'manner', 'as', 'to', 'engage', 'the', 'general', 'good', 'opinion', 'of', 'their', 'surrounding', 'acquaintance', '.', 'The', 'late', 'owner', 'of', 'this', 'estate', 'was', 'a', 'single', 'man', ',', 'who', 'lived', 'to', 'a', 'very', 'advanced', 'age', ',', 'and', 'who', 'for', 'many', 'years', 'of', 'his', 'life', ',', 'had', 'a', 'constant', 'companion'] >>>
Text index • We can see what is at a position: >>> text2[302] 'devolved' • And where a word appears: >>> text2.index('marriage') 255 >>> Remember that indexing begins at 0 and the index tells how far removed you are from the initial element.
Strings • Each of the elements in each of the text lists is a string, and all the string methods apply.
Frequency distributions >>> fdist1=FreqDist(text1) >>> fdist1 <FreqDist with 260819 outcomes> >>> vocabulary1=fdist1.keys() >>> vocabulary1[:50] [',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like'] >>> These are the 50 most common tokens in the text of Moby Dick. Many of these are not useful in characterizing the text. We call them “stop words” and will see how to eliminate them from consideration later.
More precise specification • Consider the mathematical expression {w | w ∈ V & p(w)} • The Python implementation is • [w for w in V if p(w)] List comprehension – we saw it first last week >>> AustenVoc=set(text2) >>> long_words_2=[w for w in AustenVoc if len(w) >15] >>> long_words_2 ['incomprehensible', 'disqualifications', 'disinterestedness', 'companionableness'] >>>
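The comprehension works on any set of strings; a toy vocabulary shows the mechanics (two of the long words are taken from the output above, the short words are invented):

```python
# Toy vocabulary standing in for set(text2).
V = {"the", "glad", "pretty", "incomprehensible", "disinterestedness"}

# [w for w in V if p(w)] with p(w) = len(w) > 15
long_words = sorted(w for w in V if len(w) > 15)
print(long_words)  # ['disinterestedness', 'incomprehensible']
```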
Add to the condition fdist2=FreqDist(text2) >>> long_words_2=sorted([w for w in AustenVoc if len(w) >12 and fdist2[w]>5]) >>> long_words_2 ['Somersetshire', 'accommodation', 'circumstances', 'communication', 'consciousness', 'consideration', 'disappointment', 'distinguished', 'embarrassment', 'encouragement', 'establishment', 'extraordinary', 'inconvenience', 'indisposition', 'neighbourhood', 'unaccountable', 'uncomfortable', 'understanding', 'unfortunately'] So, our if p(w) can be as complex as we need
Spot check • Find all the words longer than 12 characters, which occur at least 5 times, in each of the texts. • How well do they give you a sense of the texts?
Collocations and Bigrams • Sometimes a word by itself is not representative of its role in a text. It is only with a companion word that we get the intended sense. • red wine • high horse • sign of hope • Bigrams are two-word combinations • not all bigrams are useful, of course • len(bigrams(text2)) == 141575 • including “and among”, “they could”, … • collocations() provides bigrams that include uncommon words – words that might be significant in the text. • text2.collocations() prints 20 pairs
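Under the hood, a bigram list just pairs each token with its successor; zip over the list and its tail reproduces the idea (toy tokens invented for illustration):

```python
# Toy token list. nltk's bigrams() pairs each token with the token after it.
tokens = ["sign", "of", "hope", "in", "red", "wine"]
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('sign', 'of'), ('of', 'hope'), ('hope', 'in'), ('in', 'red'), ('red', 'wine')]
```

A text of N tokens yields N-1 bigrams, which is why text2's 141,576 tokens give 141,575 bigrams.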
>>> colloc2=text2.collocations() Colonel Brandon; Sir John; Lady Middleton; Miss Dashwood; every thing; thousand pounds; dare say; Miss Steeles; said Elinor; Miss Steele; every body; John Dashwood; great deal; Harley Street; Berkeley Street; Miss Dashwoods; young man; Combe Magna; every day; next morning >>> [len(w) for w in text2] [1, 5, 3, 11, 2, 4, 6, 4, 1, 7, 1, 3, 6, 2, 8, 3, 4, 4, 7, 2, 6, 1, 5, 6, 3, 5, 1, 3, 5, 9, 3, 2, 7, 4, 1, 2, 3, 6, 2, 5, 8, 1, 5, 1, 3, 4, 11, 1, 4, 3, 5, 2, 2, 11, 1, 6, 2, 2, 6, 3, 7, 4, 7, 2, 5, 11, 12, 1, 3, 4, 5, 2, 4, 6, 3, 1, 6, 3, 1, 3, 5, 2, 1, 4, 8, 3, 1, 3, 3, 3, 4, 5, 2, 3, 4, 1, 3, 1, 8, 9, 3, 11, 2, 3, 6, 1, 3, 3, 5, 1, 5, 8, 3, 5, 6, 3, 3, 1, 8, … For each word in text2, return its length >>> fdist2=FreqDist([len(w) for w in text2]) >>> fdist2 <FreqDist with 141576 outcomes> >>> fdist2.keys() [3, 2, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 16] >>> There are 141,576 words, each with a length. But there are only 17 different word lengths.
>>> fdist2.items() [(3, 28839), (2, 24826), (1, 23009), (4, 21352), (5, 11438), (6, 9507), (7, 8158), (8, 5676), (9, 3736), (10, 2596), (11, 1278), (12, 711), (13, 334), (14, 87), (15, 24), (17, 3), (16, 2)] >>> There are 28,839 3-letter words in Sense and Sensibility (not unique words, necessarily) >>> fdist2.keys() [3, 2, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 16] >>> fdist2.items() [(3, 28839), (2, 24826), (1, 23009), (4, 21352), (5, 11438), (6, 9507), (7, 8158), (8, 5676), (9, 3736), (10, 2596), (11, 1278), (12, 711), (13, 334), (14, 87), (15, 24), (17, 3), (16, 2)] >>> fdist2.max() 3 >>> fdist2[3] 28839 >>> fdist2[13] 334 >>> There are 28,839 3-letter words and 334 13-letter words in Sense and Sensibility
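FreqDist is essentially a counting dictionary, so the same length-counting idea can be sketched with the standard library's collections.Counter (toy token list invented for illustration):

```python
from collections import Counter

# Toy token list standing in for text2.
tokens = ["a", "to", "the", "whale", "the", "a", "sea"]

# Counting word lengths, as FreqDist([len(w) for w in text2]) does.
fdist = Counter(len(w) for w in tokens)

print(fdist[3])                    # 3: three 3-letter tokens ('the', 'the', 'sea')
print(fdist.most_common(1)[0][0])  # 3: the most frequent length, like fdist.max()
print(sum(fdist.values()))         # 7: total outcomes, one per token
```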
Conditionals We have seen conditionals and loop statements. These are some special functions for work on text: • s.startswith(t) - test if s starts with t • s.endswith(t) - test if s ends with t • t in s - test if t is contained inside s • s.islower() - test if all cased characters in s are lowercase • s.isupper() - test if all cased characters in s are uppercase • s.isalpha() - test if all characters in s are alphabetic • s.isalnum() - test if all characters in s are alphanumeric • s.isdigit() - test if all characters in s are digits • s.istitle() - test if s is titlecased (all words in s have initial capitals)
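A quick sanity check of these predicates at a Python 3 prompt (the sample strings are invented):

```python
# Each predicate returns a bool, so it slots directly into an if test or a
# list-comprehension condition.
print("Ishmael".startswith("Ish"))   # True
print("whaling".endswith("ing"))     # True
print("1851".isdigit())              # True
print("NLTK".isupper())              # True

tokens = ["Moby", "Dick", "1851", ";"]
print([w for w in tokens if w.istitle()])  # ['Moby', 'Dick']
```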
Spot check From the NLTK book: Run the following examples and explain what is happening. Then make up some tests of your own. >>> sorted([w for w in set(text7) if '-' in w and 'index' in w]) >>> sorted([wd for wd in set(text3) if wd.istitle() and len(wd) > 10]) >>> sorted([w for w in set(sent7) if not w.islower()]) >>> sorted([t for t in set(text2) if 'cie' in t or 'cei' in t])
Ending the double count of words • The count of words from the various texts was flawed. How? • We had • What’s the problem? How do we fix it? >>> len(text1) 260819 >>> len(set(text1)) 19317 >>> len(set([word.lower() for word in text1])) 17231 >>> >>> len(set([word.lower() for word in text1 if word.isalpha()])) 16948 >>>
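The effect of each normalization step can be seen on a toy token list (invented for illustration):

```python
# Toy token list standing in for text1.
tokens = ["The", "whale", ",", "the", "Whale", "!", "sea"]

types_raw = set(tokens)                                      # case and punctuation distinct
types_folded = set(w.lower() for w in tokens)                # case folded
types_words = set(w.lower() for w in tokens if w.isalpha())  # alphabetic words only

print(len(types_raw))     # 7: 'The', 'the', 'Whale' count separately, so does punctuation
print(len(types_folded))  # 5: case merged, punctuation still counted
print(len(types_words))   # 3: {'the', 'whale', 'sea'}
```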
Nested loops and loops with conditions • Follow what happens. >>> for token in sent1: ... if token.islower(): ... print token, 'is a lowercase word' ... elif token.istitle(): ... print token, 'is a titlecase word' ... else: ... print token, 'is punctuation' ... Call is a titlecase word me is a lowercase word Ishmael is a titlecase word . is punctuation >>>
Another example >>> tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w]) >>> for word in tricky: ... print word, ancient ceiling conceit conceited conceive conscience conscientious conscientiously deceitful deceive ... >>>
Automatic Text Understanding • See section 1.5 • Some realistic, interesting problems associated with Natural Language Processing • Word sense disambiguation a. The lost children were found by the searchers (agentive) b. The lost children were found by the mountain (locative) c. The lost children were found by the afternoon (temporal) • Pronoun resolution a. The thieves stole the paintings. They were subsequently sold. b. The thieves stole the paintings. They were subsequently caught. c. The thieves stole the paintings. They were subsequently found.
Generating text! >>> text4.generate() Building ngram index... Fellow - Citizens : Under Providence I have given freedom new reach , and maintain lasting peace -- based on righteousness and justice . There was this reason only why the cotton - producing States should be promoted by just and abundant society , on just principles . These later years have elapsed , and civil war . More than this , we affirm a new beginning is a destiny . May Congress prohibit slavery in the workshop , in translating humanity ' s strongest , but we have adopted , and fear of God . And , in each >>> An inaugural address?? -- MIT hoax – conference submission
Translation Babel> How long before the next flight to Alice Springs? Babel> german Babel> run 0> How long before the next flight to Alice Springs? 1> Wie lang vor dem folgenden Flug zu Alice Springs? 2> How long before the following flight to Alice jump? 3> Wie lang vor dem folgenden Flug zu Alice springen Sie? 4> How long before the following flight to Alice do you jump? 5> Wie lang, bevor der folgende Flug zu Alice tun, Sie springen? 6> How long, before the following flight to Alice does, do you jump? 7> Wie lang bevor der folgende Flug zu Alice tut, tun Sie springen? 8> How long before the following flight to Alice does, do you jump? 9> Wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen? 10> How long, before the following flight does to Alice, do do you jump? 11> Wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung? 12> How long before the following flight does leap to Alice, does you? Babel>
Jeopardy and Watson The ultimate example of a machine and language http://www.youtube.com/watch?v=xm8iUjzgPTg&feature=related http://www.youtube.com/watch?v=7h4baBEi0iA&feature=related -- the strange response http://www.youtube.com/watch?src_vid=7h4baBEi0iA&feature=iv&v=lI-M7O_bRNg&annotation_id=annotation_383798#t=3m11s Explanation of the strange response
Text corpora • A collection of text entities • Usually there is some unifying characteristic, but not always • Typical examples • All issues of a newspaper for a period of time • A collection of reports from a particular industry or standards body • More recent • The whole collection of posts to Twitter • All the entries in a blog or set of blogs
Check it out • Go to http://www.gutenberg.org/ • Take a few minutes to explore the site. • Look at the top 100 downloads of yesterday • Can you characterize them? What do you think of this list?
Corpora in nltk • The nltk includes part of the Gutenberg collection • Find out which ones by >>> nltk.corpus.gutenberg.fileids() • These are the texts of the Gutenberg collection that are downloaded with the nltk package.
Accessing other texts • We will explore the files loaded with nltk • You may want to explore other texts also. • From the help(nltk.corpus): • If C{item} is one of the unique identifiers listed in the corpus module's C{items} variable, then the corresponding document will be loaded from the NLTK corpus package. • If C{item} is a filename, then that file will be read. For now – just a note that we can use these tools on other texts that we download or acquire from any source.
Using the tools we saw before • The particular texts we saw in chapter 1 were accessed through aliases that simplified the interaction. • Now, in the more general case, we have to do more. • To get the list of words in a text: >>> emma = nltk.corpus.gutenberg.words('austen-emma.txt') • Now we have the form we had for the texts of Chapter 1 and can use the tools found there. Try: >>> len(emma) Note how frequently Jane Austen's books are used in these examples.
Shortened reference • Global context • Instead of citing the gutenberg corpus for each resource, >>> from nltk.corpus import gutenberg >>> gutenberg.fileids() ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...] >>> emma = gutenberg.words('austen-emma.txt') • So, nltk.corpus.gutenberg.words('austen-emma.txt') becomes just gutenberg.words('austen-emma.txt')
Other access options • gutenberg.words('austen-emma.txt') • the words of the text • gutenberg.raw('austen-emma.txt') • the original text, no separation into tokens (words). One long string. • gutenberg.sents('austen-emma.txt') • the text divided into sentences
Some code to run • Enter and run the code for counting characters, words, sentences and finding the lexical diversity score of each text in the corpus.

import nltk
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print int(num_chars/num_words), int(num_words/num_sents), \
        int(num_words/num_vocab), fileid

Short, simple code. Already seeing some noticeable time to execute
Modify the code • Simple change – print out the total number of characters, words, sentences for each text.
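The modification amounts to printing num_chars, num_words, and num_sents instead of the ratios. As a sketch of what those counts look like, here is a corpus-free stand-in that runs on a raw string (the sample text is invented, and the sentence split on periods is deliberately naive compared with gutenberg.sents()):

```python
# A rough, corpus-free stand-in for the gutenberg loop above.
raw = "Call me Ishmael. Some years ago I went to sea. It was cold."

words = raw.split()                              # crude tokenization on whitespace
sents = [s for s in raw.split(".") if s.strip()]  # naive sentence split on '.'

num_chars = len(raw)
num_words = len(words)
num_sents = len(sents)

print(num_chars, num_words, num_sents)  # 59 13 3
```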
The text corpus • Take a look at your directory of nltk_data to see the variety of text materials accessible to you. • Some are not plain text and we cannot use them yet – but will • Of the plain text, note the diversity • Classic published materials • News feeds, movie reviews • Overheard conversations, internet chat • All categories of language are needed to understand the language as it is defined and as it is used.
The Brown Corpus • First 1 million word corpus • Explore – • what are the categories? • Access words or sentences from one or more categories or fileids >>> from nltk.corpus import brown >>> brown.categories() >>> brown.fileids(categories="<choose>")
Stylistics • Enter that code and run it. • What does it give you? • What does it mean? >>> from nltk.corpus import brown >>> news_text = brown.words(categories='news') >>> fdist = nltk.FreqDist([w.lower() for w in news_text]) >>> modals = ['can', 'could', 'may', 'might', 'must', 'will'] >>> for m in modals: ... print m + ':', fdist[m],
Spot check • Repeat the previous code, but look for the use of those same words in the categories for religion, government • Now analyze the use of the “wh” words in the news category and one other of your choice. (Who, What, Where, When, Why)
One step comparison • Consider the following code: import nltk from nltk.corpus import brown cfd = nltk.ConditionalFreqDist( (genre, word) for genre in brown.categories() for word in brown.words(categories=genre)) genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor'] modals = ['can', 'could', 'may', 'might', 'must', 'will'] cfd.tabulate(conditions=genres, samples=modals) Enter and run it. What does it do?
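The ConditionalFreqDist idea can be sketched without NLTK as one Counter per condition (the toy genre data below is invented; the real tabulate() prints a nicely aligned table):

```python
from collections import Counter

# Toy genre -> tokens mapping standing in for the Brown corpus.
corpus = {
    "news":    ["the", "will", "will", "can", "report"],
    "romance": ["could", "could", "could", "will", "heart"],
}
modals = ["can", "could", "will"]

# One Counter per condition plays the role of the ConditionalFreqDist.
cfd = {genre: Counter(words) for genre, words in corpus.items()}

# A crude tabulate(): one row per genre, one column per modal.
for genre in ["news", "romance"]:
    print(genre, [cfd[genre][m] for m in modals])
# news [1, 0, 2]
# romance [0, 3, 1]
```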
Other corpora • There is some information about the Reuters and Inaugural address corpora also. Take a look at them with the online site. (5 minutes or so)