Python 3 March 15, 2011
NLTK
    import nltk
    nltk.download()
NLTK 1. Look at the lists of available texts
    import nltk
    from nltk.book import *
    texts()
NLTK • 2. Check out what the text1 (Moby Dick) object looks like
    import nltk
    from nltk.book import *
    print text1[0:50]
Looks like a list of word tokens
NLTK • 3. Get list of the most frequent word TOKENS
    import nltk
    from nltk.book import *
    fd=FreqDist(text1)
    print fd.keys()[0:10]
FreqDist is an object defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html
Give it a list of word tokens. In the NLTK version used here, its keys come back automatically sorted by decreasing frequency. Print the first 10 keys.
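The slides use Python 2 syntax and an NLTK release in which fd.keys() comes back sorted by frequency. As a rough sketch of the same idea that does not depend on NLTK, the standard library's collections.Counter can count tokens and report the most frequent ones. The token list below is a made-up stand-in for text1, not real data from Moby Dick. To the best of my knowledge, newer NLTK releases build FreqDist on top of Counter, so most_common would be the method to reach for there as well.

```python
from collections import Counter

# Hypothetical stand-in for text1: a short list of word tokens.
tokens = ["the", "whale", ",", "the", "sea", ",", "and", "the", "ship", "."]

# Count how often each token occurs.
counts = Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs,
# highest count first.
top_two = counts.most_common(2)
print(top_two)
```

Note that punctuation marks count as tokens here too, which is why "," shows up near the top.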
NLTK • 4. Now get a concordance of the third most common word
    import nltk
    from nltk.book import *
    text1.concordance("and")
concordance is a method defined for an NLTK Text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance
    concordance(self, word, width=79, lines=25)
Print a concordance for word with the specified context window.
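To make the idea of a concordance concrete, here is a minimal sketch written from scratch. It is not how NLTK implements concordance; in particular, the hypothetical helper simple_concordance measures context in tokens, whereas the NLTK signature above takes a width. The token list is invented for illustration.

```python
def simple_concordance(tokens, word, window=1):
    """Return each occurrence of word with `window` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - window):i]    # tokens just before the hit
            right = tokens[i + 1:i + 1 + window]   # tokens just after the hit
            lines.append(" ".join(left + [tok] + right))
    return lines

# Hypothetical token list standing in for text1.
tokens = ["the", "ship", "and", "the", "whale", "and", "the", "sea"]

hits = simple_concordance(tokens, "and", window=1)
for line in hits:
    print(line)
```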
String Operations • 5. What if you don't want punctuation in your list? • First, a simple way to fix it:
    import nltk
    from nltk.book import *
    mobyDick=[x.replace(",","") for x in text1]
    mobyDick=[x.replace(";","") for x in mobyDick]
    mobyDick=[x.replace(".","") for x in mobyDick]
    mobyDick=[x.replace("'","") for x in mobyDick]
    mobyDick=[x.replace("-","") for x in mobyDick]
    mobyDick=[x for x in mobyDick if len(x)>1]
    fd=FreqDist(mobyDick)
    print fd.keys()[0:10]
Make a new list of tokens. Call it mobyDick.
For each token x in the original list, copy the token into the new list, except replace each "," with nothing. Do the same for ";", ".", "'" and "-".
Then, finally, keep only the tokens longer than one character. This drops what was originally "." and is now empty; it also drops one-letter words such as "a" and "I".
Make a new FreqDist with the new list of tokens, call it fd.
Print it like before.
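The same replace-then-filter idea can be tried on a small list without loading NLTK. This is only a sketch in Python 3 syntax (the slides use Python 2), and the token list is made up. It loops over the punctuation marks instead of writing one line per mark, and it shows how a filter of len(x)>0 differs from the len(x)>1 used on the slide.

```python
# Hypothetical tokens standing in for text1.
tokens = ["whale", ",", "ship", ".", "a", "don't", "sea-captain", ";"]

# Strip each punctuation mark from every token, one mark at a time.
cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    cleaned = [t.replace(mark, "") for t in cleaned]

# Tokens that were pure punctuation are now empty strings.
nonempty = [t for t in cleaned if len(t) > 0]

# The slide's filter is stricter: it also drops one-letter words like "a".
longer_than_one = [t for t in cleaned if len(t) > 1]

print(nonempty)
print(longer_than_one)
```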
Regular Expressions • 6. Now the more complicated, but less typing way:
    import nltk
    from nltk.book import *
    import re
    punctuation = re.compile("[,.; '-]")
    punctuationRemoved=[punctuation.sub("",x) for x in text1]
    fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
    print fd.keys()[0:10]
Import the regular expression module.
Compile a regular expression. The RegEx will match any one of the characters inside the brackets.
Call the "sub" method associated with the RegEx named punctuation: replace anything that matches the RegEx with nothing.
As before, do this to each token in the text1 list. Call this new list punctuationRemoved.
Get a FreqDist of all tokens with length >1.
Print the top 10 word tokens as usual.
Regular Expressions are Really Powerful and Useful!
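A small sketch of the character-class approach on an invented token list, in Python 3 syntax. Two details are worth seeing in action: the "-" sits at the end of the brackets so it is read as a literal hyphen rather than a range, and a broader class can be built from string.punctuation with re.escape. The all_punct variant is an assumption for illustration, not something from the slides.

```python
import re
import string

# Same character class as on the slide, written as a raw string.
punctuation = re.compile(r"[,.; '-]")

# Hypothetical tokens standing in for text1.
tokens = ["whale", ",", "ship", ".", "don't", "sea-captain", "a"]

punctuation_removed = [punctuation.sub("", t) for t in tokens]
kept = [t for t in punctuation_removed if len(t) > 1]
print(kept)

# Broader variant (an assumption, not from the slides): every ASCII
# punctuation character, escaped so none is treated as special.
all_punct = re.compile("[" + re.escape(string.punctuation) + "]")
print(all_punct.sub("", "wow!"))
```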
Quick Diversion • 7. What if you wanted to see the least common word tokens?
    import nltk
    from nltk.book import *
    import re
    print fd.keys()[-10:]
Print the tokens from position -10 to the end
Quick Diversion • 8. And what if you wanted to see the frequencies with the words?
    import nltk
    from nltk.book import *
    import re
    print [(k, fd[k]) for k in fd.keys()[0:10]]
For each key "k" among the first 10 keys of the FreqDist, print it and look up its value (fd[k])
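Both diversions can be sketched with collections.Counter on a made-up token list (Python 3 syntax; this stands in for fd and is not the NLTK object itself). Negative slice positions count from the end, so slicing a frequency-sorted list from -1 onward gives the least common entry.

```python
from collections import Counter

# Hypothetical tokens standing in for the cleaned Moby Dick list.
tokens = ["the", "whale", "the", "sea", "the", "whale"]
counts = Counter(tokens)

# Full list of (token, count) pairs, most frequent first.
ranked = counts.most_common()

# Slicing from position -1 to the end keeps the least common entry.
least_common = ranked[-1:]

# Pair each of the top two tokens with its count by looking it up.
top_with_counts = [(k, counts[k]) for k, _ in ranked[0:2]]

print(least_common)
print(top_with_counts)
```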
Back to Regular Expressions • 9. Another simple example
    import re
    myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
    colorsRegEx=re.compile("blue|red|green")
    print colorsRegEx.sub("color",myString)
Looks similar to the RegEx that matched punctuation before.
This RegEx matches the substring "blue" or the substring "red" or the substring "green".
Here, substitute anything that matches the RegEx with the string "color".
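One thing to watch with this pattern is that it matches substrings, so "red" is also found inside a longer word. The sketch below (Python 3 syntax, with an invented sentence) compares the slide's pattern with a variant that adds \b word boundaries; the bounded variant is a suggestion, not part of the original slides.

```python
import re

# Pattern from the slide: matches the letters anywhere, even inside a word.
colors = re.compile("blue|red|green")

# Variant with word boundaries (an addition for illustration).
whole_word_colors = re.compile(r"\b(?:blue|red|green)\b")

sentence = "I hired a red car"  # hypothetical example sentence

plain = colors.sub("color", sentence)
bounded = whole_word_colors.sub("color", sentence)

print(plain)
print(bounded)
```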
Back to Regular Expressions • 10. A more interesting example
    import re
    myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
What if we wanted to identify all of the phone numbers in the string?
    phoneNumbersRegEx=re.compile(r'\d{11}')
    print phoneNumbersRegEx.findall(myString)
This is a start. Output: ['18005551234']
Note that \d is a digit, and {11} matches 11 digits in a row.
findall will return a list of all non-overlapping substrings of myString that match the RegEx.
Also will need to know: "?" will match 0 or 1 repetitions of the previous element.
Note: find lots more information on regular expressions here: http://docs.python.org/library/re.html
    phoneNumbersRegEx=re.compile(r'1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')
    print phoneNumbersRegEx.findall(myString)
Answer is here, but let's derive it together
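One possible way to derive the answer is to grow the pattern one piece at a time and watch which numbers findall picks up at each step. The intermediate patterns below are my own guess at a sensible path, not necessarily the one taken in class; only the first and last patterns come from the slides. Python 3 syntax is used.

```python
import re

my_string = (
    "I have red shoes and blue pants and a green shirt. "
    "My phone number is 8005551234 and my friend's phone number is "
    "(800)-565-7568 and my cell number is 1-800-123-4567. "
    "You could also call me at 18005551234 if you'd like."
)

steps = [
    r"\d{11}",                          # exactly 11 digits in a row
    r"1?\d{10}",                        # optional leading 1, then 10 digits
    r"1?-?\d{3}-?\d{3}-?\d{4}",         # allow optional hyphens between groups
    r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}",   # allow optional parentheses (final answer)
]

# Run each pattern over the string and collect its matches.
results = [re.findall(pattern, my_string) for pattern in steps]

for pattern, found in zip(steps, results):
    print(pattern, found)
```

Each "?" makes the element before it optional, which is what lets one pattern cover all four ways the numbers are written.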
Homework • Webpage Identifying Information
Write two regular expressions to match:
1. Email addresses
2. Phone numbers
List, remove, or tag those found in the webpage: https://jshare.johnshopkins.edu/kchurch4/public_html/teaching/103/Spring2011/
Hint: Use part 2 of the last homework (and urllib) and two regular expressions. For phone numbers, go ahead and use the example from this class!
As always, email answers to Ken (Kenneth.Church@jhu.edu) and Ann (annirvine@gmail.com) by dawn Thursday, March 17
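The three verbs in the assignment (list, remove, tag) map onto findall and sub. Here is a sketch using only the phone number pattern from class on an invented string; fetching the real page with urllib and writing the email pattern are left as in the assignment. The <phone> tag name and the sample text are assumptions made for illustration.

```python
import re

# Phone number pattern from the class example.
phone = re.compile(r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}")

# Hypothetical page text; the real homework would read this from the webpage.
page_text = "Office: 1-800-123-4567. Fax: (800)-565-7568."

# List: collect every match.
found = phone.findall(page_text)

# Remove: replace every match with nothing.
removed = phone.sub("", page_text)

# Tag: wrap every match in a made-up <phone> marker.
tagged = phone.sub(lambda m: "<phone>" + m.group(0) + "</phone>", page_text)

print(found)
print(removed)
print(tagged)
```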