Python 3

Python 3 March 15, 2011

NLTK

    import nltk
    nltk.download()

NLTK • 1. Look at the list of available texts

    import nltk
    from nltk.book import *
    texts()

NLTK • 2. Check out what the text1 (Moby Dick) object looks like

    import nltk
    from nltk.book import *
    print text1[0:50]

Looks like a list of word tokens

NLTK • 3. Get a list of the most frequent word TOKENS

    import nltk
    from nltk.book import *
    fd=FreqDist(text1)
    print fd.keys()[0:10]

• FreqDist is a class defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html
• Give it a list of word tokens
• Its keys are automatically sorted from most to least frequent (in the NLTK version these slides use)
• Print the first 10 keys
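As a rough illustration of what a frequency distribution does, the sketch below counts tokens with the standard library's collections.Counter instead of NLTK's FreqDist (in NLTK 3, FreqDist is itself built on Counter, and most_common is the usual way to get the top entries there). It is written in Python 3 syntax, and the token list is an invented sample, not Moby Dick.

```python
from collections import Counter

# Hypothetical sample tokens standing in for text1
tokens = ["the", "whale", ",", "the", "sea", ",", "and", "the", "ship"]

# Count how many times each token occurs
fd = Counter(tokens)

# most_common(n) returns (token, count) pairs, highest count first
top = fd.most_common(2)
print(top)
```

Note that punctuation tokens such as "," are counted like any other token, which is what motivates the cleanup in the later slides.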

NLTK • 4. Now get a concordance of the third most common word

    import nltk
    from nltk.book import *
    text1.concordance("and")

concordance is a method defined for an NLTK Text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance

    concordance(self, word, width=79, lines=25)

Print a concordance for word with the specified context window.
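To make the idea of a concordance concrete, here is a minimal sketch in Python 3 that shows each occurrence of a word with a few tokens of context on either side. It is not NLTK's implementation: the function below is a hypothetical helper, and its width counts tokens per side, a simplification chosen for this example.

```python
def concordance(tokens, word, width=3):
    """Return (left context, match, right context) for each occurrence of word."""
    lines = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively so "And" and "and" both count
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample tokens for illustration
tokens = "Call me Ishmael . Some years ago , never mind how long".split()
for left, hit, right in concordance(tokens, "years", width=2):
    print(left, "[" + hit + "]", right)
```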

String Operations • 5. What if you don't want punctuation in your list? • First, a simple way to fix it:

    import nltk
    from nltk.book import *
    mobyDick=[x.replace(",","") for x in text1]
    mobyDick=[x.replace(";","") for x in mobyDick]
    mobyDick=[x.replace(".","") for x in mobyDick]
    mobyDick=[x.replace("'","") for x in mobyDick]
    mobyDick=[x.replace("-","") for x in mobyDick]
    mobyDick=[x for x in mobyDick if len(x)>1]
    fd=FreqDist(mobyDick)
    print fd.keys()[0:10]

• Make a new list of tokens
• Call it mobyDick
• For each token x in the original list…
• Copy the token into the new list, except replace each , with nothing
• Repeat for the other punctuation marks, starting from mobyDick each time
• Then, finally, keep only the tokens longer than one character (this drops what was originally “.” and is now empty, and also single-character tokens such as “a”)
• Make a new FreqDist with the new list of tokens, call it fd
• Print it like before
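The chained replace calls can be tried on a small list before running them on the whole book. This sketch uses Python 3 syntax and an invented token list; it loops over the punctuation marks instead of writing one line per mark, which is an assumed variation on the slide's code with the same effect.

```python
# Invented sample tokens for illustration
tokens = ["Call", "me", "Ishmael", ".", "whale-ship", ",", "a", "don't"]

cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    # Rebuild the list with this punctuation mark removed from every token
    cleaned = [x.replace(mark, "") for x in cleaned]

# Keep only tokens longer than one character
cleaned = [x for x in cleaned if len(x) > 1]
print(cleaned)
```

The length filter removes the emptied punctuation tokens, but it also removes the one-letter word "a".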

Regular Expressions • 6. Now the more complicated, but less typing way:

    import nltk
    from nltk.book import *
    import re
    punctuation = re.compile("[,.; '-]")
    punctuationRemoved=[punctuation.sub("",x) for x in text1]
    fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
    print fd.keys()[0:10]

• Import the regular expression module
• Compile a regular expression
• The RegEx will match any one of the characters inside the brackets
• Call the “sub” method of the RegEx named punctuation
• Replace anything that matches the RegEx with nothing
• As before, do this to each token in the text1 list
• Call this new list punctuationRemoved
• Get a FreqDist of all tokens with length >1
• Print the top 10 word tokens as usual

Regular Expressions are Really Powerful and Useful!
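The same cleanup can be checked on a small sample with only the re module. This is a sketch in Python 3 syntax; the token list and the variable names are invented for the example.

```python
import re

# One character class covering the punctuation marks to strip.
# A hyphen placed last inside [...] is a literal hyphen, not a range.
punctuation = re.compile(r"[,.;'-]")

# Invented sample tokens for illustration
tokens = ["Call", "me", "Ishmael", ".", "whale-ship", ",", "a", "don't"]

# Remove every matching character from every token
punctuation_removed = [punctuation.sub("", x) for x in tokens]

# Keep only tokens longer than one character
words = [x for x in punctuation_removed if len(x) > 1]
print(words)
```

A single compiled pattern replaces the five separate replace passes, so adding another punctuation mark means editing one string.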

Quick Diversion • 7. What if you wanted to see the least common word tokens?

    import nltk
    from nltk.book import *
    import re
    print fd.keys()[-10:]

Print the tokens from position -10 to the end, that is, the last 10 keys

Quick Diversion • 8. And what if you wanted to see the frequencies with the words?

    import nltk
    from nltk.book import *
    import re
    print [(k, fd[k]) for k in fd.keys()[0:10]]

For each of the first 10 keys “k” in the FreqDist, pair the key with its count, looked up as fd[k], and print the resulting list
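Both diversions can be sketched with collections.Counter standing in for FreqDist (Python 3 syntax, invented tokens). Here most_common() with no argument returns every (token, count) pair sorted from most to least frequent, so slicing from either end gives the most or least common tokens together with their counts.

```python
from collections import Counter

# Invented sample tokens for illustration
tokens = ["whale", "sea", "whale", "ship", "whale", "sea"]
fd = Counter(tokens)

# All (token, count) pairs, highest count first
ranked = fd.most_common()

most_frequent = ranked[:2]    # first two pairs
least_frequent = ranked[-1:]  # last pair
print(most_frequent)
print(least_frequent)
```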

Back to Regular Expressions • 9. Another simple example

    import re
    myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
    colorsRegEx=re.compile("blue|red|green")
    print colorsRegEx.sub("color",myString)

• Looks similar to the RegEx that matched punctuation before
• This RegEx matches the substring “blue” or the substring “red” or the substring “green”
• Here, substitute anything that matches the RegEx with the string “color”
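Because the alternation matches substrings, it also fires inside longer words. The sketch below (Python 3 syntax, with an invented sentence) contrasts the slide's pattern with a variant that uses \b word boundaries; the variant is an addition for illustration and is not part of the original slides.

```python
import re

# Invented sentence containing "red" and "green" inside longer words
text = "I have red shoes and a bored, greenish parrot."

substring_regex = re.compile("blue|red|green")
# \b marks a word boundary; (?:...) groups the alternatives without capturing
whole_word_regex = re.compile(r"\b(?:blue|red|green)\b")

substring_result = substring_regex.sub("color", text)
whole_word_result = whole_word_regex.sub("color", text)
print(substring_result)
print(whole_word_result)
```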

Back to Regular Expressions • 10. A more interesting example

    import re
    myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."

What if we wanted to identify all of the phone numbers in the string?

    phoneNumbersRegEx=re.compile('\d{11}')
    print phoneNumbersRegEx.findall(myString)

This is a start. Output: ['18005551234']

• \d matches a digit, and {11} requires exactly 11 digits in a row
• findall returns a list of all non-overlapping substrings of myString that match the RegEx
• Also will need to know: “?” matches 0 or 1 repetitions of the previous element
• Find lots more information on regular expressions here: http://docs.python.org/library/re.html

Answer is here, but let’s derive it together:

    phoneNumbersRegEx=re.compile('1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')
    print phoneNumbersRegEx.findall(myString)
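One possible way to build up to the final pattern is to loosen it one step at a time and watch which numbers each version finds. The intermediate patterns below are an illustrative path, not necessarily the one taken in class; only the first and last patterns come from the slides. The sketch uses Python 3 syntax and raw strings.

```python
import re

my_string = ("I have red shoes and blue pants and a green shirt. "
             "My phone number is 8005551234 and my friend's phone number is "
             "(800)-565-7568 and my cell number is 1-800-123-4567. "
             "You could also call me at 18005551234 if you'd like.")

steps = [
    r"\d{11}",                         # exactly 11 digits in a row
    r"1?\d{10}",                       # make the leading 1 optional
    r"1?-?\d{3}-?\d{3}-?\d{4}",        # allow optional hyphens between groups
    r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}",  # allow optional parentheses around the area code
]

# Map each pattern to the list of matches it finds
results = {pattern: re.findall(pattern, my_string) for pattern in steps}
for pattern in steps:
    print(pattern, "->", results[pattern])
```

Each step adds one optional element with “?”, and each addition picks up one more way of writing a number.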

Homework • Webpage Identifying Information

Write two regular expressions to match:
1. Email addresses
2. Phone numbers

List, remove, or tag those found in the webpage: https://jshare.johnshopkins.edu/kchurch4/public_html/teaching/103/Spring2011/

Hint: Use part 2 of the last homework (and urllib) and two regular expressions. For phone numbers, go ahead and use the example from this class!

As always, email answers to Ken (Kenneth.Church@jhu.edu) and Ann (annirvine@gmail.com) by dawn Thursday, March 17
