Programming for Linguists: An Introduction to Python (22/12/2011)
Feedback • Ex. 1) Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of "men", "women", and "people" in each document. What has happened to the usage of these words over time?
import nltk
from nltk.corpus import state_union

cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in state_union.fileids()
    for word in state_union.words(fileids=fileid))
fileids = state_union.fileids()
search_words = ["men", "women", "people"]
cfd.tabulate(conditions=fileids, samples=search_words)
Ex 2) According to Strunk and White's Elements of Style, the word "however", used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. Use the concordance tool to study actual usage of this word in 5 NLTK texts.
import nltk
from nltk.book import *

texts = [text1, text2, text3, text4, text5]
for text in texts:
    text.concordance("however")   # concordance() prints its results itself
Ex 3) Create a corpus of your own of at least 10 files containing text fragments. You can take your own texts, texts from the internet, … Write a program that investigates the usage of modal verbs in this corpus using the frequency distribution tool, and plot the 10 most frequent words.
import nltk, re
from nltk.corpus import PlaintextCorpusReader

corpus_root = "/Users/claudia/my_corpus"   # Mac
#corpus_root = "C:\Users\..."              # Windows

my_corpus = PlaintextCorpusReader(corpus_root, '.*')
words = my_corpus.words()
cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in my_corpus.fileids()
    for word in my_corpus.words(fileid))
fileids = my_corpus.fileids()
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=fileids, samples=modals)

# keep only alphanumeric tokens, then plot the 10 most frequent ones
fd = nltk.FreqDist(w for w in words if not re.match(r'[^a-zA-Z0-9]+', w))
fd.plot(10)
Ex 1) Choose a website. Read it into Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending in 'ing' and sort it by its values in decreasing order. • Ex 2) Write the raw text of the text from the previous exercise to an output file.
import nltk
import re
from urllib import urlopen

url = "website"   # replace with the URL of the site you chose
htmltext = urlopen(url).read()
rawtext = nltk.clean_html(htmltext)
rawtext2 = rawtext.lower()
tokens = nltk.wordpunct_tokenize(rawtext2)
my_text = nltk.Text(tokens)
wordlist_ing = [w for w in tokens if re.search(r'^.*ing$', w)]
freq_dict = {}
for word in wordlist_ing:
    if word not in freq_dict:
        freq_dict[word] = 1
    else:
        freq_dict[word] = freq_dict[word] + 1

from operator import itemgetter
sorted_wordlist_ing = sorted(freq_dict.iteritems(), key=itemgetter(1), reverse=True)
Ex 2)
output_file = open("dir/output.txt", "w")
output_file.write(rawtext2 + "\n")
output_file.close()
Ex 3) Write a script that performs the same classification task as we saw today, using word bigrams as features instead of single words. (A sketch of one possible approach follows below.)
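No solution for this exercise appears in the slides, so here is a minimal sketch of one possible approach. It assumes the document classification example from the NLTK book (the movie_reviews corpus) and NLTK 2.x behaviour, where FreqDist.keys() is sorted by decreasing frequency; the feature count (2000) and the train/test split are illustrative choices, not part of the original assignment:

import nltk
import random
from nltk.corpus import movie_reviews

# build (document, category) pairs and shuffle them
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# use the 2000 most frequent word bigrams in the corpus as features
all_bigrams = nltk.FreqDist(nltk.bigrams(w.lower() for w in movie_reviews.words()))
bigram_features = all_bigrams.keys()[:2000]   # NLTK 2.x: keys() is frequency-sorted

def document_features(document):
    document_bigrams = set(nltk.bigrams(w.lower() for w in document))
    return dict(('contains(%s %s)' % bigram, bigram in document_bigrams)
                for bigram in bigram_features)

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)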
Some Mentioned Issues • Loading your own corpus in NLTK with no subcategories:

import nltk
from nltk.corpus import PlaintextCorpusReader

loc = "/Users/claudia/my_corpus"        # Mac
#loc = r"C:\Users\claudia\my_corpus"    # Windows 7 (use a raw string for backslashes)
my_corpus = PlaintextCorpusReader(loc, ".*")
Loading your own corpus in NLTK with subcategories:

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

loc = "/Users/claudia/my_corpus"        # Mac
#loc = r"C:\Users\claudia\my_corpus"    # Windows 7
my_corpus = CategorizedPlaintextCorpusReader(loc, '(?!\.svn).*\.txt', cat_pattern=r'(cat1|cat2)/.*')
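Once loaded, the category labels (here cat1 and cat2, taken from the cat_pattern above) can be used to slice the corpus, for example:

my_corpus.categories()                 # ['cat1', 'cat2']
my_corpus.fileids(categories='cat1')   # the files belonging to cat1
my_corpus.words(categories='cat2')     # words restricted to one category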
Dispersion Plot • determines the location of a word in the text: how many words from the beginning it appears
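For example, using one of the NLTK book texts (text4, the Inaugural Address Corpus; the word list is just an illustration):

import nltk
from nltk.book import text4
text4.dispersion_plot(["citizens", "democracy", "freedom", "America"])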
Exercises • Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the text, and converts the words to lowercase. You can get a list of all punctuation marks by:

import string
print string.punctuation
import nltk, string

def strip(filepath):
    f = open(filepath, 'r')
    text = f.read()
    f.close()
    tokens = nltk.wordpunct_tokenize(text)
    # build a new list: lowercase each token and drop punctuation
    # (removing items from a list while iterating over it skips elements)
    return [token.lower() for token in tokens if token not in string.punctuation]
If you want to analyse a text, but filter out a stop list first (e.g. containing "the", "and", …), you need to make 2 dictionaries: 1 with all words from your text and 1 with all words from the stop list. Then you need to subtract the 2nd from the 1st. Write a function subtract(d1, d2) which takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. You can set the values to None.
def subtract(d1, d2):
    d3 = {}
    for key in d1.keys():
        if key not in d2:
            d3[key] = None
    return d3
Let's try it out:

import nltk
from nltk.book import *
from nltk.corpus import stopwords

d1 = {}
for word in text7:
    d1[word] = None

wordlist = stopwords.words("english")
d2 = {}
for word in wordlist:
    d2[word] = None

rest_dict = subtract(d1, d2)
wordlist_min_stopwords = rest_dict.keys()
Evaluation Assignment • Deadline = 23/01/2012 • Conversation in the week of 23/01/12 • If you need any explanation about the content of the assignment, feel free to e-mail me
Further Reading • Since this was only a short introduction to programming in Python, if you want to expand your programming skills further, see chapters 15 – 18 of Think Python, which cover object-oriented programming
Think Python: How to Think Like a Computer Scientist • NLTK book • Official Python documentation: http://www.python.org/doc/ • There is a newer version of Python available (Python 3), but it is not (yet) compatible with NLTK
Our research group: CLiPS, the Computational Linguistics and Psycholinguistics Research Center: http://www.clips.ua.ac.be/ • Our projects: http://www.clips.ua.ac.be/projects