Programming for Linguists: An Introduction to Python, 15/12/2011
Tuples • A sequence of values • They are similar to lists: • the values can be any type • they are indexed by integers • Syntactically, a tuple is a comma-separated list of values:
t = 'a', 'b', 'c', 'd', 'e'
• Although it is not necessary, it is common to enclose tuples in parentheses:
t = ('a', 'b', 'c', 'd', 'e')
• To create a tuple with a single element, you have to include a final comma:
t1 = 'a',
type(t1)
• Note: a value in parentheses is not a tuple!
t2 = ('a')
type(t2)
• With no argument, the tuple() function creates a new empty tuple:
t = tuple()
• If the argument is a sequence (string, list or tuple), the result is a tuple with the elements of the sequence:
t = tuple('lupins')
print t
• Most list operators also work on tuples:
print t[0]
print t[1:3]
• BUT if you try to modify one of the elements of the tuple, you get an error message:
t[0] = 'A'
• You can't modify the elements of a tuple: a tuple is immutable!
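The tuple behaviour above can be checked in one short, self-contained sketch (written in Python 3 syntax, where print is a function, unlike the Python 2 examples on these slides):

```python
# A tuple is a comma-separated list of values; parentheses are optional.
t = 'a', 'b', 'c', 'd', 'e'
print(type(t))    # <class 'tuple'>

# A single-element tuple needs a trailing comma ...
t1 = 'a',
print(type(t1))   # <class 'tuple'>

# ... because a value in parentheses alone is NOT a tuple.
t2 = ('a')
print(type(t2))   # <class 'str'>

# tuple() with a sequence argument builds a tuple from its elements.
t3 = tuple('lupins')
print(t3)         # ('l', 'u', 'p', 'i', 'n', 's')

# Indexing and slicing work as for lists ...
print(t[0])       # a
print(t[1:3])     # ('b', 'c')

# ... but assigning to an element fails: tuples are immutable.
try:
    t[0] = 'A'
except TypeError as err:
    print('TypeError:', err)
```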
Tuple Assignment • It is often useful to swap the values of two variables, e.g. swap a with b:
temp = a
a = b
b = temp
• More elegant with a tuple assignment:
a, b = b, a
• The number of variables on the left and the number of values on the right have to be the same!
a, b = 1, 2, 3
ValueError: too many values to unpack
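A runnable version of the swap and of the mismatch error (Python 3 syntax):

```python
# Swap two variables in one tuple assignment, no temp variable needed.
a, b = 1, 2
a, b = b, a
print(a, b)  # 2 1

# The number of names on the left must match the number of values.
try:
    a, b = 1, 2, 3
except ValueError as err:
    print('ValueError:', err)  # too many values to unpack
```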
• For example: split an email address into a user name and a domain:
address = 'joske@ua.ac.be'
username, domain = address.split('@')
print username
print domain
• The return value from split('@') is a list with two elements • The first element is assigned to username, the second to domain.
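The same example as a Python 3 sketch, showing the intermediate list that split('@') returns:

```python
address = 'joske@ua.ac.be'

# split('@') returns a two-element list ...
parts = address.split('@')
print(parts)      # ['joske', 'ua.ac.be']

# ... which tuple assignment unpacks into two names.
username, domain = address.split('@')
print(username)   # joske
print(domain)     # ua.ac.be
```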
Tuples as Return Values • Strictly speaking, a function can only return one value • If the value is a tuple, the effect is the same as returning multiple values
• For example:
def min_max(t):
    return min(t), max(t)
• max() and min() are built-in functions that find the largest and smallest elements of a sequence • min_max(t) computes both and returns a tuple of two values
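In use, the returned tuple can be unpacked directly on the left-hand side (Python 3 syntax):

```python
def min_max(t):
    # Return the smallest and largest element of t as a single tuple.
    return min(t), max(t)

# The one returned tuple unpacks into two variables.
lo, hi = min_max([3, 1, 4, 1, 5, 9])
print(lo, hi)  # 1 9
```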
Dictionaries and Tuples • The .items() method used on dictionaries we saw last week actually returns a list of tuples, e.g.
>>> d = {'a': 0, 'b': 1, 'c': 2}
>>> d.items()
[('a', 0), ('c', 2), ('b', 1)]
• This way you can easily access both keys and values separately:
d = {'a': 0, 'b': 1, 'c': 2}
for letter, number in d.items():
    print letter
    print number
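A small sketch of the same loop (note: in Python 3, .items() returns a view object rather than a list, but the key-value unpacking in the loop header works the same way):

```python
d = {'a': 0, 'b': 1, 'c': 2}

# Each item is a (key, value) tuple; tuple assignment in the
# loop header unpacks it into two loop variables.
pairs = []
for letter, number in d.items():
    pairs.append((letter, number))

print(sorted(pairs))  # [('a', 0), ('b', 1), ('c', 2)]
```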
• Example: sorting a list of words by their word length:
def sort_by_length(words):
    list1 = []
    for word in words:
        list1.append((len(word), word))
    list1.sort(reverse=True)
    ordered_list = []
    for length, word in list1:
        ordered_list.append(word)
    return ordered_list
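The function above builds (length, word) tuples, sorts them, then strips the lengths off again. A Python 3 sketch with a usage example (note that equal-length words end up in reverse alphabetical order, a side effect of comparing the whole tuples):

```python
def sort_by_length(words):
    # Decorate each word with its length ...
    list1 = []
    for word in words:
        list1.append((len(word), word))
    # ... sort the tuples, longest first ...
    list1.sort(reverse=True)
    # ... and keep only the words.
    ordered_list = []
    for length, word in list1:
        ordered_list.append(word)
    return ordered_list

print(sort_by_length(['cat', 'giraffe', 'ox', 'rabbit']))
# ['giraffe', 'rabbit', 'cat', 'ox']
```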
NLTK and the Internet • A lot of text on the web is in the form of HTML documents • To access them, you first need to specify the correct location:
url = 'http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html'
• Then use the urlopen() function:
from urllib import urlopen
htmltext = urlopen(url).read()
• NLTK provides a function nltk.clean_html(), which takes an HTML string and returns raw text, e.g.
rawtext = nltk.clean_html(htmltext)
• In order to use other NLTK methods, you can then tokenize the raw text:
tokens = nltk.wordpunct_tokenize(rawtext)
• NLTK's WordPunctTokenizer takes as an argument raw text and returns a list of tokens (words + punctuation marks) • If you want to use the functions we used on the texts from nltk.book on your own texts, use the nltk.Text() function:
my_text = nltk.Text(tokens)
my_text.collocations()
• Note: if you are used to working with characters in a particular local encoding (ë, è, ...), you need to include the string '# -*- coding: <coding> -*-' as the first or second line of your script, e.g.
# -*- coding: utf-8 -*-
Writing Results to a File • It is often useful to write output to files • First you have to open/create a file for your output ('w' opens for writing, 'a' for appending):
output_file = open('(path)/output.txt', 'w')
output_file = open('(path)/output.txt', 'a')
• Now you have to write your output to the file you just opened:
list = [1, 2, 3]
output_file.write(str(list) + '\n')
• When you write non-text data to a file, you must convert it to a string first • Do not forget to close the file when you are done:
output_file.close()
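The whole write-and-close cycle as a runnable Python 3 sketch; a temporary directory stands in for the '(path)' placeholder on the slide, and the list is named results to avoid shadowing the built-in list:

```python
import os
import tempfile

# Stand-in for the '(path)' placeholder from the slide.
path = tempfile.mkdtemp()
filename = os.path.join(path, 'output.txt')

results = [1, 2, 3]
output_file = open(filename, 'w')         # 'a' would append instead
output_file.write(str(results) + '\n')    # non-text data must become a string
output_file.close()                       # always close when you are done

print(open(filename).read())  # [1, 2, 3]
```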
NLTK and automatic text classification • Classification is the computational task of choosing the correct class label for a given input text, e.g. • deciding whether an email is spam or not • deciding what the topic of a news article is (e.g. sports, politics, financial, ...) • authorship attribution
Framework (1) • Gather a training corpus: • in which a categorization is possible using metadata, e.g. • information about the author(s): name, age, gender, location • information about the texts' genre: sports, humor, romance, scientific
Framework (2) • Gather a training corpus: • for which you need to add the metadata yourself, e.g. • annotation of content-specific information: add sentiment labels to utterances • annotation of linguistic features: add POS tags to text • Result: a dataset with predefined categories
Framework (3) • Pre-processing of the dataset, e.g. tokenization, removing stop words • Feature selection: which features of the text could be informative for your classification task, e.g. • lexical features: words, word bigrams, ... • character features: n-grams • syntactic features: POS tags • semantic features: role labels • others: readability scores, type-token ratio (TTR), word length, sentence length, ...
Framework (4) • Divide your dataset into a training set and a test set (usually 90% vs 10%) • Feature selection metrics: • based on frequencies: most frequent features • based on frequency distributions per category: most informative features • in NLTK: chi-square, Student's t test, pointwise mutual information, likelihood ratio, Poisson-Stirling, Jaccard index, information gain • use them only on training data! (overfitting)
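A minimal, NLTK-free sketch of the simplest option above, frequency-based feature selection (the helper name select_features is hypothetical, not an NLTK API); the point is that the counts come from the training documents only:

```python
from collections import Counter

def select_features(train_docs, n):
    # Count word frequencies over the *training* documents only;
    # counting over the test set too would leak information (overfitting).
    counts = Counter()
    for words, category in train_docs:
        counts.update(w.lower() for w in words)
    # Keep the n most frequent words as the feature vocabulary.
    return [word for word, freq in counts.most_common(n)]

train = [(['the', 'movie', 'was', 'great'], 'pos'),
         (['the', 'movie', 'was', 'awful'], 'neg')]
print(select_features(train, 3))  # ['the', 'movie', 'was']
```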
Framework (5) • For document classification: each document in the dataset is represented by a separate instance containing the features extracted from the training data • The format of your instances depends on the classifier you want to use • Select your classifier: in NLTK: Naive Bayes, Decision Tree, Maximum Entropy, link to Weka
Framework (6) • Train the classifier using the training instances you created in the previous step • Test your trained model on previously unseen data: the test set • Evaluate your classifier's performance: accuracy, precision, recall and F-scores, confusion matrix • Perform error analysis
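The evaluation measures above can all be computed by hand from the confusion counts; a Python 3 sketch for a two-class task (the helper evaluate is hypothetical, not an NLTK function):

```python
def evaluate(gold, predicted, positive='pos'):
    # Confusion counts for the positive class.
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)

    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_score

gold      = ['pos', 'pos', 'neg', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos']
print(evaluate(gold, predicted))  # (0.5, 0.5, 0.5, 0.5)
```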
A Case Study Classification task: classifying movie reviews into positive and negative reviews
1. Import the corpus:
from nltk.corpus import movie_reviews
2. Create a list of categorized documents:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
print documents[:2]
3. Shuffle your list of documents randomly:
from random import shuffle
shuffle(documents)
4. Divide your data into training and test:
train_docs = documents[:1800]
test_docs = documents[1800:]
5. We only consider word unigram features here, so make a dictionary of all (normalized) words from the training data
train_words = {}
for (wordlist, cat) in train_docs:
    for w in wordlist:
        w = w.lower()
        if w not in train_words:
            train_words[w] = 1
        else:
            train_words[w] += 1
print len(train_words)
6. Define a feature extraction function:
def extract_features(wordlist):
    document_words = set(wordlist)
    features = {}
    for word in document_words:
        word = word.lower()
        if word in train_words:
            features[word] = (word in document_words)
    return features
print extract_features(movie_reviews.words('pos/cv957_8737.txt'))
7. Use your feature extraction function to extract all features from your training and test set:
train_feats = [(extract_features(wordlist), cat) for (wordlist, cat) in train_docs]
test_feats = [(extract_features(wordlist), cat) for (wordlist, cat) in test_docs]
8. Train e.g. NLTK's Naive Bayes classifier on the training set:
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_feats)
predicted_labels = classifier.batch_classify([fs for (fs, cat) in test_feats])
9. Evaluate the model on the test set:
print nltk.classify.accuracy(classifier, test_feats)
classifier.show_most_informative_features(20)
For Next Week • Feedback on the past exercises • Some extra exercises • If you have additional questions or problems, please e-mail me by Wednesday • The evaluation assignment will be announced
• Ex 1) Choose a website. Read it into Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending in 'ing' and sort it on its values (decreasingly). • Ex 2) Write the raw text of the text in the previous exercise to an output file.
• Ex 3) Write a script that performs the same classification task as we saw today, using word bigrams as features instead of single words.