Programming for Linguists: An Introduction to Python, 08/12/2011
Ex 1) Write a script that reads 5 words typed in by a user and tells the user which word is the shortest and which is the longest
Ex 1)
def word_length( ):
    count = 5
    list1 = [ ]
    while count > 0:
        s = raw_input("Please enter a word: ")
        list1.append(s)
        count = count - 1
    longest = list1[0]
    shortest = list1[0]
    for word in list1:
        if len(word) > len(longest):
            longest = word
        elif len(word) < len(shortest):
            shortest = word
    print shortest, "is the shortest word."
    print longest, "is the longest word."
Ex 2) Write a function that takes a sentence as an argument and calculates the average word length of the words in that sentence
Ex 2)
def awl(sent):
    wlist = [ ]
    sentence = sent.split( )
    for word in sentence:
        wlist.append(len(word))
    mean = sum(wlist) / float(len(wlist))
    print "The average word length is", mean

awl("this is a test sentence")
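• For the example call, this prints "The average word length is 3.8" (19 letters spread over 5 words)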
Ex 3) Take a short text of about 5 sentences. Write a script that splits the text into sentences (tip: use the punctuation as boundaries) and calculates the average sentence length, the average word length and the standard deviation for both values
Ex 3)
import re

def mean(list1):
    return sum(list1) / float(len(list1))

def SD(list1):
    devs = [ ]
    for item in list1:
        dev = (item - mean(list1)) ** 2
        devs.append(dev)
    return (sum(devs) / float(len(devs))) ** 0.5
def statistics(sent):
    asl = [ ]
    awl = [ ]
    sentences = re.split(r'[.!?]', sent)
    for sentence in sentences[:-1]:
        sentence = re.sub(r'\W+', ' ', sentence)
        tokens = sentence.split( )
        asl.append(len(tokens))
        for token in tokens:
            awl.append(len(token))
    print mean(asl), SD(asl)
    print mean(awl), SD(awl)

statistics("sentences")  # replace "sentences" with an actual multi-sentence text
Dictionaries • Like a list, but more general • In a list the index has to be an integer, e.g. words[4] • In a dictionary the index can be almost any type • A dictionary is like a mapping between 2 sets: keys and values • function: dict( )
• To create an empty list: list = [ ]
• To create an empty dictionary: dictionary = { }
• For example, a dictionary containing English and Spanish words:
eng2sp = { }
eng2sp['one'] = 'uno'
print eng2sp
{'one': 'uno'}
• In this case both the keys and the values are of the string type
• Like with lists, you can create dictionaries yourselves, e.g.
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print eng2sp
• Note: in general, the order of items in a dictionary is unpredictable
• You can use the keys to look up the corresponding values, e.g.
print eng2sp['two']
• The key 'two' always maps to the value 'dos', so the order of the items does not matter
• If the key is not in the dictionary, you get an error message, e.g.
print eng2sp['ten']
KeyError: 'ten'
• The len( ) function returns the number of key-value pairs:
len(eng2sp)
• The in operator tells you whether something appears as a key in the dictionary:
'one' in eng2sp
True
• BUT:
'uno' in eng2sp
False
• To see whether something appears as a value in a dictionary, you can use the values( ) function, which returns the values as a list, and then use the in operator, e.g.
'uno' in eng2sp.values( )
True
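• Putting this together, a quick check on the example dictionary (a minimal sketch):
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print len(eng2sp)                # 3 key-value pairs
print 'one' in eng2sp            # True: 'one' is a key
print 'uno' in eng2sp            # False: 'uno' is a value, not a key
print 'uno' in eng2sp.values( )  # True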
A Dictionary as a Set of Counters • Suppose you want to count the number of times each letter occurs in a string, you could: • create 26 variables, traverse the string and, for each letter, add 1 to the corresponding counter • create a dictionary with letters as keys and counters as the corresponding values
def frequencies(sent):
    freq_dict = { }
    for let in sent:
        if let not in freq_dict:
            freq_dict[let] = 1
        else:
            freq_dict[let] += 1
    return freq_dict

frequencies("abracadabra")
The first line of the function creates an empty dictionary • The for loop traverses the string • Each time through the loop, if the letter is not in the dictionary, we create a new key item with the initial value 1 • If the letter is already in the dictionary we add 1 to its corresponding value
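• For example, calling frequencies("abracadabra") returns a dictionary like this (the key order may differ, since dictionary order is unpredictable):
{'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}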
Write a function that counts the word frequencies in a sentence instead of the letter frequencies using a dictionary
def words(sent):
    word_freq = { }
    wordlist = sent.split( )
    for word in wordlist:
        if word not in word_freq:
            word_freq[word] = 1
        else:
            word_freq[word] += 1
    return word_freq

words("this is is a a test sentence")
Reverse Lookup • Given a dictionary "word_freq" and a key "is", finding the corresponding value: word_freq["is"] • This operation is called a lookup • What if you know the value and want to look up the corresponding key?
Previous example:
def words(sent):
    word_freq = { }
    wordlist = sent.split( )
    for word in wordlist:
        if word not in word_freq:
            word_freq[word] = 1
        else:
            word_freq[word] += 1
    return word_freq

w_fr = words("this is is a a test sentence")
Write a function that takes as arguments the variable w_fr and a number nr (the number of times a word occurs in the sentence) and returns a list of the words that occur nr times, or the message "There are no words in the sentence that occur nr times."
def reverse_lookup(w_fr, nr):
    list1 = [ ]
    for word in w_fr:
        if w_fr[word] == nr:
            list1.append(word)
    if len(list1) > 0:
        return list1
    else:
        print "There are no words in the sentence that occur", nr, "times."
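• For instance, with the w_fr dictionary built above from "this is is a a test sentence":
print reverse_lookup(w_fr, 2)   # ['is', 'a'] (order may vary)
reverse_lookup(w_fr, 5)         # prints the "no words" message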
Sorting a Dictionary According to its Values • First you need to import itemgetter: from operator import itemgetter • To go over each item in a dictionary you can use .iteritems( ) • To sort the dictionary according to the values, use key = itemgetter(1) • To sort in decreasing order: reverse = True
from operator import itemgetter

def words(s):
    w_fr = { }
    wordlist = s.split( )
    for word in wordlist:
        if word not in w_fr:
            w_fr[word] = 1
        else:
            w_fr[word] += 1
    h = sorted(w_fr.iteritems( ), key = itemgetter(1), reverse = True)
    return h
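• Called on the earlier example sentence, this returns a list of (word, frequency) tuples, most frequent first (the order among ties may vary):
print words("this is is a a test sentence")
# e.g. [('is', 2), ('a', 2), ('this', 1), ('test', 1), ('sentence', 1)]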
Inverting Dictionaries • It could be useful to invert a dictionary: keys and values switch places
def invert_dict(d):
    inv = { }
    for key in d:
        value = d[key]
        if value not in inv:
            inv[value] = [key]
        else:
            inv[value].append(key)
    return inv
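• Applied to the word frequencies from before, the inverted dictionary maps each frequency to the list of words that have it (a sketch; the order inside the lists may vary):
w_fr = {'this': 1, 'is': 2, 'a': 2, 'test': 1, 'sentence': 1}
print invert_dict(w_fr)
# e.g. {1: ['this', 'test', 'sentence'], 2: ['is', 'a']}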
Getting Started with NLTK • In IDLE:
import nltk
nltk.download( )
Searching Texts • Start your script by importing all texts in NLTK:
from nltk.book import *
• text1: Moby Dick by Herman Melville 1851
• text2: Sense and Sensibility by Jane Austen 1811
• text3: The Book of Genesis
• text4: Inaugural Address Corpus
• text5: Chat Corpus
• text6: Monty Python and the Holy Grail
• text7: Wall Street Journal
• text8: Personals Corpus
• text9: The Man Who Was Thursday by G. K. Chesterton 1908
• Any time you want to find out about these texts, just enter their names at the Python prompt:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
• A concordance view shows every occurrence of a given word, together with some context, e.g. "monstrous" in Moby Dick:
text1.concordance("monstrous")
• Try looking up the context of "lol" in the chat corpus (text5)
• If you have a corpus that contains texts that are spread over time, you can look up how some words are used differently over time, e.g. the Inaugural Address Corpus (dates back to 1789): words like "nation", "terror", "God"…
• You can also examine what other words appear in a similar context, e.g.
text1.similar("monstrous")
• common_contexts( ) allows you to examine the contexts that are shared by two or more words, e.g.
text1.common_contexts(["very", "monstrous"])
You can also determine the location of a word in the text • This positional information can be displayed using a dispersion plot • Each stripe represents an instance of a word, and each row represents the entire text, e.g. text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
Counting Tokens • To count the number of tokens (words + punctuation marks), just use the len( ) function, e.g. len(text5) • To count the number of unique tokens, you have to make a set and take its length, e.g. len(set(text5))
• If you want them sorted alphabetically, try this: sorted(set(text5))
• Note: in Python all capitalized words precede lowercase words (you can use .lower( ) first to avoid this)
• Now you can calculate the lexical diversity of a text, e.g. the chat corpus (text5): 45010 tokens, 6066 unique tokens or types. The lexical diversity = nr of types / nr of tokens
• Use the Python functions to calculate the lexical diversity of text5
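• A possible one-line solution (float( ) avoids Python 2 integer division):
print len(set(text5)) / float(len(text5))   # 6066 / 45010, about 0.13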
Frequency Distributions • To find the n most frequent tokens: FreqDist( ), e.g.
fdist = FreqDist(text1)
fdist["have"]
760
all_tokens = fdist.keys( )
all_tokens[:50]
• The function .keys( ) combined with FreqDist( ) also gives you a list of all the unique tokens in the text
• Frequency distributions can be informative, BUT the most frequent words usually are function words (the, of, and, …)
• What proportion of the text is taken up with such words? Cumulative frequency plot:
fdist.plot(50, cumulative=True)
• If frequent tokens do not give enough information, what about infrequent tokens? Hapaxes = tokens which occur only once:
fdist.hapaxes( )
• Without their context, you do not get much information either
Fine-grained Selection of Tokens • Extract tokens of a certain minimum length:
tokens = set(text1)
long_tokens = [ ]
for token in tokens:
    if len(token) >= 15:
        long_tokens.append(token)
OR
long_tokens = list(token for token in tokens if len(token) >= 15)
• BUT: very long words are often hapaxes
• You can also extract frequently occurring long words of a certain length:
words = set(text1)
fdist = FreqDist(text1)
freq_long_words = list(word for word in words if len(word) >= 7 and fdist[word] >= 7)
Collocations and Bigrams • A collocation is a sequence of words that occur together unusually often, e.g. "red wine" is a collocation, "yellow wine" is not • Collocations are essentially just frequent bigrams (word pairs), but you can find bigrams that occur more often than is to be expected based on the frequency of the individual words:
text8.collocations( )
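• To see the raw bigrams behind such a measure, you can use NLTK's bigrams( ) function (a sketch; in newer NLTK versions bigrams( ) returns a generator, hence the list( ) call):
import nltk
print list(nltk.bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]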
Some Functions for NLTK's Frequency Distributions
• fdist = FreqDist(samples)  creates a frequency distribution
• fdist["word"]  count of "word"
• fdist.freq("word")  relative frequency of "word"
• fdist.N( )  total number of samples
• fdist.keys( )  the samples sorted in order of decreasing frequency
• for sample in fdist:  iterates over the samples in order of decreasing frequency
• fdist.max( )  sample with the greatest count
• fdist.plot( )  graphical plot of the frequency distribution
• fdist.plot(cumulative=True)  cumulative plot of the frequency distribution
• fdist1 < fdist2  tests if the samples in fdist1 occur less frequently than in fdist2
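• A quick tour of these methods on text1 (a sketch; the exact numbers depend on your NLTK data):
fdist = FreqDist(text1)
print fdist.N( )            # total number of tokens in Moby Dick
print fdist.max( )          # the most frequent sample, typically a punctuation mark
print fdist.freq('whale')   # relative frequency of 'whale'
fdist.plot(50, cumulative=True)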
Accessing Corpora • NLTK also contains entire corpora, e.g.:
• Brown Corpus
• NPS Chat
• Gutenberg Corpus
• …
A complete list can be found on http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
Each of these corpora contains dozens of individual texts • To see which files are e.g. in the Gutenberg corpus in NLTK:nltk.corpus.gutenberg.fileids() • Do not forget the dot notation nltk.corpus. This tells Python the location of the corpus
You can use the dot notation to work with a corpus from NLTK or you can import a corpus at the beginning of your script:from nltk.corpus import gutenberg • After that you just have to use the name of the corpus and the dot notation before a functiongutenberg.fileids( )
• If you want to examine a particular text, e.g. Shakespeare's Hamlet, you can use the .words( ) function:
hamlet = gutenberg.words("shakespeare-hamlet.txt")
• Note that "shakespeare-hamlet.txt" is the file name that is to be found using the previous .fileids( ) function
• You can use some of the previously mentioned functions (corpus methods) on this text, e.g.
fdist_hamlet = FreqDist(hamlet)
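• As a quick check, you can reuse the lexical diversity calculation from earlier on this text (a sketch):
print len(hamlet)                             # number of tokens in Hamlet
print len(set(hamlet)) / float(len(hamlet))   # lexical diversity
print fdist_hamlet.max( )                     # most frequent token, likely a punctuation mark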
Some Corpus Methods in NLTK
• brown.raw( )  raw data from the corpus file(s)
• brown.categories( )  fileids( ) grouped per predefined categories
• brown.words( )  a list of words and punctuation tokens
• brown.sents( )  words( ) grouped into sentences
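• For example, with the Brown Corpus (a sketch; the categories argument restricts the words to one predefined category):
from nltk.corpus import brown
print brown.categories( )                   # e.g. ['adventure', 'belles_lettres', ...]
print brown.words(categories='news')[:10]   # the first 10 tokens of the news category
print brown.sents( )[0]                     # the first sentence as a list of tokens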