Programming for Linguists: An Introduction to Python, 08/12/2011
Ex 1) Write a script that reads 5 words typed in by a user and tells the user which word is the shortest and which is the longest
Ex 1)
def word_length( ):
    count = 5
    list1 = [ ]
    while count > 0:
        s = raw_input("Please enter a word: ")
        list1.append(s)
        count = count - 1
    longest = list1[0]
    shortest = list1[0]
    for word in list1:
        if len(word) > len(longest):
            longest = word
        elif len(word) < len(shortest):
            shortest = word
    print shortest, "is the shortest word."
    print longest, "is the longest word."
Ex 2) Write a function that takes a sentence as an argument and calculates the average word length of the words in that sentence
Ex 2)
def awl(sent):
    wlist = [ ]
    sentence = sent.split( )
    for word in sentence:
        wlist.append(len(word))
    mean = sum(wlist) / float(len(wlist))
    print "The average word length is", mean

awl("this is a test sentence")
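• For the example call, this prints "The average word length is 3.8" (19 letters spread over 5 words)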
Ex 3) Take a short text of about 5 sentences. Write a script that splits the text into sentences (tip: use the punctuation as boundaries) and calculates the average sentence length, the average word length and the standard deviation for both values
Ex 3)
import re

def mean(list1):
    return sum(list1) / float(len(list1))

def SD(list1):
    devs = [ ]
    for item in list1:
        dev = (item - mean(list1)) ** 2
        devs.append(dev)
    return (sum(devs) / float(len(devs))) ** 0.5
def statistics(sent):
    asl = [ ]
    awl = [ ]
    sentences = re.split(r'[.!?]', sent)
    for sentence in sentences[:-1]:
        sentence = re.sub(r'\W+', ' ', sentence)
        tokens = sentence.split( )
        asl.append(len(tokens))
        for token in tokens:
            awl.append(len(token))
    print mean(asl), SD(asl)
    print mean(awl), SD(awl)

statistics("sentences")  # replace "sentences" with an actual multi-sentence text
Dictionaries • Like a list, but more general • In a list the index has to be an integer, e.g. words[4] • In a dictionary the index can be almost any type • A dictionary is like a mapping between 2 sets: keys and values • function: dict( )
• To create an empty list: list = [ ]
• To create an empty dictionary: dictionary = { }
• For example, a dictionary containing English and Spanish words:
eng2sp = { }
eng2sp['one'] = 'uno'
print eng2sp
{'one': 'uno'}
• In this case both the keys and the values are of the string type
• Like with lists, you can create dictionaries yourselves, e.g.
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print eng2sp
• Note: in general, the order of items in a dictionary is unpredictable
• You can use the keys to look up the corresponding values, e.g.
print eng2sp['two']
• The key 'two' always maps to the value 'dos', so the order of the items does not matter
• If the key is not in the dictionary, you get an error message, e.g.
print eng2sp['ten']
KeyError: 'ten'
• The len( ) function returns the number of key-value pairs:
len(eng2sp)
• The in operator tells you whether something appears as a key in the dictionary:
'one' in eng2sp
True
• BUT:
'uno' in eng2sp
False
• To see whether something appears as a value in a dictionary, you can use the values( ) function, which returns the values as a list, and then use the in operator, e.g.
'uno' in eng2sp.values( )
True
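• Putting this together, a quick check on the example dictionary (a minimal sketch):
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print len(eng2sp)                # 3 key-value pairs
print 'one' in eng2sp            # True: 'one' is a key
print 'uno' in eng2sp            # False: 'uno' is a value, not a key
print 'uno' in eng2sp.values( )  # True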
A Dictionary as a Set of Counters • Suppose you want to count the number of times each letter occurs in a string, you could: • create 26 variables, traverse the string and, for each letter, add 1 to the corresponding counter • create a dictionary with letters as keys and counters as the corresponding values
def frequencies(sent):
    freq_dict = { }
    for let in sent:
        if let not in freq_dict:
            freq_dict[let] = 1
        else:
            freq_dict[let] += 1
    return freq_dict

frequencies("abracadabra")
The first line of the function creates an empty dictionary • The for loop traverses the string • Each time through the loop, if the letter is not in the dictionary, we create a new key item with the initial value 1 • If the letter is already in the dictionary we add 1 to its corresponding value
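• For example, calling frequencies("abracadabra") returns a dictionary like this (the key order may differ, since dictionary order is unpredictable):
{'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}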
Write a function that counts the word frequencies in a sentence instead of the letter frequencies using a dictionary
def words(sent):
    word_freq = { }
    wordlist = sent.split( )
    for word in wordlist:
        if word not in word_freq:
            word_freq[word] = 1
        else:
            word_freq[word] += 1
    return word_freq

words("this is is a a test sentence")
Reverse Lookup • Given a dictionary "word_freq" and a key "is", finding the corresponding value: word_freq["is"] • This operation is called a lookup • What if you know the value and want to look up the corresponding key?
Previous example:
def words(sent):
    word_freq = { }
    wordlist = sent.split( )
    for word in wordlist:
        if word not in word_freq:
            word_freq[word] = 1
        else:
            word_freq[word] += 1
    return word_freq

w_fr = words("this is is a a test sentence")
Write a function that takes as arguments the variable w_fr and a number nr (the number of times a word occurs in the sentence) and returns a list of the words that occur nr times, or the message "There are no words in the sentence that occur nr times."
def reverse_lookup(w_fr, nr):
    list1 = [ ]
    for word in w_fr:
        if w_fr[word] == nr:
            list1.append(word)
    if len(list1) > 0:
        return list1
    else:
        print "There are no words in the sentence that occur", nr, "times."
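• For instance, with the w_fr dictionary built above from "this is is a a test sentence":
print reverse_lookup(w_fr, 2)   # ['is', 'a'] (order may vary)
reverse_lookup(w_fr, 5)         # prints the "no words" message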
Sorting a Dictionary According to its Values • First you need to import itemgetter: from operator import itemgetter • To go over each item in a dictionary you can use .iteritems( ) • To sort the dictionary according to the values, use key = itemgetter(1) • To sort in decreasing order: reverse = True
from operator import itemgetter

def words(s):
    w_fr = { }
    wordlist = s.split( )
    for word in wordlist:
        if word not in w_fr:
            w_fr[word] = 1
        else:
            w_fr[word] += 1
    h = sorted(w_fr.iteritems( ), key = itemgetter(1), reverse = True)
    return h
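• Called on the earlier example sentence, this returns a list of (word, frequency) tuples, most frequent first (the order among ties may vary):
print words("this is is a a test sentence")
# e.g. [('is', 2), ('a', 2), ('this', 1), ('test', 1), ('sentence', 1)]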
Inverting Dictionaries • It could be useful to invert a dictionary: keys and values switch places
def invert_dict(d):
    inv = { }
    for key in d:
        value = d[key]
        if value not in inv:
            inv[value] = [key]
        else:
            inv[value].append(key)
    return inv
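• Applied to the word frequencies from before, the inverted dictionary maps each frequency to the list of words that have it (a sketch; the order inside the lists may vary):
w_fr = {'this': 1, 'is': 2, 'a': 2, 'test': 1, 'sentence': 1}
print invert_dict(w_fr)
# e.g. {1: ['this', 'test', 'sentence'], 2: ['is', 'a']}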
Getting Started with NLTK • In IDLE:
import nltk
nltk.download( )
Searching Texts • Start your script by importing all texts in NLTK:
from nltk.book import *
• text1: Moby Dick by Herman Melville 1851
• text2: Sense and Sensibility by Jane Austen 1811
• text3: The Book of Genesis
• text4: Inaugural Address Corpus
• text5: Chat Corpus
• text6: Monty Python and the Holy Grail
• text7: Wall Street Journal
• text8: Personals Corpus
• text9: The Man Who Was Thursday by G. K. Chesterton 1908
• Any time you want to find out about these texts, just enter their names at the Python prompt:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
• A concordance view shows every occurrence of a given word, together with some context, e.g. "monstrous" in Moby Dick:
text1.concordance("monstrous")
• Try looking up the context of "lol" in the chat corpus (text5)
• If you have a corpus that contains texts that are spread over time, you can look up how some words are used differently over time, e.g. the Inaugural Address Corpus (dates back to 1789): words like "nation", "terror", "God"…
• You can also examine what other words appear in a similar context, e.g.
text1.similar("monstrous")
• common_contexts( ) allows you to examine the contexts that are shared by two or more words, e.g.
text1.common_contexts(["very", "monstrous"])
You can also determine the location of a word in the text • This positional information can be displayed using a dispersion plot • Each stripe represents an instance of a word, and each row represents the entire text, e.g. text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
Counting Tokens • To count the number of tokens (words + punctuation marks), just use the len( ) function, e.g. len(text5) • To count the number of unique tokens, you have to make a set and take its length, e.g. len(set(text5))
• If you want them sorted alphabetically, try this: sorted(set(text5))
• Note: in Python all capitalized words precede lowercase words (you can use .lower( ) first to avoid this)
• Now you can calculate the lexical diversity of a text, e.g. the chat corpus (text5): 45010 tokens, 6066 unique tokens or types. The lexical diversity = nr of types / nr of tokens
• Use the Python functions to calculate the lexical diversity of text5
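• A possible one-line solution (float( ) avoids Python 2 integer division):
print len(set(text5)) / float(len(text5))   # 6066 / 45010, about 0.13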
Frequency Distributions • To find the n most frequent tokens: FreqDist( ), e.g.
fdist = FreqDist(text1)
fdist["have"]
760
all_tokens = fdist.keys( )
all_tokens[:50]
• The function .keys( ) combined with FreqDist( ) also gives you a list of all the unique tokens in the text
• Frequency distributions can be informative, BUT the most frequent words usually are function words (the, of, and, …)
• What proportion of the text is taken up with such words? Cumulative frequency plot:
fdist.plot(50, cumulative=True)
• If frequent tokens do not give enough information, what about infrequent tokens? Hapaxes = tokens which occur only once:
fdist.hapaxes( )
• Without their context, you do not get much information either
Fine-grained Selection of Tokens • Extract tokens of a certain minimum length:
tokens = set(text1)
long_tokens = [ ]
for token in tokens:
    if len(token) >= 15:
        long_tokens.append(token)
OR
long_tokens = list(token for token in tokens if len(token) >= 15)
• BUT: very long words are often hapaxes
• You can also extract frequently occurring long words of a certain length:
words = set(text1)
fdist = FreqDist(text1)
freq_long_words = list(word for word in words if len(word) >= 7 and fdist[word] >= 7)
Collocations and Bigrams • A collocation is a sequence of words that occur together unusually often, e.g. "red wine" is a collocation, "yellow wine" is not • Collocations are essentially just frequent bigrams (word pairs), but you can find bigrams that occur more often than is to be expected based on the frequency of the individual words:
text8.collocations( )
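• To see the raw bigrams behind such a measure, you can use NLTK's bigrams( ) function (a sketch; in newer NLTK versions bigrams( ) returns a generator, hence the list( ) call):
import nltk
print list(nltk.bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]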
Some Functions for NLTK's Frequency Distributions
• fdist = FreqDist(samples)  creates a frequency distribution
• fdist["word"]  count of "word"
• fdist.freq("word")  relative frequency of "word"
• fdist.N( )  total number of samples
• fdist.keys( )  the samples sorted in order of decreasing frequency
• for sample in fdist:  iterates over the samples in order of decreasing frequency
• fdist.max( )  sample with the greatest count
• fdist.plot( )  graphical plot of the frequency distribution
• fdist.plot(cumulative=True)  cumulative plot of the frequency distribution
• fdist1 < fdist2  tests if the samples in fdist1 occur less frequently than in fdist2
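• A quick tour of these methods on text1 (a sketch; the exact numbers depend on your NLTK data):
fdist = FreqDist(text1)
print fdist.N( )            # total number of tokens in Moby Dick
print fdist.max( )          # the most frequent sample, typically a punctuation mark
print fdist.freq('whale')   # relative frequency of 'whale'
fdist.plot(50, cumulative=True)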
Accessing Corpora • NLTK also contains entire corpora, e.g.:
• Brown Corpus
• NPS Chat
• Gutenberg Corpus
• …
A complete list can be found on http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
Each of these corpora contains dozens of individual texts • To see which files are e.g. in the Gutenberg corpus in NLTK:nltk.corpus.gutenberg.fileids() • Do not forget the dot notation nltk.corpus. This tells Python the location of the corpus
You can use the dot notation to work with a corpus from NLTK or you can import a corpus at the beginning of your script:from nltk.corpus import gutenberg • After that you just have to use the name of the corpus and the dot notation before a functiongutenberg.fileids( )
• If you want to examine a particular text, e.g. Shakespeare's Hamlet, you can use the .words( ) function:
hamlet = gutenberg.words("shakespeare-hamlet.txt")
• Note that "shakespeare-hamlet.txt" is the file name that is to be found using the previous .fileids( ) function
• You can use some of the previously mentioned functions (corpus methods) on this text, e.g.
fdist_hamlet = FreqDist(hamlet)
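• As a quick check, you can reuse the lexical diversity calculation from earlier on this text (a sketch):
print len(hamlet)                             # number of tokens in Hamlet
print len(set(hamlet)) / float(len(hamlet))   # lexical diversity
print fdist_hamlet.max( )                     # most frequent token, likely a punctuation mark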
Some Corpus Methods in NLTK
• brown.raw( )  raw data from the corpus file(s)
• brown.categories( )  fileids( ) grouped per predefined categories
• brown.words( )  a list of words and punctuation tokens
• brown.sents( )  words( ) grouped into sentences
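• For example, with the Brown Corpus (a sketch; the categories argument restricts the words to one predefined category):
from nltk.corpus import brown
print brown.categories( )                   # e.g. ['adventure', 'belles_lettres', ...]
print brown.words(categories='news')[:10]   # the first 10 tokens of the news category
print brown.sents( )[0]                     # the first sentence as a list of tokens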