Programming for Linguists: An Introduction to Python (20/12/2012)
Exercise 1

import string

# remove punctuation function
def removePunct(sent):
    no_punct = sent.translate(None, string.punctuation)
    return no_punct

# split sentence into words function
def getWords(sent):
    words = sent.split()
    return words
# use previous functions to get the average word length
def avWordLength(sent):
    # call removePunct function
    no_punct = removePunct(sent)
    # use the result in the getWords function
    words = getWords(no_punct)
    # work with the result from getWords
    lengths = []
    for w in words:
        lengths.append(len(w))
    av_word_length = sum(lengths) / float(len(words))
    return av_word_length
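The `translate(None, ...)` call in this exercise is the Python 2 form. As a rough sketch, assuming Python 3 (which the slides themselves do not use), the same three steps might look as follows. The snake_case function names are illustrative and not part of the course material.

```python
import string

def remove_punct(sent):
    # Python 3: build a translation table that deletes all ASCII punctuation
    table = str.maketrans('', '', string.punctuation)
    return sent.translate(table)

def get_words(sent):
    # split on whitespace
    return sent.split()

def av_word_length(sent):
    words = get_words(remove_punct(sent))
    if not words:
        # avoid dividing by zero for empty input
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(av_word_length("Hello, world! It is cold."))
```

The guard for empty input is an addition; the slide version would raise a ZeroDivisionError on an empty sentence.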
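Exercise 2

import re

doc = open('/Users/claudia/Desktop/my_text.txt', 'r')
my_text = doc.read()

def findWords(text):
    # any run of non-space characters containing aa, ee, oo or uu;
    # the non-capturing group makes findall return whole words
    pattern = r'\S*(?:aa|ee|oo|uu)\S*'
    words = re.findall(pattern, text)
    return words

doubleVowels = findWords(my_text)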
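Exercise 3

from collections import defaultdict

def wordFeats(text):
    short_val = 10        # starting value: only words shorter than 10 characters can become the shortest
    long_val = 0
    short_word = 'geen'   # 'geen' is Dutch for 'none'
    long_word = 'geen'
    hapaxes = []
    wordFreqs = defaultdict(int)
    no_punct = removePunct(text)
    words = getWords(no_punct)
    for w in words:
        # track the longest and shortest word seen so far
        if len(w) > long_val:
            long_word = w
            long_val = len(w)
        if len(w) < short_val:
            short_word = w
            short_val = len(w)
        # count every occurrence
        wordFreqs[w] += 1
    # hapaxes are the words that occur exactly once
    for word in wordFreqs:
        if wordFreqs[word] == 1:
            hapaxes.append(word)
    print 'shortest', short_word
    print 'longest', long_word
    print 'hapaxes', hapaxes

wordFeats(my_text)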
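As an alternative sketch of the same idea (not the course's solution), the word features can also be computed with `collections.Counter` and the built-in `min` and `max`. This assumes Python 3 syntax and a non-empty word list. It also avoids the fixed starting length of 10, which would leave the shortest word unset if every word had ten or more characters. The function name is illustrative.

```python
from collections import Counter

def word_feats(words):
    """Return (shortest, longest, hapaxes) for a non-empty list of words."""
    freqs = Counter(words)
    shortest = min(words, key=len)   # first word of minimal length
    longest = max(words, key=len)    # first word of maximal length
    # hapaxes: words that occur exactly once
    hapaxes = [w for w, n in freqs.items() if n == 1]
    return shortest, longest, hapaxes

words = "the cat saw the elephant".split()
print(word_feats(words))
```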
Exercise 4

def findWords2(text):
    no_punct = removePunct(text)
    pattern1 = r'((d|D)e|(H|h)et|(E|e)en)'
    pattern2 = r'\S+dt'
    pattern3 = r'[A-Z]\S+'
    print re.findall(pattern1, no_punct)
    print re.findall(pattern2, no_punct)
    print re.findall(pattern3, no_punct)
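The three patterns appear to target Dutch articles, words ending in "dt", and capitalised words. A possible refinement, sketched here in Python 3 syntax with illustrative names, adds word boundaries (`\b`) so that "de" inside a longer word is not matched, and uses non-capturing groups so that `re.findall` returns plain strings instead of tuples.

```python
import re

def find_words2(text):
    # \b anchors each match to a whole word; (?:...) keeps findall returning strings
    articles = re.findall(r'\b(?:[Dd]e|[Hh]et|[Ee]en)\b', text)
    # words ending in "dt"
    dt_words = re.findall(r'\b\w+dt\b', text)
    # words of at least two characters starting with an ASCII capital
    capitalised = re.findall(r'\b[A-Z]\w+', text)
    return articles, dt_words, capitalised

sample = "De man vindt het boek van Jan"  # hypothetical sample sentence
print(find_words2(sample))
```

Returning the three lists instead of printing them makes the function easier to reuse; that is a design choice of this sketch, not of the original exercise.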
Previous lesson, Exercise 1

from nltk import *
from nltk.corpus import gutenberg

def getHapaxes(text):
    new_words = [word.lower() for word in gutenberg.words(text)]
    fdist = FreqDist(new_words)
    return fdist.hapaxes()

print getHapaxes('shakespeare-hamlet.txt')
Exercise 2

import nltk
from nltk.corpus import brown

# count every word once per genre it occurs in
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

# the Brown category is spelled with an underscore: 'science_fiction'
genres = ['news', 'humor', 'government', 'science_fiction']
prons = ['I', 'you', 'he', 'she', 'we', 'they']
cfd.tabulate(conditions=genres, samples=prons)
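To make clear what a conditional frequency distribution counts, here is a rough pure-Python sketch of the same bookkeeping. It is not NLTK's implementation. The toy `(genre, word)` pairs and the function name are invented for illustration, and Python 3 syntax is assumed.

```python
from collections import Counter, defaultdict

def conditional_freqs(pairs):
    """Count samples per condition from an iterable of (condition, sample) pairs."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

# hypothetical toy data standing in for (genre, word) pairs from a corpus
pairs = [('news', 'he'), ('news', 'they'), ('news', 'he'),
         ('humor', 'I'), ('humor', 'you'), ('humor', 'I')]
cfd = conditional_freqs(pairs)

# one row per genre, one column per pronoun
for genre in ['news', 'humor']:
    print(genre, [cfd[genre][p] for p in ['I', 'you', 'he', 'they']])
```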
Exercise 3

from nltk import FreqDist
from nltk.corpus import nps_chat

def findWords3(corpus):
    ok = []
    words = corpus.words()
    fdist = FreqDist(words)
    # keep words longer than 5 characters that occur more than 5 times
    for word in fdist:
        if len(word) > 5 and fdist[word] > 5:
            ok.append(word)
    return ok

print findWords3(nps_chat)
Exercise 4

from nltk.corpus import PlaintextCorpusReader

loc = "/Users/claudia/my_corpus"
my_corpus = PlaintextCorpusReader(loc, ".*")

def lexDiv(corpus):
    results = []
    for fileid in corpus.fileids():
        totalWords = len(corpus.words(fileid))
        uniqueWords = len(set(corpus.words(fileid)))
        results.append(uniqueWords / float(totalWords))
    return sum(results) / len(results)

print lexDiv(my_corpus)
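The lexical diversity computed here is the number of distinct words divided by the total number of words, averaged over all files. A small sketch with in-memory word lists in place of a corpus reader, assuming Python 3 and hypothetical file names, shows the arithmetic:

```python
def lexical_diversity(words):
    # distinct words divided by total words
    return len(set(words)) / len(words)

def mean_lexical_diversity(docs):
    # docs maps a file name to its list of words; empty files are skipped
    scores = [lexical_diversity(ws) for ws in docs.values() if ws]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

# hypothetical two-file "corpus"
docs = {
    'a.txt': 'the cat and the dog'.split(),   # 4 distinct / 5 total = 0.8
    'b.txt': 'one two three four'.split(),    # 4 distinct / 4 total = 1.0
}
print(mean_lexical_diversity(docs))
```

Skipping empty files is a choice made in this sketch; the exercise code would raise a ZeroDivisionError for a file without words.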
Exercise 5

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

loc = "/Users/claudia/my_corpus"
# read all .txt files (skipping .svn entries); the subdirectory name is the category
my_corpus = CategorizedPlaintextCorpusReader(
    loc, r'(?!\.svn).*\.txt', cat_pattern=r'(10s|20s|30s)/.*')

cfd = nltk.ConditionalFreqDist(
    (category, word)
    for category in my_corpus.categories()
    for word in my_corpus.words(categories=category))

subcats = my_corpus.categories()
chat = ['lol', 'omg', 'brb']
cfd.tabulate(conditions=subcats, samples=chat)
Dispersion Plot
• Shows where a word occurs in the text: how many words from the beginning each occurrence appears.
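The positions that a dispersion plot displays can be computed by hand. The following sketch, in Python 3 syntax with an illustrative function name and sample text, collects the word offsets for a few target words. It only computes the offsets; drawing them (as NLTK's `Text.dispersion_plot` does) is left out.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {t: [] for t in targets}
    for i, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(i)
    return offsets

tokens = "to be or not to be that is the question".split()
print(word_offsets(tokens, ['to', 'be', 'question']))
```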
Remove stopwords

import nltk
from nltk.book import *
from nltk.corpus import stopwords

stopList = stopwords.words("english")

How do you remove these stopwords from, e.g., the words of the nps_chat corpus?
from nltk.corpus import nps_chat

words = nps_chat.words()
filtered = [word for word in words if word not in stopList]
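One detail worth checking is capitalisation: NLTK's English stopword list is lowercase, so a capitalised "The" would pass through the filter above. A hedged sketch of a case-insensitive variant follows, using a tiny made-up stopword set instead of the NLTK list and assuming Python 3 syntax.

```python
# hypothetical mini stopword list; in practice this would come from stopwords.words("english")
STOP_WORDS = {'the', 'a', 'is', 'and', 'of'}

def remove_stopwords(words, stop_words=STOP_WORDS):
    # compare in lowercase so that "The" is filtered as well as "the"
    return [w for w in words if w.lower() not in stop_words]

print(remove_stopwords("The cat is on the mat".split()))
```

Storing the stopwords in a set rather than a list also makes each membership test faster, which matters on a corpus with many tokens.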
Further Reading
• This was only a short introduction to programming in Python. To expand your programming skills further, see:
• http://docs.python.org/2/ (official Python 2 documentation)
• http://stackoverflow.com/ (question-and-answer forum)
• Think Python: How to Think Like a Computer Scientist, http://www.greenteapress.com/thinkpython/
• NLTK book, http://nltk.org/book/
If you are interested in our work in computational linguistics or in doing your thesis in this area:
http://www.clips.ua.ac.be/
http://www.clips.ua.ac.be/projects