
Programming for Linguists


Presentation Transcript


  1. Programming for Linguists: An Introduction to Python (20/12/2012)

  2. Exercise 1

     import string

     # remove punctuation function
     def removePunct(sent):
         no_punct = sent.translate(None, string.punctuation)
         return no_punct

     # split sentence into words function
     def getWords(sent):
         words = sent.split()
         return words

  3.  # use the previous functions to get the average word length
      def avWordLength(sent):
          # call the removePunct function
          no_punct = removePunct(sent)
          # use the result in the getWords function
          words = getWords(no_punct)
          # work with the result from getWords
          lengths = []
          for w in words:
              lengths.append(len(w))
          av_word_length = sum(lengths) / float(len(words))
          return av_word_length
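      A quick usage sketch (the sample sentence is my own, not from the slides), showing the three helpers working together:

      sample = "This is a short test sentence, with punctuation."
      print avWordLength(sample)   # should print 4.875 (39 characters over 8 words)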

  4. Exercise 2

     import re

     doc = open('/Users/claudia/Desktop/my_text.txt', 'r')
     my_text = doc.read()

     def findWords(text):
         # match whole tokens that contain a double vowel (aa, ee, oo or uu)
         pattern = r'\S*(?:aa|ee|oo|uu)\S*'
         words = re.findall(pattern, text)
         return words

     doubleVowels = findWords(my_text)
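     A quick check on a small made-up Dutch string (a sketch; the sentence is not from the slides):

     sample = "de boot vaart over de zee"
     print findWords(sample)   # ['boot', 'vaart', 'zee']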

  5. Exercise 3

     from collections import defaultdict

     def wordFeats(text):
         short_val = 10
         long_val = 0
         short_word = 'geen'   # 'geen' is Dutch for 'none'
         long_word = 'geen'
         hapaxes = []
         wordFreqs = defaultdict(int)
         no_punct = removePunct(text)
         words = getWords(no_punct)
         for w in words:
             if len(w) > long_val:
                 long_word = w
                 long_val = len(w)
             if len(w) < short_val:
                 short_word = w
                 short_val = len(w)

  6.          wordFreqs[w] += 1   # still inside the for loop from the previous slide
         for word in wordFreqs:
             if wordFreqs[word] == 1:
                 hapaxes.append(word)
         print 'shortest', short_word
         print 'longest', long_word
         print 'hapaxes', hapaxes

     wordFeats(my_text)

  7. Exercise 4

     def findWords2(text):
         no_punct = removePunct(text)
         # Dutch articles: de, het, een (capitalised or not)
         pattern1 = r'((d|D)e|(H|h)et|(E|e)en)'
         # tokens ending in "dt"
         pattern2 = r'\S+dt'
         # tokens starting with a capital letter
         pattern3 = r'[A-Z]\S+'
         print re.findall(pattern1, no_punct)
         print re.findall(pattern2, no_punct)
         print re.findall(pattern3, no_punct)

  8. Previous lesson, Exercise 1

     from nltk import *
     from nltk.corpus import gutenberg

     def getHapaxes(text):
         # lowercase all tokens so that e.g. 'The' and 'the' are counted together
         new_words = [word.lower() for word in gutenberg.words(text)]
         fdist = FreqDist(new_words)
         return fdist.hapaxes()

     print getHapaxes('shakespeare-hamlet.txt')

  9. Exercise 2

     import nltk
     from nltk.corpus import brown

     cfd = nltk.ConditionalFreqDist(
         (genre, word)
         for genre in brown.categories()
         for word in brown.words(categories=genre))
     # note: the Brown category name is spelled with an underscore
     genres = ['news', 'humor', 'government', 'science_fiction']
     prons = ['I', 'you', 'he', 'she', 'we', 'they']
     cfd.tabulate(conditions=genres, samples=prons)

  10. Exercise 3

      from nltk import FreqDist
      from nltk.corpus import nps_chat

      def findWords3(corpus):
          ok = []
          words = corpus.words()
          fdist = FreqDist(words)
          # keep words longer than 5 characters that occur more than 5 times
          for word in fdist:
              if len(word) > 5 and fdist[word] > 5:
                  ok.append(word)
          return ok

      print findWords3(nps_chat)

  11. Exercise 4

      from nltk.corpus import PlaintextCorpusReader

      loc = "/Users/claudia/my_corpus"
      my_corpus = PlaintextCorpusReader(loc, ".*")

      def lexDiv(corpus):
          # average lexical diversity (unique words / total words) over all files
          results = []
          for fileid in corpus.fileids():
              totalWords = len(corpus.words(fileid))
              uniqueWords = len(set(corpus.words(fileid)))
              results.append(uniqueWords / float(totalWords))
          return sum(results) / len(results)

      print lexDiv(my_corpus)

  12. Exercise 5

      import nltk
      from nltk.corpus import CategorizedPlaintextCorpusReader

      loc = "/Users/claudia/my_corpus"
      # the category is taken from the subdirectory name (10s, 20s or 30s)
      my_corpus = CategorizedPlaintextCorpusReader(
          loc, r'(?!\.svn).*\.txt',
          cat_pattern=r'(10s|20s|30s)/.*')
      cfd = nltk.ConditionalFreqDist(
          (category, word)
          for category in my_corpus.categories()
          for word in my_corpus.words(categories=category))
      subcats = my_corpus.categories()
      chat = ['lol', 'omg', 'brb']
      cfd.tabulate(conditions=subcats, samples=chat)

  13. Dispersion Plot • determine the location of a word in the text: how many words from the beginning it appears
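      A minimal sketch of how to draw one with NLTK (assuming the nltk.book sample texts and matplotlib are installed; text1 is Moby Dick):

      from nltk.book import text1

      # one stripe per occurrence, positioned by word offset from the start of the text
      text1.dispersion_plot(["whale", "Ahab", "Starbuck"])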

  14. Remove stopwords

      import nltk
      from nltk.book import *
      from nltk.corpus import stopwords

      stopList = stopwords.words("english")

      How do you remove these stopwords from, e.g., the nps_chat corpus' words?

  15.  from nltk.corpus import nps_chat

       words = nps_chat.words()
       filtered = [word for word in words if word not in stopList]
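       A small follow-up sketch (my own addition, not from the slides): converting the stop list to a set makes the membership test much faster, and lowercasing catches capitalised stopwords such as "The".

       stopSet = set(stopList)                  # set lookups are O(1), list lookups are O(n)
       filtered = [w for w in words if w.lower() not in stopSet]
       print len(words), len(filtered)          # how many tokens survive the filter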

  16. Questions?

  17. Further Reading • This was only a short introduction to programming in Python; if you want to expand your programming skills further, see: • http://docs.python.org/2/ (official Python 2 documentation) • http://stackoverflow.com/ (programming Q&A forum)

  18. Think Python: How to Think Like a Computer Scientist http://www.greenteapress.com/thinkpython/ • NLTK book http://nltk.org/book/

  19. If you are interested in our work in computational linguistics or in doing your thesis with us: http://www.clips.ua.ac.be/ • http://www.clips.ua.ac.be/projects

  20. Happy holidays and good luck with your exams
