NLTK & Python
Day 7
LING 681.02 Computational Linguistics
Harry Howard
Tulane University
Course organization • I have requested that NLTK be installed on the computers in this room. LING 681.02, Prof. Howard, Tulane University
NLPP §2 Accessing text corpora and lexical resources §2.1 Accessing text corpora
What's that word
• What is a corpus (plural: corpora)?
• "large bodies of linguistic data"
Some corpora in NLTK
• The Project Gutenberg electronic text archive
  • 25k free electronic books at http://www.gutenberg.org/
• Web and chat text
• The Brown corpus
  • First 1M-word e-corpus, from 500 sources
• The Reuters corpus
• The Inaugural Address corpus
• Annotated text corpora
• Corpora in other languages
Using corpora in NLTK
• The texts in nltk.book are already nltk.Text objects, so Text methods such as concordance() and collocations() work on them directly.
• To get the same methods for another corpus, wrap its list of words in a Text: your_text_name = nltk.Text(corpus_name)
  • Here corpus_name is the corpus's word list, e.g. gutenberg.words('austen-emma.txt').
Basic corpus functions (Table 2.3)
Code to get started
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.words('austen-emma.txt')
>>> emma = nltk.Text(emma)
>>> emma.collocations()
Frank Churchill; Miss Woodhouse; Miss Bates; Jane Fairfax; Miss
Fairfax; young man; great deal; John Knightley; Maple Grove; Miss
Smith; Miss Taylor; Robert Martin; Colonel Campbell; Box Hill; Harriet
Smith; William Larkins; Brunswick Square; young lady; young woman;
Miss Hawkins
Loading your own corpus (Table 2.3)
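A minimal sketch of loading your own plain-text files as a corpus, assuming NLTK is installed. It uses NLTK's PlaintextCorpusReader. The temporary folder, the file name sample.txt, the sample sentence and the helper load_sample_corpus are all made up for illustration. In practice corpus_root would be the folder that holds your own files.

```python
import os
import tempfile

import nltk
from nltk.corpus import PlaintextCorpusReader

# Hypothetical sample text standing in for your own corpus file.
SAMPLE = "The cat sat. The cat ran."


def load_sample_corpus():
    """Write one file to a temporary folder and read it back as a corpus."""
    with tempfile.TemporaryDirectory() as corpus_root:
        path = os.path.join(corpus_root, "sample.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(SAMPLE)
        # The second argument is a regular expression selecting the corpus files.
        reader = PlaintextCorpusReader(corpus_root, r".*\.txt")
        fileids = reader.fileids()
        # Copy the words out before the temporary folder is deleted.
        words = list(reader.words("sample.txt"))
    return fileids, words


fileids, words = load_sample_corpus()
my_text = nltk.Text(words)  # wrap the word list so Text methods are available

print(fileids)
print(words)
print(my_text.count("cat"))
```

Once the words are wrapped in nltk.Text, the result can be used like the texts in nltk.book.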
NLPP §2 Accessing text corpora and lexical resources §2.2 Conditional frequency distributions
Back to frequency
• FreqDist(mylist) counts the number of occurrences of each item in 'mylist'.
• ConditionalFreqDist(mypairs) counts the number of occurrences of each pair of items in 'mypairs', keeping a separate count for each condition,
  • where the pairing might be of author & word, genre & word, topic & word, etc.: in each case, a (condition, word) pair
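To make the difference concrete, here is a small sketch that imitates the two kinds of count with the Python standard library only. It is not NLTK's implementation, just an illustration of what gets counted. The pairs are made-up toy data, and the names freq and cond_freq are chosen for this example.

```python
from collections import Counter, defaultdict

# Made-up (condition, word) pairs; the genres and words are illustrative only.
pairs = [
    ("news", "could"), ("news", "will"), ("news", "will"),
    ("romance", "could"), ("romance", "could"), ("romance", "might"),
]

# Plain frequency distribution: one table of counts over all the words.
freq = Counter(word for _, word in pairs)

# Conditional frequency distribution: a separate table of counts per condition.
cond_freq = defaultdict(Counter)
for condition, word in pairs:
    cond_freq[condition][word] += 1

print(freq["could"])                   # 3
print(cond_freq["news"]["will"])       # 2
print(cond_freq["romance"]["could"])   # 2
```

The plain count loses track of which genre a word came from, while the conditional count keeps one table per genre.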
An example
>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
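Once a conditional frequency distribution has been built, it can be queried by condition and then by word. The sketch below assumes NLTK is installed and uses a handful of made-up (genre, word) pairs in place of the Brown corpus, so that it runs without any corpus data. The variable names are chosen for this example.

```python
import nltk

# Made-up (genre, word) pairs standing in for the pairs drawn from Brown.
toy_pairs = [
    ("news", "can"), ("news", "can"), ("news", "must"),
    ("humor", "may"), ("humor", "can"), ("humor", "may"),
]

cfd = nltk.ConditionalFreqDist(toy_pairs)

# The conditions are the first members of the pairs.
conditions = sorted(cfd.conditions())

# Indexing by a condition gives a FreqDist for that condition alone.
news_can = cfd["news"]["can"]
top_humor = cfd["humor"].most_common(1)

print(conditions)
print(news_can)
print(top_humor)
```

The same lookups apply to a distribution built from the full corpus, for example cfd['news']['could'].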
Next time
• NLPP: §2.3ff
• Do "Your Turn" up to p. 55
• Exercises 2.8.2-4, 2.8.8