NLTK & Python Day 5 LING 681.02 Computational Linguistics Harry Howard Tulane University
Course organization • I have requested that Python and NLTK be installed on the computers in this room. LING 681.02, Prof. Howard, Tulane University
NLPP §1.2 A Closer Look at Python: Texts as Lists of Words
Variables
• variable = expression
>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
... 'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']
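The slice-and-sort steps on this slide can be sketched in plain Python, with no NLTK needed. The last step, a case-insensitive sort with key=str.lower, is an extra that is not on the slide; it is included only to show why 'bold' ends up last in the default sort (uppercase letters compare as smaller than lowercase ones).

```python
# The sentence from the slide, stored as a list of word tokens.
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']

# The slice [1:4] takes the items at positions 1, 2 and 3 (the end index is excluded).
noun_phrase = my_sent[1:4]

# sorted() returns a new list; capitalized words come before lowercase ones.
words = sorted(noun_phrase)

# Extra step (not on the slide): ignore case while sorting.
words_ignoring_case = sorted(noun_phrase, key=str.lower)

print(noun_phrase)
print(words)
print(words_ignoring_case)
```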
How to name variables
• Valid names (or identifiers) …
• must start with a letter or an underscore, optionally followed by letters, digits, or underscores;
• are case-sensitive;
• cannot contain whitespace (use an underscore instead) or a dash (it means minus);
• cannot be a reserved word.
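These naming rules can be checked from Python itself. The sketch below combines the built-in str.isidentifier() with keyword.iskeyword() from the standard library; the helper is_valid_name and the candidate names are made up for illustration. Note that Python also accepts a leading underscore.

```python
import keyword

# Hypothetical candidate names, chosen to exercise each rule.
candidates = ['my_sent', 'wOrDs', '_tmp', '2nd_word',
              'noun-phrase', 'noun phrase', 'not']

def is_valid_name(name):
    """True if `name` is a legal identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

results = {name: is_valid_name(name) for name in candidates}

for name, ok in results.items():
    print(f'{name!r}: {"valid" if ok else "invalid"}')
```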
Strings
• A string is a sequence of characters, such as an individual word; many list operations (indexing, slicing, concatenation) also work on strings.
• Some operations and methods for strings
>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
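To make the parallel between strings and lists concrete, here is a small sketch that applies the same indexing and slicing to a string and to a list of its characters, and then shows join() and split() undoing each other. The variable names are illustrative only.

```python
name = 'Monty'
letters = list(name)            # the same characters, stored as a list

first_of_string = name[0]       # indexing a string gives a one-character string
first_of_list = letters[0]      # indexing the list gives its first item

string_slice = name[:4]         # slicing a string gives a string
list_slice = letters[:4]        # slicing a list gives a list

# join() builds one string from a list of words; split() breaks it up again.
phrase = ' '.join(['Monty', 'Python'])
round_trip = phrase.split()

print(first_of_string, first_of_list)
print(string_slice, list_slice)
print(phrase, round_trip)
```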
NLPP §1.3. Computing with Language: Simple Statistics
Frequency distribution
• What is a frequency distribution?
• It tells us the frequency of each vocabulary item in a text.
• It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
• What function in NLTK calculates it?
• fdist = FreqDist(text_name)
• What expression lists the vocabulary items of the distribution?
• fdist.keys() (called on the distribution, not on the text; in NLTK 3, fdist.most_common(n) lists the n most frequent items with their counts)
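A minimal sketch of the idea, using collections.Counter from the standard library as a stand-in so that it runs without NLTK or its corpora (NLTK 3's FreqDist is built on Counter, so keys() and most_common() behave the same way). The toy sentence is an assumption for illustration, not one of the book's texts.

```python
from collections import Counter

# Toy text (hypothetical); in class this would be text1, text2, ...
tokens = 'the whale and the ship and the sea'.split()

# Map each vocabulary item to the number of times it occurs.
fdist = Counter(tokens)

vocabulary = sorted(fdist.keys())   # the distinct word types
top_two = fdist.most_common(2)      # the most frequent items with their counts
total_tokens = sum(fdist.values())  # the counts add up to the number of tokens

print(vocabulary)
print(top_two)
print(total_tokens)
```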
Very frequent words
• How would you describe the 50 most frequent elements in Moby Dick?
>>> fdist1.plot(50, cumulative=True)
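What a cumulative plot displays can be approximated without plotting: a running total of the counts of the most frequent items, here also divided by the total number of tokens. This is only a sketch with a toy text and collections.Counter standing in for the frequency distribution; it is not how NLTK draws its plot.

```python
from collections import Counter
from itertools import accumulate

# Toy text (hypothetical), standing in for Moby Dick.
tokens = 'the whale and the ship and the sea'.split()
fdist = Counter(tokens)

# Counts of the three most frequent items, in decreasing order.
counts = [count for _, count in fdist.most_common(3)]

# Running totals: how many tokens the top 1, top 2, top 3 items cover.
running_totals = list(accumulate(counts))

# The same totals as a share of all tokens in the text.
total = sum(fdist.values())
cumulative_share = [running / total for running in running_totals]

print(running_totals)
print(cumulative_share)
```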
Very infrequent words
• Words that occur only once are called hapaxes.
>>> fdist1.hapaxes()
• In Moby Dick, "lexicographer, cetological, contraband, expostulations", and about 9,000 others.
• How would you describe them?
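Hapaxes are simply the items whose count is 1, which can be sketched with a list comprehension over a Counter. The toy text is again an assumption; with NLTK, fdist1.hapaxes() performs this selection on the real text.

```python
from collections import Counter

# Toy text (hypothetical).
tokens = 'the whale and the ship and the sea'.split()
fdist = Counter(tokens)

# Keep only the words that occur exactly once.
hapaxes = [word for word, count in fdist.items() if count == 1]

print(sorted(hapaxes))
```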
Summary
Question
• Which group would you look in to find words that help you understand what the text is about?
• Neither.
Fine-grained word selection
• Some Python expressions are based on set theory.
• {w | w ∈ V & P(w)}
• [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)
• Real NLTK
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
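One way to explore the list-versus-set question is to run the same condition through a list comprehension and a set comprehension. The sketch below uses a toy token list and smaller length thresholds than the slide; both are assumptions for illustration.

```python
# Toy text (hypothetical), with repeated words.
tokens = ['the', 'whale', 'and', 'the', 'ship', 'and', 'the', 'sea']

V = set(tokens)   # the vocabulary: duplicates are dropped, order is not kept

# Over the vocabulary, a list and a set hold the same items.
long_words_list = [w for w in V if len(w) > 3]
long_words_set = {w for w in V if len(w) > 3}

# Over the raw tokens, a list keeps duplicates and order; a set does not.
short_tokens_list = [w for w in tokens if len(w) == 3]
short_tokens_set = {w for w in tokens if len(w) == 3}

print(sorted(long_words_list), long_words_set)
print(short_tokens_list, short_tokens_set)
```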
Finding words that characterize a text
• Not too short (>?) and not too infrequent (>?)
>>> fdist1 = FreqDist(text1)
>>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7]
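The key point is that the frequency test looks up each word's own count in a distribution built from the full token list, not from the vocabulary set (where every count would be 1). A self-contained sketch with collections.Counter, a toy text, and deliberately small thresholds, all of which are assumptions:

```python
from collections import Counter

# Toy text (hypothetical): two frequent long words, one frequent short word,
# and one long word that occurs only once.
tokens = ('whale ' * 3 + 'sea ' * 4 + 'harpooneer ' * 3 + 'lexicographer').split()

fdist = Counter(tokens)   # counts come from the tokens, duplicates included
V = set(tokens)           # the vocabulary

# Hypothetical thresholds, smaller than the slide's, to suit the toy text.
MIN_LENGTH = 4
MIN_COUNT = 2

# Keep words that are both long enough and frequent enough.
informative_words = sorted(
    w for w in V if len(w) > MIN_LENGTH and fdist[w] > MIN_COUNT
)

print(informative_words)
```

Here 'sea' is dropped for being too short and 'lexicographer' for being too rare.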
Finding groups of words
• What is the name for a sequence of two words?
• Bigram ~ bigrams() (in NLTK 3, bigrams() returns a generator, so wrap it in list() to see the pairs)
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
• What is the name for a sequence of words that occur together unusually often?
• Collocation ~ collocations()
• They are essentially bigrams that occur more often than we would expect based on the frequency of individual words.
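Bigrams can be sketched without NLTK by pairing each word with its successor via zip(). The helper simple_bigrams is a hypothetical stand-in for illustration, not NLTK's bigrams(). Counting the pairs shows which ones repeat, though a true collocation measure would also compare each pair's count against the frequencies of its individual words, which this sketch does not do.

```python
from collections import Counter

def simple_bigrams(words):
    """Pair each word with the word that follows it (hypothetical helper)."""
    return list(zip(words, words[1:]))

# The example from the slide.
pairs = simple_bigrams(['more', 'is', 'said', 'than', 'done'])

# Toy text (hypothetical) in which one pair repeats.
tokens = 'the white whale and the white ship'.split()
bigram_counts = Counter(simple_bigrams(tokens))
most_frequent_pair = bigram_counts.most_common(1)

print(pairs)
print(most_frequent_pair)
```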
Example
>>> text4.collocations()
Building collocations list
United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money
Counting Other Things
Next time
• First quiz/project
• NLPP: finish §1 and do all exercises; do up to Ex 8 in §2