NLTK & Python Day 4. LING 681.02 Computational Linguistics Harry Howard Tulane University. Course organization. I have requested that Python and NLTK be installed on the computers in this room. NLPP. §1 Language processing & Python §1.1 Computing with language. Loading the book's texts.
NLTK & Python Day 4 LING 681.02 Computational Linguistics Harry Howard Tulane University
Course organization • I have requested that Python and NLTK be installed on the computers in this room. LING 681.02, Prof. Howard, Tulane University
NLPP §1 Language processing & Python §1.1 Computing with language
Loading the book's texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>
Searching text
• Show every token of a word in context, called a concordance view:
  • text1.concordance("monstrous")
• Show the words that appear in a similar range of contexts:
  • text1.similar("monstrous")
• Show the contexts that two or more words share; the argument is a list of words:
  • text2.common_contexts(["monstrous", "very"])
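To make the idea of a concordance view concrete, here is a minimal plain-Python sketch. It is not NLTK's implementation; the function name concordance, the width parameter, and the toy token list are assumptions made up for illustration.

```python
def concordance(tokens, word, width=3):
    """Return every occurrence of `word` with up to `width` tokens of
    context on each side (hypothetical helper, not NLTK's method)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]      # context before the hit
            right = tokens[i + 1:i + 1 + width]     # context after the hit
            lines.append(" ".join(left + [tok] + right))
    return lines

# A made-up token list standing in for a real text
toy = "a most monstrous size , and a monstrous fable too".split()
for line in concordance(toy, "monstrous"):
    print(line)
```

Each printed line is one token of the word with its neighbours, which is what text1.concordance() displays for a whole book.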
Searching text, cont.
• Plot how far each token of a word is from the beginning of a text:
  • text1.dispersion_plot(["monstrous"])
  • Needs NumPy & Matplotlib, though it didn't work for me.
• Generate random text:
  • text1.generate()
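If the plot is not available, the numbers behind it can still be inspected. A dispersion plot marks the offset of each token of a word from the start of the text; the sketch below computes those offsets in plain Python. The helper name offsets and the toy list are assumptions for illustration, not part of NLTK.

```python
def offsets(tokens, word):
    """Return the positions (counting from 0) at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# A made-up token list standing in for a real text
toy = ['a', 'monstrous', 'whale', 'and', 'a', 'monstrous', 'wave']
print(offsets(toy, 'monstrous'))
```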
Counting vocabulary
• Count the word and punctuation tokens in a text:
  • len(text1)
• List the distinct words, i.e. the word types, in a text:
  • set(text1)
• Count how many types there are in a text:
  • len(set(text1))
• Count the tokens of a word type:
  • text1.count("smote")
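The same counts can be tried without NLTK on a small list of tokens. The list below is made up for illustration; with the book's texts loaded, text1 would take its place.

```python
# A made-up token list; punctuation marks count as tokens too
tokens = ['the', 'whale', 'smote', 'the', 'ship', ',',
          'and', 'the', 'ship', 'sank', '.']

n_tokens = len(tokens)            # every token, repeats included
types = set(tokens)               # a set keeps one copy of each token
n_types = len(types)              # number of distinct word types
n_the = tokens.count('the')       # tokens of the type 'the'

print(n_tokens, n_types, n_the)
```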
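Lexical richness or diversity
• The lexical richness or diversity of a text can be estimated as the average number of tokens per type:
  • len(text1) / len(set(text1))
• The relative frequency of a word type, as a percentage, is its token count divided by the total number of tokens:
  • 100 * text1.count('a') / len(text1)
• In Python 2 this is integer division, however, so the fractional part is discarded.
  • p. 8: "from __future__ import division" switches to floating-point division. It needs two underscores on each side of "future"; typing "_future_" with single underscores produces an error.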
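A small worked example of the two ratios, assuming Python 3, where / is always floating-point division and // gives the truncated result that Python 2 returns for integers. The token list is made up for illustration.

```python
# A made-up token list: 11 tokens, 8 types
tokens = ['the', 'whale', 'smote', 'the', 'ship', ',',
          'and', 'the', 'ship', 'sank', '.']

# Tokens per type (Python 3: true division)
diversity = len(tokens) / len(set(tokens))

# Percentage of all tokens that are 'the'
percent_the = 100 * tokens.count('the') / len(tokens)

# What integer division would give instead
diversity_floor = len(tokens) // len(set(tokens))

print(diversity, round(percent_the, 2), diversity_floor)
```

The gap between 1.375 and 1 shows why integer division is a problem for a ratio like this.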
Making your own function in Python
• To save you from typing the same thing over and over, you can define your own function:
  >>> def lexical_diversity(text):
  ...     return len(text) / len(set(text))
• You call this function by typing its name and filling in the argument, a text name, in the parentheses:
  >>> lexical_diversity(text1)
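The point of the parameter is that the body works on whatever text is passed in, not on one fixed text. The sketch below assumes Python 3 division and uses two made-up token lists; the second function, percentage, is an extra helper added here for illustration.

```python
def lexical_diversity(text):
    """Average number of tokens per type in `text` (a list of strings)."""
    return len(text) / len(set(text))

def percentage(count, total):
    """Express `count` as a percentage of `total`."""
    return 100 * count / total

# Two made-up token lists
sent_a = ['Call', 'me', 'Ishmael', '.']
sent_b = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird']

print(lexical_diversity(sent_a))   # no repeated tokens
print(lexical_diversity(sent_b))   # 8 tokens, 5 types
print(percentage(sent_b.count('the'), len(sent_b)))
```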
Other functions
• Sort the word types in a text alphabetically:
  • sorted(set(text1))
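One thing worth noticing about the sort order: Python compares strings by character code, so punctuation comes first and capitalized words come before lowercase ones. A tiny made-up example:

```python
# A made-up token list with a repeat, punctuation, and mixed case
tokens = ['the', 'Call', 'me', '.', 'Call', 'Ishmael']

vocab = sorted(set(tokens))   # distinct types, in sorted order
print(vocab)
```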
Exercises 1.8.…
• 4. … How many words are there in text2? How many distinct words are there?
• 5. Compare the lexical diversity scores for humor and romance fiction in Table 1.1. Which genre is more lexically diverse?
• 8. Consider the following Python expression: len(set(text4)). State the purpose of this expression. Describe the two steps involved in performing this computation.
NLPP §1.2 A Closer Look at Python: Texts as Lists of Words
The representation of a text
• We will think of a text as nothing more than a sequence of words and punctuation.
• The opening sentence of Moby Dick:
  >>> sent1 = ['Call', 'me', 'Ishmael', '.']
• The bracketed material is known as a list in Python.
• We can inspect it by typing the name.
• How would you find out how many words it has?
List construction
• Append one list to the end of another with '+', known as concatenation:
  >>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
  ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
  >>> sent4 + sent1
  ['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
• Append a single item to a list:
  >>> sent1.append("Some")
  >>> sent1
  ['Call', 'me', 'Ishmael', '.', 'Some']
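The two operations differ in one way that is easy to miss: '+' builds a new list and leaves the originals alone, while append() changes the list in place and returns None. A short sketch with made-up variable names:

```python
title = ['Monty', 'Python']
rest = ['and', 'the', 'Holy', 'Grail']

combined = title + rest          # a new list; title and rest are unchanged
print(combined)
print(title)

sent = ['Call', 'me', 'Ishmael', '.']
result = sent.append('Some')     # modifies sent itself
print(sent)
print(result)                    # append() gives back None
```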
List indexing
• Each element in a list is numbered in sequence, a number known as the element's index.
• Show the item that occurs at an index such as 173 in a text:
  >>> text4[173]
  'awaken'
• Show the index of an element's first occurrence:
  >>> text4.index('awaken')
  173
• Show the elements between two indices (slicing); the element at the end index is not included:
  >>> text5[16715:16735]
  >>> text5[16715:]
  >>> text5[:16735]
• Assign an element to an index of a list:
  >>> sent1[0] = 'First'
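The same operations on a short list, where the results are easy to check by eye. The list reuses the opening words of sent4; the variable names are made up for illustration.

```python
sent = ['Fellow', '-', 'Citizens', 'of', 'the',
        'Senate', 'and', 'of', 'the', 'House']

item = sent[2]              # the element at index 2
first_of = sent.index('of') # index of the FIRST 'of' only
middle = sent[3:6]          # indices 3, 4, 5; index 6 is excluded
head = sent[:2]             # from the start up to, not including, index 2
tail = sent[8:]             # from index 8 to the end

sent[0] = 'First'           # replace the element at index 0

print(item, first_of, middle, head, tail, sent[0])
```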
Python counts from 0
• Create a list:
  >>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
  ... 'word6', 'word7', 'word8', 'word9', 'word10']
• Find the first word:
  >>> sent[0]
  'word1'
• Find the last word:
  >>> sent[9]
  'word10'
• What does sent[10] do?
  • It produces a runtime error (IndexError: list index out of range), because a list of 10 elements has indices 0 through 9.
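A short sketch of the same point, with two additions that are not on the slide: the index -1 as another way to reach the last element, and try/except to catch the error instead of stopping the program.

```python
# Build the ten-word list: 'word1' ... 'word10'
sent = ['word%d' % n for n in range(1, 11)]

first = sent[0]
last = sent[len(sent) - 1]    # the last index is always len - 1
also_last = sent[-1]          # negative indices count from the end

try:
    sent[10]                  # one past the last valid index
    message = None
except IndexError as err:
    message = str(err)

print(first, last, also_last, message)
```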
List exercises
Next time NLPP: finish §1 and do all exercises; do up to Ex 8 in §2