This lecture covers topics such as smoothing, tokenizing text, replacing and correcting words, creating custom corpora, part-of-speech tagging, extracting chunks, transforming chunks and trees, text classification, distributed processing, and parsing specific data.
Lecture 6: Hidden Markov Models • CSCE 771 Natural Language Processing • Topics • Smoothing again: • Readings: Chapters • January 16, 2013
Overview • Last Time • NLTK book • http://readwrite.com/2011/03/25/python-is-an-increasingly-popu
Chomsky on YouTube: http://www.youtube.com/watch?v=8mA4HYTO790
Python Text Processing with NLTK 2.0 Cookbook • Tokenizing Text and WordNet Basics • Replacing and Correcting Words • Creating Custom Corpora • Part-of-Speech Tagging • Extracting Chunks • Transforming Chunks and Trees • Text Classification • Distributed Processing and Handling Large Datasets • Parsing Specific Data
Chapter 1. Tokenizing Text and WordNet Basics • In this chapter, we will cover: • Tokenizing text into sentences • Tokenizing sentences into words • Tokenizing sentences using regular expressions • Filtering stopwords in a tokenized sentence • Looking up synsets for a word in WordNet • Looking up lemmas and synonyms in WordNet • Calculating WordNet synset similarity • Discovering word collocations
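Two of the recipes listed above, tokenizing with regular expressions and filtering stopwords, can be sketched with the Python standard library alone. This is a minimal illustration and not the cookbook's code: the names `tokenize_words`, `filter_stopwords`, and the tiny `STOPWORDS` set are assumptions made for this example (NLTK provides its own tokenizers and a much larger stopword corpus).

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def tokenize_words(sentence):
    """Split a sentence into word tokens, keeping apostrophes inside words."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

def filter_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = tokenize_words("Can't is a contraction of cannot.")
content_words = filter_stopwords(tokens)
print(tokens)
print(content_words)
```

The pattern treats any run of letters, digits, and apostrophes as one token, so punctuation is discarded rather than returned as separate tokens.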
Chapter 2. Replacing and Correcting Words • In this chapter, we will cover: • Stemming words • Lemmatizing words with WordNet • Translating text with Babelfish • Replacing words matching regular expressions • Removing repeating characters • Spelling correction with Enchant • Replacing synonyms • Replacing negations with antonyms • Perkins, Jacob (2010-11-09). Python Text Processing with NLTK 2.0 Cookbook (p. 25). Packt Publishing. Kindle Edition.
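As a rough illustration of the "removing repeating characters" idea, the sketch below deletes one doubled character at a time until the word stops changing or matches a known word. It uses only the standard library; the function name `remove_repeats` and the `known_words` parameter are assumptions for this example, standing in for a real dictionary lookup such as WordNet.

```python
import re

# Matches a word containing some character immediately followed by itself.
REPEAT = re.compile(r"(\w*)(\w)\2(\w*)")

def remove_repeats(word, known_words=frozenset()):
    """Remove repeated characters one at a time.

    known_words is a hypothetical stand-in for a dictionary lookup: as soon
    as the word is found there, it is returned unchanged.
    """
    if word in known_words:
        return word
    shorter = REPEAT.sub(r"\1\2\3", word)  # drop one of the doubled characters
    if shorter == word:                     # nothing left to remove
        return word
    return remove_repeats(shorter, known_words)

print(remove_repeats("looooove"))
print(remove_repeats("gooose", {"goose"}))
```

Without the dictionary check, legitimate double letters are also collapsed (for example, "goose" would lose an "o"), which is why some word list is needed in practice.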
Chapter 3. Creating Custom Corpora • In this chapter, we will cover: • Setting up a custom corpus • Creating a word list corpus • Creating a part-of-speech tagged word corpus • Creating a chunked phrase corpus • Creating a categorized text corpus • Creating a categorized chunk corpus reader • Lazy corpus loading • Creating a custom corpus view • Creating a MongoDB backed corpus reader • Corpus editing with file locking • Perkins, Jacob (2010-11-09). Python Text Processing with NLTK 2.0 Cookbook (p. 45). Packt Publishing. Kindle Edition.
Chapter 4. Part-of-Speech Tagging • Default tagging • Training a unigram part-of-speech tagger • Combining taggers with backoff tagging • Training and combining Ngram taggers • Creating a model of likely word tags • Tagging with regular expressions • Affix tagging • Training a Brill tagger • Training the TnT tagger • Using WordNet for tagging • Tagging proper names
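The first three recipes above (default tagging, unigram tagging, and backoff) fit together as a chain: a tagger answers when it knows the word and otherwise defers to its backoff. The sketch below shows that idea with the standard library only. The classes `ToyDefaultTagger` and `ToyUnigramTagger` and the two training sentences are invented for this example and are not NLTK's classes or data.

```python
from collections import Counter, defaultdict

class ToyDefaultTagger:
    """Assigns the same tag to every word."""
    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag

class ToyUnigramTagger:
    """Tags each word with its most frequent training tag, else backs off."""
    def __init__(self, tagged_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.model:
            return self.model[word]
        return self.backoff.tag_word(word) if self.backoff else None

    def tag(self, words):
        return [(w, self.tag_word(w)) for w in words]

# Made-up training data for illustration.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
tagger = ToyUnigramTagger(train, backoff=ToyDefaultTagger("NN"))
tagged = tagger.tag(["the", "dog", "meows"])
print(tagged)
```

The unseen word "meows" receives the default tag "NN", which is wrong here; that is the usual trade-off of a default backoff, since it guarantees every word gets some tag.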
Chapter 5. Extracting Chunks • In this chapter, we will cover: • Chunking and chinking with regular expressions • Merging and splitting chunks with regular expressions • Expanding and removing chunks with regular expressions • Partial parsing with regular expressions • Training a tagger-based chunker • Classification-based chunking • Extracting named entities • Extracting proper noun chunks • Extracting location chunks • Training a named entity chunker • Perkins, Jacob (2010-11-09). Python Text Processing with NLTK 2.0 Cookbook (p. 111). Packt Publishing. Kindle Edition.
Chapter 6. Transforming Chunks and Trees • In this chapter, we will cover: • Filtering insignificant words • Correcting verb forms • Swapping verb phrases • Swapping noun cardinals • Swapping infinitive phrases • Singularizing plural nouns • Chaining chunk transformations • Converting a chunk tree to text • Flattening a deep tree • Creating a shallow tree • Converting tree nodes • Perkins, Jacob (2010-11-09). Python Text Processing with NLTK 2.0 Cookbook (p. 143). Packt Publishing. Kindle Edition.
Chapter 7. Text Classification • In this chapter, we will cover: • Bag of Words feature extraction • Training a naive Bayes classifier • Training a decision tree classifier • Training a maximum entropy classifier • Measuring precision and recall of a classifier • Calculating high information words • Combining classifiers with voting • Classifying with multiple binary classifiers • Perkins, Jacob (2010-11-09). Python Text Processing with NLTK 2.0 Cookbook (p. 167). Packt Publishing. Kindle Edition.
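Two items from this list are simple enough to sketch without any library: bag-of-words feature extraction (each word becomes a present/absent feature) and precision and recall for one label. The function names and the toy label lists below are assumptions made for illustration, not the cookbook's code or data.

```python
def bag_of_words(words):
    """Map each distinct word to True, ignoring order and counts."""
    return {word: True for word in words}

def precision_recall(reference, predicted, label):
    """Precision and recall for one label, given parallel lists of labels."""
    ref = {i for i, l in enumerate(reference) if l == label}
    pred = {i for i, l in enumerate(predicted) if l == label}
    true_positives = len(ref & pred)
    precision = true_positives / len(pred) if pred else None
    recall = true_positives / len(ref) if ref else None
    return precision, recall

features = bag_of_words(["the", "quick", "brown", "the"])

# Made-up gold labels and classifier output for four documents.
reference = ["pos", "pos", "pos", "neg"]
predicted = ["pos", "pos", "neg", "neg"]
precision, recall = precision_recall(reference, predicted, "pos")
print(features)
print(precision, recall)
```

Here every document predicted "pos" really is "pos", so precision is 1.0, but one of the three true "pos" documents was missed, so recall is 2/3.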
Chapter 8. Distributed Processing and Handling Large Datasets • In this chapter, we will cover: • Distributed tagging with execnet • Distributed chunking with execnet • Parallel list processing with execnet • Storing a frequency distribution in Redis • Storing a conditional frequency distribution in Redis • Storing an ordered dictionary in Redis • Distributed word scoring with Redis and execnet • Perkins, Jacob (2010-11-09). Python Text Processing with NLTK 2.0 Cookbook (p. 201). Packt Publishing. Kindle Edition.
Chapter 9. Parsing Specific Data • In this chapter, we will cover: • Parsing dates and times with Dateutil • Time zone lookup and conversion • Tagging temporal expressions with Timex • Extracting URLs from HTML with lxml • Cleaning and stripping HTML • Converting HTML entities with BeautifulSoup • Detecting and converting character encodings • Perkins, Jacob (2010-11-09). Python Text Processing with NLTK 2.0 Cookbook (p. 227). Packt Publishing. Kindle Edition.
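The chapter relies on lxml and BeautifulSoup for HTML work. As a rough stand-in, the sketch below uses only Python's standard `html` and `html.parser` modules to show the two underlying ideas: converting entities to characters and stripping tags. The names `TextExtractor` and `strip_html` are invented for this example, and real-world HTML cleaning usually needs more care (scripts, styles, whitespace).

```python
from html import unescape
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content and ignores all tags."""
    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

decoded = unescape("caf&eacute; &amp; tea")
plain = strip_html("<p>Fish &amp; <b>chips</b></p>")
print(decoded)
print(plain)
```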