
CS 479, Section 1: Natural Language Processing


Presentation Transcript


  1. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
  CS 479, Section 1: Natural Language Processing
  Lecture #17: Text Classification; Naïve Bayes
  Thanks to Dan Klein of UC Berkeley for many of the materials used in this lecture.

  2. Announcements
  • Reading Report #7
    - M&S 7
    - Due: today
  • Mid-term Exam
    - Next Thu-Sat
    - Review on Wed
    - Prepare your 5 questions
  • Project #2, Part 1
    - No pair programming; you may collaborate (must acknowledge); do your own work
    - Help session: Tuesday, CS Conference Room, 4pm
    - Early: Monday after the mid-term
    - Due: Wednesday after the mid-term
  • ACM Programming Contest
    - Saturday, Oct. 13
    - http://acm.byu.edu/competition

  3. Objectives
  • Introduce the problem of text classification
  • Introduce the Naïve Bayes model
  • Revisit log-domain arithmetic

  4. Overview
  • So far: n-gram language models
    - Model fluency for noisy-channel processes (ASR, MT, etc.)
    - No representation of language structure or meaning
  • Now: Naïve Bayes models
    - Introduce a single new global variable c (for the class label)
    - Model a (hidden) global property of text (the label)
    - Still a very simplistic model family

  5. Text Classification
  • Goal: classify documents into broad semantic classes (e.g., sports, entertainment, technology, politics, etc.)
  • Which one is the politics document?
  • And how much deep processing did that decision require?
  • Motivates an approach: bag-of-words, Naïve-Bayes models
  • Another approach in an upcoming lecture …
  Example documents:
  1. "Democratic vice presidential candidate John Edwards on Sunday accused President Bush and Vice President Dick Cheney of misleading Americans by implying a link between deposed Iraqi President Saddam Hussein and the Sept. 11, 2001 terrorist attacks."
  2. "While No. 1 Southern California and No. 2 Oklahoma had no problems holding on to the top two spots with lopsided wins, four teams fell out of the rankings — Kansas State and Missouri from the Big 12 and Clemson from the Atlantic Coast Conference and Oregon from the Pac-10."

  6. Naïve-Bayes Models
  • Idea: pick a class, then generate a document using a language model given that class.
  • What are the independence assumptions in this model?

  7. Naïve-Bayes Models
  • Naïve-Bayes assumption: all words are conditionally independent of one another given the class.
  • Compare to a unigram language model (in symbols below).
  • We have to smooth these!
  [Figure: graphical model with class node c over word nodes w1, w2, …, wn, where wn = STOP]
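  In symbols (the slide's equations are not reproduced in the transcript; these are the standard forms being contrasted):

      Naïve Bayes:  P(c, w_1, \dots, w_n) = P(c) \prod_{i=1}^{n} P(w_i \mid c)
      Unigram LM:   P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i)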

  8. Estimating Class Probability with Naïve Bayes
  • For a chosen set of classes, we have a joint model of class label and document.
  • We can easily compute the posterior probability of a class given a document (it's just a conditional query on the model).
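  The transcript omits the slide's formulas; the standard ones are, for a document d = w_1 … w_n:

      Joint model:  P(c, d) = P(c) \prod_{i=1}^{n} P(w_i \mid c)
      Posterior:    P(c \mid d) = P(c, d) / \sum_{c'} P(c', d)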

  9. Classifying with Naïve Bayes
  • Given document d, choose the most probable class:
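  The resulting decision rule (standard Naïve Bayes; the slide's equation is not in the transcript):

      \hat{c} = \arg\max_c P(c \mid d) = \arg\max_c P(c) \prod_{i=1}^{n} P(w_i \mid c)

  The normalizer P(d) is the same for every class, so it drops out of the argmax.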

  10. Log(arithmic) Domain Photo credit: Nathan Davis and Aaron Davis, Spring 2007, Google Campus, Mountain View, CA

  11. Classifying using Log Domain
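  In the log domain, the same rule becomes a sum, which avoids the floating-point underflow caused by multiplying many small probabilities:

      \hat{c} = \arg\max_c \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]

  A minimal Python sketch, assuming precomputed log_prior[c] and log_likelihood[c][w] tables (hypothetical names; a matching training sketch appears under slide 13):

      def classify(doc_tokens, log_prior, log_likelihood, classes, unk="<UNK>"):
          """Pick the class maximizing log P(c) + sum_i log P(w_i | c)."""
          best_class, best_score = None, float("-inf")
          for c in classes:
              score = log_prior[c]
              for w in doc_tokens:
                  # Words unseen in training fall back to the reserved <UNK> entry.
                  score += log_likelihood[c].get(w, log_likelihood[c][unk])
              if score > best_score:
                  best_class, best_score = c, score
          return best_class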

  12. Practical Matters
  • How easy is Naïve Bayes to train? To test?
  • What should we do with unknown words? (One common option is sketched below.)
  • Can work shockingly well for text classification (esp. in the wild).
  • How about NB for spam detection?
  • Can you use NB for word-sense disambiguation?
  • How can unigram models be so terrible for language modeling, but class-conditional unigram models work for text classification?
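  On the unknown-word question, one common option (an illustration, not necessarily the course's prescribed recipe) is add-one smoothing with an explicit UNK type:

      P(w \mid c) = (count(c, w) + 1) / (count(c) + |V| + 1)

  where V is the training vocabulary, every unseen word maps to UNK, and the extra +1 in the denominator reserves mass for UNK.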

  13. Insight for Project #2.1
  • Think of the local model terms P(w_i | c) as a class-dependent unigram model: one unigram distribution per class, as sketched below.
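  A minimal Python sketch of that idea: estimate one add-one-smoothed unigram distribution per class, plus a class prior. The function name train_nb and the smoothing choice are assumptions for illustration, not the project's required design:

      import math
      from collections import Counter, defaultdict

      def train_nb(labeled_docs):
          """Estimate log P(c) and smoothed log P(w | c) from (tokens, label) pairs."""
          class_counts = Counter()
          word_counts = defaultdict(Counter)
          vocab = set()
          for tokens, label in labeled_docs:
              class_counts[label] += 1
              word_counts[label].update(tokens)
              vocab.update(tokens)
          vocab.add("<UNK>")  # reserve a type for words unseen in training
          total_docs = sum(class_counts.values())
          log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}
          log_likelihood = {}
          for c in class_counts:
              denom = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing
              log_likelihood[c] = {w: math.log((word_counts[c][w] + 1) / denom)
                                   for w in vocab}
          return log_prior, log_likelihood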

  14. Proper Name Classification
  • Movie: Beastie Boys: Live in Glasgow
  • Person: Michelle Ford-Eriksson
  • Place: Ramsbury
  • Place: Market Bosworth
  • Drug: Dilotab
  • Drug: Cyanide Antidote Package
  • Person: Bill Johnson
  • Place: Ettalong
  • Movie: The Suicide Club
  • Place: Pézenas
  • Company: AMLI Residential Properties Trust
  • Drug: Diovan
  • Place: Bucknell
  • Movie: Marie, Nonna, la vierge et moi
  • Person: Chevy Chase
  [Figure: graphical model with class node c over character nodes c1, c2, …, cn; character-level evidence, i.e., "features"]
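  The same machinery transfers directly: treat a name as a "document" whose tokens are its characters. A toy usage of the hypothetical sketches above (the tiny training set is made up from the slide's examples):

      names = [("Ramsbury", "Place"), ("Diovan", "Drug"), ("Chevy Chase", "Person")]
      log_prior, log_likelihood = train_nb([(list(n), c) for n, c in names])
      print(classify(list("Bucknell"), log_prior, log_likelihood, log_prior.keys()))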

  15. Next
  • Read the project requirements
  • Start working through the tutorial
  • The mid-term exam covers up through using Naïve Bayes for classification
