SET(5) Prof. Dragomir R. Radev radev@umich.edu
SET Fall 2013 … Lecture 9: Text classification, Naïve Bayesian classifiers, Decision trees …
Introduction • Text classification: assigning documents to predefined categories: topics, languages, users • A given set of classes C • Given x, determine its class in C • Hierarchical vs. flat • Overlapping (soft) vs non-overlapping (hard)
Introduction • Ideas: manual classification using rules, e.g., (Columbia AND University) → Education; (Columbia AND “South Carolina”) → Geography • Popular techniques: generative (kNN, Naïve Bayes) vs. discriminative (SVM, regression) • Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x) • Discriminative: model p(y|x) directly.
Bayes formula: P(A|B) = P(B|A) P(A) / P(B) • Full (total) probability: P(B) = Σi P(B|Ai) P(Ai)
Example (performance-enhancing drug) • Drug (D) with values y/n • Test (T) with values +/− • P(D=y) = 0.001 • P(T=+|D=y) = 0.8 • P(T=+|D=n) = 0.01 • Given: an athlete tests positive • P(D=y|T=+) = P(T=+|D=y) P(D=y) / (P(T=+|D=y) P(D=y) + P(T=+|D=n) P(D=n)) = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.074
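The slide's arithmetic can be checked directly; a minimal sketch using the slide's numbers:

```python
# Worked Bayes-rule computation for the drug-test example above.
p_d = 0.001            # P(D=y): prior probability the athlete uses the drug
p_pos_given_d = 0.8    # P(T=+|D=y): probability of a positive test given use
p_pos_given_nd = 0.01  # P(T=+|D=n): false-positive rate

# Full (total) probability of a positive test
p_pos = p_pos_given_d * p_d + p_pos_given_nd * (1 - p_d)

# Bayes formula: P(D=y | T=+)
posterior = p_pos_given_d * p_d / p_pos
print(round(posterior, 3))  # → 0.074
```

Even with a quite accurate test, the posterior stays low because the prior P(D=y) is so small.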
Naïve Bayesian classifiers • Naïve Bayesian classifier • Assuming statistical independence • Features = words (or phrases) typically
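With the independence assumption, the classifier scores each class by its prior times the product of the per-feature conditionals; a standard way to write the decision rule (for a document with words w1 … wn):

```latex
\hat{c} = \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```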
Example • p(well)=0.9, p(cold)=0.05, p(allergy)=0.05 • p(sneeze|well)=0.1 • p(sneeze|cold)=0.9 • p(sneeze|allergy)=0.9 • p(cough|well)=0.1 • p(cough|cold)=0.8 • p(cough|allergy)=0.7 • p(fever|well)=0.01 • p(fever|cold)=0.7 • p(fever|allergy)=0.4 Example from Ray Mooney
Example (cont’d) • Features: sneeze, cough, no fever • P(well|e) = (.9) × (.1)(.1)(.99) / p(e) = 0.0089/p(e) • P(cold|e) = (.05) × (.9)(.8)(.3) / p(e) = 0.0108/p(e) • P(allergy|e) = (.05) × (.9)(.7)(.6) / p(e) = 0.019/p(e) • P(e) = 0.0089 + 0.0108 + 0.019 = 0.0387 • P(well|e) ≈ .23 • P(cold|e) ≈ .28 • P(allergy|e) ≈ .49 Example from Ray Mooney
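The same computation in code, using the probabilities from the slide (note that "no fever" contributes 1 − P(fever|class)); exact posteriors may differ slightly from the slide's rounded figures:

```python
# Reproducing the sneeze / cough / no-fever example above.
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
p_sneeze = {"well": 0.1, "cold": 0.9, "allergy": 0.9}
p_cough = {"well": 0.1, "cold": 0.8, "allergy": 0.7}
p_fever = {"well": 0.01, "cold": 0.7, "allergy": 0.4}

# Evidence e: sneeze, cough, NO fever -> use (1 - P(fever | class))
joint = {c: priors[c] * p_sneeze[c] * p_cough[c] * (1 - p_fever[c])
         for c in priors}
p_e = sum(joint.values())  # normalizer P(e)
posterior = {c: joint[c] / p_e for c in joint}
for c in posterior:
    print(c, round(posterior[c], 2))
```

Allergy wins despite its tiny prior because it explains all three observed features well.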
Issues with NB • Where do the priors come from? Use maximum-likelihood estimation: P(c) = Nc/N • Same for the conditionals: under the multinomial model the MLE is P(t|c) = Tct / Σt′ Tct′ • Smoothing is needed – why? A single unseen word would otherwise zero out the entire product • Laplace smoothing: P(t|c) = (Tct + 1) / Σt′ (Tct′ + 1) = (Tct + 1) / (Σt′ Tct′ + |V|) • Implementation: sum log probabilities to avoid floating-point underflow
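A minimal sketch putting these points together: multinomial Naïve Bayes with MLE priors, Laplace-smoothed conditionals, and log-space scoring to avoid underflow. The tiny spam/ham training set is hypothetical, just for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns MLE priors and
    Laplace-smoothed conditionals P(t|c) = (T_ct + 1) / (sum_t' T_ct' + |V|)."""
    class_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)  # label -> Counter over tokens
    vocab = set()
    for tokens, label in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    n = sum(class_counts.values())
    priors = {c: class_counts[c] / n for c in class_counts}  # Nc / N
    cond = {}
    for c in class_counts:
        total = sum(token_counts[c].values())
        cond[c] = {t: (token_counts[c][t] + 1) / (total + len(vocab))
                   for t in vocab}
    return priors, cond, vocab

def classify(tokens, priors, cond, vocab):
    # Sum log probabilities instead of multiplying raw ones: avoids underflow.
    scores = {}
    for c in priors:
        s = math.log(priors[c])
        for t in tokens:
            if t in vocab:
                s += math.log(cond[c][t])
        scores[c] = s
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only
docs = [(["cheap", "pills"], "spam"), (["meeting", "agenda"], "ham"),
        (["cheap", "meds"], "spam"), (["project", "meeting"], "ham")]
priors, cond, vocab = train_nb(docs)
print(classify(["cheap", "meds", "pills"], priors, cond, vocab))
```

Because only the argmax matters, working in log space changes nothing mathematically while keeping the scores in a numerically safe range.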
Spam recognition
Return-Path: <ig_esq@rediffmail.com>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <ig_esq@rediffmail.com>
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday
DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
SpamAssassin • http://spamassassin.apache.org/ • http://spamassassin.apache.org/tests_3_1_x.html
Feature selection: The χ² test • For a term t: C = class indicator, It = feature (term-presence) indicator; kci = number of documents with C=c and It=i, n = total number of documents • Testing for independence: P(C=0, It=0) should equal P(C=0) P(It=0) • P(C=0) = (k00+k01)/n • P(C=1) = 1 − P(C=0) = (k10+k11)/n • P(It=0) = (k00+k10)/n • P(It=1) = 1 − P(It=0) = (k01+k11)/n
Feature selection: The χ² test • High values of χ² indicate lower belief in independence, i.e., the term carries information about the class. • In practice, compute χ² for all words and pick the top k among them.
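A hedged sketch of the χ² statistic from the 2×2 term/class contingency table, using the kci notation from the slide (first index = class, second = feature indicator); the example counts are hypothetical:

```python
def chi_square(k00, k01, k10, k11):
    """Chi-square statistic for a 2x2 contingency table.
    k_ci = number of documents with class C=c and feature indicator I_t=i."""
    n = k00 + k01 + k10 + k11
    chi2 = 0.0
    # For each cell: observed count, its row (class) total, its column (feature) total
    for observed, row, col in [
        (k00, k00 + k01, k00 + k10),
        (k01, k00 + k01, k01 + k11),
        (k10, k10 + k11, k00 + k10),
        (k11, k10 + k11, k01 + k11),
    ]:
        expected = row * col / n  # expected count under independence
        chi2 += (observed - expected) ** 2 / expected
    return chi2

print(chi_square(10, 10, 10, 10))  # independent term: statistic is 0
print(chi_square(40, 10, 10, 40))  # strongly class-correlated term: large value
```

Running this over the vocabulary and sorting by the returned statistic gives the top-k feature set the slide describes.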
Feature selection: mutual information • No document-length scaling is needed • Documents are assumed to be generated according to the multinomial model • Measures the amount of information a word carries about the class: if the word’s distribution is the same as the background distribution, then MI = 0 • X = word (indicator); Y = class
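Mutual information can be computed from the same 2×2 counts as χ²; a minimal sketch (indexing as before: first index = class, second = word indicator, both assumptions carried over from the χ² slide):

```python
import math

def mutual_information(k00, k01, k10, k11):
    """I(X; Y) in bits between word indicator X and class Y,
    estimated from 2x2 contingency counts k_ci (class c, indicator i)."""
    n = k00 + k01 + k10 + k11
    mi = 0.0
    # For each cell: joint count, class-marginal count, word-marginal count
    for joint, k_class, k_word in [
        (k00, k00 + k01, k00 + k10),
        (k01, k00 + k01, k01 + k11),
        (k10, k10 + k11, k00 + k10),
        (k11, k10 + k11, k01 + k11),
    ]:
        if joint > 0:
            # p(x,y) * log2( p(x,y) / (p(x) p(y)) )
            mi += (joint / n) * math.log2(joint * n / (k_class * k_word))
    return mi

print(mutual_information(10, 10, 10, 10))  # same as background: MI = 0
print(mutual_information(40, 10, 10, 40))  # class-informative word: MI > 0
```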
Well-known datasets • 20 newsgroups • http://qwone.com/~jason/20Newsgroups/ • Reuters-21578 • http://www.daviddlewis.com/resources/testcollections/reuters21578/ • Categories: grain, acquisitions, corn, crude, wheat, trade… • WebKB • http://www-2.cs.cmu.edu/~webkb/ • Classes: course, student, faculty, staff, project, dept, other • NB performance (2000), per class in that order: • P = 26, 43, 18, 6, 13, 2, 94 • R = 83, 75, 77, 9, 73, 100, 35
Evaluation of text classification • Microaveraging – pool the per-class contingency tables into one table, then compute the metric once • Macroaveraging – compute the metric separately for each class, then average over classes
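The distinction is easiest to see in code; a minimal sketch with hypothetical per-class counts (tp, fp, fn), where the large class dominates the microaverage while the macroaverage weights both classes equally:

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical per-class contingency counts: (tp, fp, fn)
classes = {"grain": (50, 10, 5), "trade": (5, 5, 40)}

# Macroaveraging: compute F1 per class, then average the scores
macro_f1 = sum(f1(*c) for c in classes.values()) / len(classes)

# Microaveraging: pool the counts into one table, then compute F1 once
tp = sum(c[0] for c in classes.values())
fp = sum(c[1] for c in classes.values())
fn = sum(c[2] for c in classes.values())
micro_f1 = f1(tp, fp, fn)

print(round(macro_f1, 3), round(micro_f1, 3))
```

With these counts the microaverage exceeds the macroaverage, since the well-classified large class dominates the pooled table; macroaveraging exposes poor performance on small classes.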
Readings • MRS18 • MRS17, MRS19