
Text Classification


Presentation Transcript


  1. Text Classification Elnaz Delpisheh Introduction to Computational Linguistics York University Department of Computer Science and Engineering January 5, 2020

  2. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • Subjective text classification

  3. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  4. Text Classification-Definition • Text classification is the assignment of text documents to one or more predefined categories based on their content. • The classifier: • Input: a set of m hand-labeled documents (x1, y1), ..., (xm, ym) • Output: a learned classifier f: x → y • [Diagram: text documents fed into the classifier, which assigns each to Class A, Class B, or Class C]
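This input/output contract can be written down directly. Below is a minimal illustrative sketch in Python (not from the slides), with a trivial majority-class baseline standing in for a real learning algorithm:

```python
# Hypothetical sketch (not from the slides): the classifier-learning interface.
# train() consumes hand-labeled pairs (x_i, y_i) and returns a function f: x -> y.
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs; returns a trivial baseline f."""
    # Baseline: always predict the most frequent label seen in the training data.
    majority = Counter(y for _, y in labeled_docs).most_common(1)[0][0]
    def f(x):
        return majority
    return f

f = train([("wheat prices rose", "grain"), ("stocks fell sharply", "non-grain"),
           ("maize exports registered", "grain")])
print(f("argentine grain board figures"))  # -> 'grain'
```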

  5. Text Classification-Applications • Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, Other. • Classify business names by industry. • Classify student essays as A, B, C, D, or F. • Classify email as Spam, Other. • Classify email to tech staff as Mac, Windows, ..., Other. • Classify pdf files as ResearchPaper, Other. • Classify documents as WrittenByReagan, GhostWritten. • Classify movie reviews as Favorable, Unfavorable, Neutral. • Classify technical papers as Interesting, Uninteresting. • Classify jokes as Funny, NotFunny. • Classify web sites of companies by Standard Industrial Classification (SIC) code.

  6. Text Classification-Example • Best-studied benchmark: Reuters-21578 newswire stories • 9603 train, 3299 test documents, 80-100 words each, 93 classes • ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS • BUENOS AIRES, Feb 26 • Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets: • Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). • Maize Mar 48.0, total 48.0 (nil). • Sorghum nil (nil) • Oilseed export registrations were: • Sunflowerseed total 15.0 (7.9) • Soybean May 20.0, total 20.0 (nil) • The board also detailed export registrations for subproducts, as follows.... Categories: grain, wheat (of 93 binary choices)

  7. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  8. Text Classification-Representing Texts • f(x) = y, where x is a document such as the ARGENTINE GRAIN/OILSEED REGISTRATIONS story shown on the previous slide. • What is the best representation for the document x being classified? What is the simplest useful one?

  9. Pre-processing the Text • Removing stop words • Punctuation, prepositions, pronouns, etc. • Stemming • walk, walker, walked, walking → walk • Indexing • Dimensionality reduction
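The slide names the pre-processing steps but gives no code; below is a minimal illustrative sketch of stop-word removal and stemming, using a toy stop list and a deliberately crude suffix stripper (a real system would use a proper stemmer such as Porter's):

```python
# Illustrative sketch only: lowercase, strip punctuation, drop stop words,
# and apply a crude stemmer.
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "their", "for", "is"}  # toy list

def crude_stem(word):
    # Naive suffix stripping; a real system would use e.g. the Porter stemmer.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())          # strip punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]       # remove stop words
    return [crude_stem(t) for t in tokens]                    # stem

print(preprocess("Argentine grain board figures show crop registrations of grains"))
# -> ['argentine', 'grain', 'board', 'figure', 'show', 'crop', 'registration', 'grain']
```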

  10. Representing Text: A List of Words • f(x) = y, where x is the document reduced to its list of tokens: • (argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, …)

  11. Pre-processing the Text-Indexing • Using the vector space model: each document is represented as a vector of term weights, one dimension per indexing term

  12. Indexing (Cont.)

  13. Indexing (Cont.) • tfc-weighting • It considers the normalized length of documents (M). • ltc-weighting • It considers the logarithm of the word frequency to reduce the effect of large differences in frequencies. • Entropy weighting
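As a rough sketch of the tfc- and ltc-style schemes, assuming the standard SMART-style definitions (raw tf × idf vs. log tf × idf, each cosine-normalized); the exact formulas on the slide may differ:

```python
# Hedged sketch of tfc- and ltc-style weighting (standard SMART-style forms).
import math

def weight_document(tf, df, n_docs, scheme="tfc"):
    """tf: {term: raw count in this document}; df: {term: document frequency}."""
    weights = {}
    for term, count in tf.items():
        local = count if scheme == "tfc" else 1.0 + math.log(count)  # ltc: log tf
        weights[term] = local * math.log(n_docs / df[term])           # times idf
    norm = math.sqrt(sum(w * w for w in weights.values()))            # cosine norm
    return {t: w / norm for t, w in weights.items()} if norm else weights

print(weight_document({"grain": 3, "board": 1}, {"grain": 20, "board": 5},
                      n_docs=100, scheme="ltc"))
```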

  14. Indexing-Word Frequency Weighting • If the order of words doesn't matter, x can be a vector of word frequencies. • "Bag of words": a long sparse vector x = (…, fi, …) where fi is the frequency of the i-th word in the vocabulary. • Example: the ARGENTINE GRAIN/OILSEED REGISTRATIONS story reduced to per-word frequency counts. Categories: grain, wheat
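A one-liner suffices to build such a frequency vector; this short sketch (illustrative only) uses Python's Counter as the sparse vector:

```python
# Illustrative sketch: turn a token list into a sparse bag-of-words frequency vector.
from collections import Counter

tokens = ["argentine", "grain", "board", "figures", "show", "crop",
          "registrations", "grains", "oilseeds", "grain"]
bag = Counter(tokens)                              # sparse vector: word -> frequency f_i
print(bag["grain"], bag["board"], bag["wheat"])    # 2 1 0 (absent words have weight 0)
```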

  15. Pre-processing the Text-Dimensionality Reduction • Feature selection: It attempts to remove non-informative words. • Document frequency thresholding • Information gain • Latent Semantic Indexing • Singular Value Decomposition
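A brief illustrative sketch of two of these techniques, document frequency thresholding and the truncated SVD underlying Latent Semantic Indexing; the threshold k and the toy matrix are arbitrary assumptions:

```python
# Illustrative sketch (not from the slides) of two listed reduction techniques.
import numpy as np

# Document frequency thresholding: drop terms that occur in fewer than k documents.
def df_threshold(term_doc_counts, k=2):
    """term_doc_counts: {term: number of documents containing it}."""
    return {t for t, df in term_doc_counts.items() if df >= k}

print(df_threshold({"grain": 40, "prev": 1, "wheat": 25}, k=2))  # drops rare 'prev'

# Latent Semantic Indexing: truncated SVD of the term-document matrix A.
A = np.random.rand(100, 30)             # 100 terms x 30 documents (toy data)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5                                   # keep the k largest singular values
docs_lsi = (np.diag(s[:k]) @ Vt[:k]).T  # each document as a k-dimensional vector
print(docs_lsi.shape)                   # (30, 5)
```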

  16. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  17. Text Classification-Methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks

  18. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  19. Text Classification-Naïve Bayes • Represent document x as a list of words w1, w2, … • For each class y, build a probabilistic model Pr(X|Y=y) of "documents": Pr(X={argentine, grain, ...}|Y=wheat) = .... Pr(X={stocks, rose, in, heavy, ...}|Y=nonWheat) = .... • To classify, find the y which was most likely to generate x, i.e., which gives x the best score according to Pr(x|y) • f(x) = argmaxy Pr(x|y) · Pr(y)

  20. Text Classification-Naïve Bayes • How to estimate Pr(X|Y)? • Simplest useful process to generate a bag of words: • pick word 1 according to Pr(W|Y) • repeat for word 2, 3, .... • each word is generated independently of the others (which is clearly not true), but this assumption means Pr(w1…wn|Y) = Πi Pr(wi|Y) • How to estimate Pr(W|Y)?

  21. Text Classification-Naïve Bayes • How to estimate Pr(X|Y)? Estimate Pr(w|y) by looking at the data, with smoothing: Pr(W=w|Y=y) = (count of w in documents of class y + m·p) / (total word count for class y + m) • Terms: • This Pr(W|Y) is a multinomial distribution • This use of m and p is a Dirichlet prior for the multinomial

  22. Text Classification-Naïve Bayes • How to estimate Pr(X|Y)? • For instance: m = 1, p = 0.5 • This Pr(W|Y) is a multinomial distribution • This use of m and p is a Dirichlet prior for the multinomial

  23. Text Classification-Naïve Bayes • Putting this together: • for each document xi with label yi • for each word wij in xi • count[wij][yi]++ • count[yi]++ • count++ • to classify a new x = w1...wn, pick the y with top score: score(y) = log Pr(y) + Σi log Pr(wi|y) • key point: we only need counts for words that actually appear in document x
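A minimal runnable sketch of this counting scheme, using the m = 1, p = 0.5 Dirichlet smoothing from the previous slide; the class names and toy documents are invented for illustration:

```python
# Minimal Naive Bayes text classifier following the counting scheme above.
import math
from collections import defaultdict

class NaiveBayes:
    def __init__(self, m=1.0, p=0.5):
        self.m, self.p = m, p
        self.word_class_count = defaultdict(float)   # count[w][y]
        self.class_word_total = defaultdict(float)   # total words seen for class y
        self.class_doc_count = defaultdict(float)    # documents per class
        self.n_docs = 0.0

    def train(self, labeled_docs):
        for words, y in labeled_docs:
            self.class_doc_count[y] += 1
            self.n_docs += 1
            for w in words:
                self.word_class_count[(w, y)] += 1
                self.class_word_total[y] += 1

    def classify(self, words):
        def score(y):
            s = math.log(self.class_doc_count[y] / self.n_docs)      # log Pr(y)
            for w in words:                                          # only words in x
                pr_w = (self.word_class_count[(w, y)] + self.m * self.p) / \
                       (self.class_word_total[y] + self.m)           # smoothed Pr(w|y)
                s += math.log(pr_w)
            return s
        return max(self.class_doc_count, key=score)

nb = NaiveBayes()
nb.train([(["grain", "wheat", "export"], "wheat"),
          (["stocks", "rose", "heavy", "trading"], "nonWheat")])
print(nb.classify(["wheat", "export", "figures"]))   # -> 'wheat'
```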

  24. Naïve Bayes for SPAM filtering

  25. Naïve Bayes-Summary • Pros: • Very fast and easy-to-implement • Well-understood formally & experimentally • Cons: • Seldom gives the very best performance • “Probabilities” Pr(y|x) are not accurate • e.g., Pr(y|x) decreases with length of x • Probabilities tend to be close to zero or one

  26. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  27. The Curse of Dimensionality • How can you learn with so many features? • For efficiency (time & memory), use sparse vectors. • Use simple classifiers (linear or loglinear) • Rely on wide margins.

  28. Margin-based Learning • [Figure: two linearly separable clouds of + and − examples with a wide margin between them] • The number of features matters: but not if the margin is sufficiently wide and examples are sufficiently close to the origin (!!)

  29. Text Classification-Voted Perceptron • Text documents: X = <x1, x2, …, xk> • k = number of features • Two classes = {yes, no} (i.e., y = +1 or −1) • Weight vector W = <w1, w2, …, wk> • Objective: • Learn a weight vector W and a threshold θ • If W·X > θ: Yes • Otherwise: No • For each training example, if the prediction is • Correct: increment the vote count c of the current weight vector • Incorrect: there is a mistake, and a correction is made: a new weight vector W' = W + y·X is created and its vote count is set to 1
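A rough sketch of this training loop in the style of Freund and Schapire's voted perceptron; it is illustrative only and does not reproduce the slide's exact notation:

```python
# Rough sketch of voted perceptron training and voting, matching the update rule above.
import numpy as np

def train_voted_perceptron(X, y, epochs=5):
    """X: (n, k) feature matrix; y: labels in {+1, -1}. Returns [(weights, votes)]."""
    w = np.zeros(X.shape[1])
    perceptrons, c = [], 1
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:               # mistake
                perceptrons.append((w.copy(), c))
                w = w + yi * xi                   # correction: W' = W + y.X
                c = 1                             # new vector starts with one vote
            else:
                c += 1                            # correct: current vector earns a vote
    perceptrons.append((w, c))
    return perceptrons

def predict(perceptrons, x):
    # Each stored weight vector casts c votes for sign(w.x).
    return int(np.sign(sum(c * np.sign(w @ x) for w, c in perceptrons)))

X = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]])
y = np.array([1, 1, -1, -1])
model = train_voted_perceptron(X, y)
print(predict(model, np.array([0.8, 0.1])))   # -> 1
```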

  30. Voted Perceptron-Error Correction

  31. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  32. Lessons of the Voted Perceptron • The voted perceptron shows that you can make few mistakes while incrementally learning as you pass over the data, provided the examples x are small (norm bounded by R) and some separator u exists with a large margin. • Why not look for this separating line directly? • Support vector machines: • find u to maximize the margin.

  33. Text Classification-Support Vector Machines • Facts about support vector machines: • the "support vectors" are the xi's that touch the margin. • the classifier can be written as f(x) = sign(Σi αi yi (xi · x) + b), where the xi's are the support vectors. • support vector machines often give very good results on topical text classification. • [Figure: + and − examples separated by a maximum-margin line; the examples lying on the margin are the support vectors]
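A quick sketch of a linear SVM text classifier; the use of scikit-learn and the toy corpus are assumptions rather than something named on the slides, but the pipeline (TF-IDF features into a linear SVM) matches the approach described here and on the later TF-IDF slide:

```python
# Illustrative sketch: linear SVM text classifier with TF-IDF features (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = ["argentine grain board crop registrations wheat",
               "stocks rose in heavy trading on wall street",
               "maize and sorghum export registrations",
               "shares fell as investors sold bank stocks"]
train_labels = ["grain", "markets", "grain", "markets"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["oilseed export registrations for wheat"]))  # -> ['grain']
```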

  34. Support Vector Machine Results

  35. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  36. Text Classification-Decision Trees • Objective of decision tree learning • Learn a decision tree from a set of training data • The decision tree can be used to classify new examples • Decision tree learning algorithms • ID3 (Quinlan, 1986) • C4.5 (Quinlan, 1993) • CART (Breiman, Friedman, et al., 1983) • etc.

  37. Decision Trees-CART • Creating the tree • Splitting the set of training vectors • The best splitter has the purest children • Diversity measure is entropy: Entropy(S) = −∑i=1..k P(Ci) log2 P(Ci), where S is the set of examples, k is the number of classes, and P(Ci) is the proportion of examples in S that belong to class Ci. • To find the best splitter, each component of the document vector is considered in turn. • This process is repeated until no set can be partitioned any further. • Each leaf is then assigned a class.
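To make the splitting criterion concrete, here is an illustrative sketch of the entropy measure and a naive single-threshold search for the best splitter (a rough CART-style step, not the full algorithm):

```python
# Illustrative sketch: entropy of a labeled set and a simple "best splitter" search.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """X: list of feature vectors; y: labels. Returns (feature index, threshold)."""
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            # Weighted entropy of the children: lower means purer split.
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best[:2]

X = [[2.0, 0.0], [1.5, 0.1], [0.0, 3.0], [0.2, 2.5]]
y = ["grain", "grain", "markets", "markets"]
print(entropy(y))        # 1.0 (two equally likely classes)
print(best_split(X, y))  # (0, 0.2): split on feature 0 at threshold 0.2
```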

  38. Decision Trees-Example

  39. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  40. TF-IDF Representation • The results above use a particular way to represent documents: bag of words with TF-IDF weighting • "Bag of words": a long sparse vector x = (…, fi, …) where fi is the "weight" of the i-th word in the vocabulary • for a word w that appears in DF(w) documents out of N in the collection, and appears TF(w) times in the document being represented, use the weight TF(w) · log(N / DF(w)) • also normalize all vector lengths (||x||) to 1
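A short illustrative sketch that computes these weights for one document against a small toy collection and then normalizes the vector to unit length:

```python
# Sketch: TF-IDF weights for one document against a toy collection, then unit-normalized.
import math
from collections import Counter

collection = [["grain", "wheat", "export"],
              ["stocks", "rose", "heavy"],
              ["wheat", "prices", "rose"]]
N = len(collection)
DF = Counter(w for doc in collection for w in set(doc))   # DF(w): docs containing w

doc = ["wheat", "wheat", "export", "rose"]
TF = Counter(doc)
x = {w: TF[w] * math.log(N / DF[w]) for w in TF}          # TF(w) * log(N / DF(w))
norm = math.sqrt(sum(v * v for v in x.values()))
x = {w: v / norm for w, v in x.items()}                   # ||x|| = 1
print(x)
```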

  41. TF-IDF Representation • TF-IDF representation is an old trick from the information retrieval community, and often improves performance of other algorithms: • K- nearest neighbor algorithm • Rocchio’s algorithm

  42. Text Classification-K-nearest Neighbor • The nearest neighbor algorithm • Objective • To classify a new object x, find the object in the training set that is most similar to it, then assign the category of this nearest neighbor. • Determine the largest similarity of x with any element in the training set. • Collect the subset of the training set with that highest similarity to x; the class of x is the majority class of this subset.
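An illustrative sketch of k-nearest-neighbor classification over already-weighted sparse word vectors, using cosine similarity and a majority vote among the k most similar training documents (k = 3 here is an arbitrary choice):

```python
# Illustrative sketch: k-nearest-neighbor classification with cosine similarity.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(x, training_set, k=3):
    """training_set: list of (vector, label); returns the majority label of the k nearest."""
    nearest = sorted(training_set, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [({"grain": 1.0, "wheat": 0.8}, "grain"),
         ({"wheat": 1.0, "export": 0.5}, "grain"),
         ({"stocks": 1.0, "rose": 0.7}, "markets"),
         ({"shares": 1.0, "fell": 0.6}, "markets")]
print(knn_classify({"wheat": 1.0, "grain": 0.4}, train, k=3))   # -> 'grain'
```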

  43. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  44. Text Classification-Rocchio’s Algorithm • Classify using the distance to the centroid of the documents from each class • Assign the test document to the class with maximum similarity
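An illustrative sketch of this centroid-based scheme: one centroid per class, cosine similarity to pick the most similar centroid (the toy vectors are invented for illustration):

```python
# Illustrative sketch of Rocchio-style classification: one centroid per class.
import numpy as np

def rocchio_train(X, y):
    """X: (n, d) array of document vectors; y: list of labels. Returns {label: centroid}."""
    return {label: X[[i for i, yi in enumerate(y) if yi == label]].mean(axis=0)
            for label in set(y)}

def rocchio_classify(centroids, x):
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda label: cos(centroids[label], x))

X = np.array([[1.0, 0.9, 0.0], [0.8, 1.0, 0.1], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]])
y = ["grain", "grain", "markets", "markets"]
centroids = rocchio_train(X, y)
print(rocchio_classify(centroids, np.array([0.9, 0.8, 0.0])))   # -> 'grain'
```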

  45. Support Vector Machine Results

  46. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  47. Text Classification-Neural Networks • A neural network text classifier is a collection of interconnected neurons that incrementally learns from its environment (the training set) to categorize documents. • A neuron: [Diagram: inputs X1…Xn with weights W1…Wn feed the weighted sum U = ∑XiWi, which is passed through an activation function F(U) to produce the outputs y1, y2, y3]
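A tiny sketch of the single neuron shown above: a weighted sum of the inputs passed through an activation function F (a sigmoid is assumed here; the slide does not specify one):

```python
# Tiny illustrative sketch of a single neuron: weighted sum followed by activation.
import math

def neuron(x, w, activation=lambda u: 1.0 / (1.0 + math.exp(-u))):
    u = sum(xi * wi for xi, wi in zip(x, w))   # U = sum_i X_i * W_i
    return activation(u)                       # y = F(U)

print(neuron([0.5, 1.0, 0.2], [0.4, -0.3, 0.8]))   # a value in (0, 1)
```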

  48. Text Classification-Neural Networks • Forward and backward propagation • [Diagram: a multilayer network with inputs IM1…IMn connected by weights uji to hidden nodes H1…Hm, which are connected by weights vj to outputs OS1(r), OS2(r), OS3(r); forward propagation computes the outputs, backward propagation passes the error back to adjust the weights]

  49. Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • subjective text classification

  50. Performance Evaluation • Performance of a classification algorithm can be evaluated in the following aspects • Predictive performance • How accurate is the learned model in prediction? • Complexity of the learned model • Run time • Time to build the model • Time to classify examples using the model • Here we focus on the predictive performance
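As an illustrative sketch of predictive performance, the following computes accuracy together with per-class precision, recall, and F1, measures commonly reported for text classifiers (the toy labels are invented):

```python
# Illustrative sketch of predictive-performance measures for one positive class.
def evaluate(y_true, y_pred, positive="grain"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(evaluate(["grain", "grain", "markets", "markets"],
               ["grain", "markets", "markets", "grain"]))
```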
