Text Categorization Karl Rees Ling 580 April 2, 2001
What is Text Categorization? • Classify the topic or theme of a document by categories/classes based on its content. • Humans can do this intuitively, but how do you teach a machine to classify a text?
Useful Applications • Filter a stream of news for a particular interest group. • Spam vs. Interesting mail. • Determining Authorship. • Poetry vs. Fiction vs. Essay, etc…
So, For Example: • Ch. 16 of Foundations of Statistical Natural Language Processing. • We want to create an agent that can give us the probability of this document belonging to certain categories: • P(Poetry) = .015 • P(Mathematics) = .198 • P(Artificial Intelligence) = .732 • P(Utterly Confusing) = .989
Training Sets • Corpus of documents for which we already know the category. • Essentially, we use this to teach a computer. It is like showing a child a few pictures of a dog and a few pictures of a cat, then pointing to your neighbor's pet and asking her/him what kind of animal it is.
Data Representation Model • Represent each object (document) in the training set in the form (x, c), where x is a vector of measurements and c is the class label. • In other words, each document is represented as a vector of potentially weighted word counts.
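The (x, c) representation can be sketched in a few lines of Python. This is only an illustration: the function name `to_vector`, the four-word vocabulary, and the whitespace tokenization are assumptions made for the example, not part of the slides.

```python
from collections import Counter

def to_vector(text, vocabulary):
    """Return raw word counts for `text`, one entry per vocabulary word."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary; a real system would select it from the training set.
vocabulary = ["vs", "mln", "profit", "loss"]
document = "Net 807,000 vs 2,858,000 Assets 510.2 mln vs 479.7 mln"

x = to_vector(document, vocabulary)  # vector of measurements
c = "earn"                           # class label known from the training set
example = (x, c)
```

Weighting the counts (for instance by scaling or taking logarithms) would replace the raw counts in `x` without changing the (x, c) form.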
Training Procedure • A procedure that selects a classifier from a family of classifiers (the model class). Typically the task is binary, with two classes c1 and c2, where c2 is NOT(c1). • For example, a linear classifier: • g(x) = w · x + w0 • x is the vector of word counts, w · x is the dot product of x and a vector of weights w (because we may attach more importance to certain words), and w0 is a threshold (bias) term. • Choose c1 for g(x) > 0, otherwise c2.
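The decision rule g(x) = w · x + w0 can be written directly in Python. The weights and threshold below are made-up values chosen for illustration; in practice the training procedure would choose them.

```python
def g(x, w, w0):
    """Linear score: dot product of weights and word counts, plus w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def classify(x, w, w0):
    """Choose c1 when g(x) > 0, otherwise c2."""
    return "c1" if g(x, w, w0) > 0 else "c2"

# Hypothetical parameters, not learned from any data.
w = [0.5, 0.25, 1.0]
w0 = -2.0

label_a = classify([2, 2, 0], w, w0)  # g = 1.0 + 0.5 + 0.0 - 2.0 = -0.5
label_b = classify([2, 2, 1], w, w0)  # g = 1.0 + 0.5 + 1.0 - 2.0 =  0.5
```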
Test Set • After training the classifier, we want to test its accuracy on a test set. • Accuracy = Number of Objects Correctly Classified / Number of Objects Examined. • Precision = Number of Objects Correctly Assigned to a Specific Category / Number of Objects Assigned to that Category. • Fallout = Number of Objects Incorrectly Assigned to a Category / Number of Objects NOT Belonging to that Category.
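The three measures can be computed from the counts of correct and incorrect assignments for one category. The sketch below assumes the gold labels and the classifier's decisions are given as lists of booleans (True means "in the category"); the function name and the sample data are invented for the example.

```python
def evaluate(gold, predicted):
    """Return (accuracy, precision, fallout) for a single category."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # correctly assigned
    fp = sum(1 for g, p in pairs if not g and p)      # incorrectly assigned
    tn = sum(1 for g, p in pairs if not g and not p)  # correctly rejected
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    fallout = fp / (fp + tn) if fp + tn else 0.0
    return accuracy, precision, fallout

# Hypothetical test set of 8 documents, 3 of which belong to the category.
gold      = [True, True, True,  False, False, False, False, False]
predicted = [True, True, False, True,  False, False, False, False]
accuracy, precision, fallout = evaluate(gold, predicted)
```

Here 6 of 8 documents are classified correctly, 2 of the 3 assigned documents belong to the category, and 1 of the 5 documents outside the category was assigned to it.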
Modeling • How should texts be represented? • Using all words leads to sparse statistics. • Some words are indicative of a label. • One approach: • For each label, collect all words in texts with that label. • Apply a chi-square (χ²) test to determine whether a word's co-occurrence with the label is due to chance. • Sort all words by their chi-square score and take the top n (say 20). • Idea is to select words that are correlated with a label. • Example: for the label earnings, words such as “profit.”
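The selection steps above can be sketched with a chi-square statistic over a 2×2 table (word present or absent, document in the category or not). Treat this as one plausible reading of the procedure rather than the exact method used for the slides; the function names and the toy documents are assumptions.

```python
def chi_square(docs, labels, word):
    """Chi-square score of `word` against a binary label (2x2 table)."""
    o11 = o12 = o21 = o22 = 0
    for words, in_category in zip(docs, labels):
        has_word = word in words
        if has_word and in_category:
            o11 += 1
        elif has_word:
            o12 += 1
        elif in_category:
            o21 += 1
        else:
            o22 += 1
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

def top_words(docs, labels, n):
    """Return the n words with the highest chi-square score."""
    vocab = sorted(set().union(*docs))  # sorted for a deterministic tie order
    return sorted(vocab, key=lambda w: chi_square(docs, labels, w),
                  reverse=True)[:n]

# Toy corpus: each document is a set of words; True = labeled "earnings".
docs = [{"profit", "net", "the"}, {"profit", "vs", "the"},
        {"wheat", "the"}, {"oil", "the"}]
labels = [True, True, False, False]
best = top_words(docs, labels, 1)
```

A word such as "the" that appears in every document scores 0, since its presence says nothing about the label. Note that the statistic also scores highly words that are strongly associated with the absence of the label.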
Reuters Collection • A common dataset in text classification is the Reuters collection: • Articles categorized into about 100 topics. • 9603 training examples, 3299 test examples. • Short texts, annotated with SGML. Available: http://www.research.att.com/~lewis/reuters21578.html
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5554" NEWID="11"> <DATE>26-FEB-1987 15:18:59.34</DATE> <TOPICS><D>earn</D></TOPICS> <TEXT> <TITLE>COBANCO INC <CBCO> YEAR NET</TITLE> <DATELINE> SANTA CRUZ, Calif., Feb 26 - </DATELINE><BODY>Shr 34 cts vs 1.19 dlrs Net 807,000 vs 2,858,000 Assets 510.2 mln vs 479.7 mln Deposits 472.3 mln vs 440.3 mln Loans 299.2 mln vs 327.2 mln Note: 4th qtr not available. Year includes 1985 extraordinary gain from tax carry forward of 132,000 dlrs, or five cts per shr. Reuter </BODY></TEXT></REUTERS>
For Reuters, Label = Earnings: • Format of document vector file • Each entry consists of 25 lines: • the document id • is the document in the training set (T) or in the evaluation set (E)? • is the document in the core training set (C) or in the validation set (V)? (X where this doesn't apply.) • is the document in the earnings category (Y) or not (N)? • feature weight for "vs" • feature weight for "mln" • feature weight for "cts" • feature weight for ";" • feature weight for "&" • feature weight for "000"
For Reuters, Label = Earnings • feature weight for "loss" • feature weight for "'" • feature weight for '"' (the double-quote character) • feature weight for "3" • feature weight for "profit" • feature weight for "dlrs" • feature weight for "1" • feature weight for "pct" • feature weight for "is" • feature weight for "s" • feature weight for "that" • feature weight for "net" • feature weight for "lt" • feature weight for "at" • semicolon (separator between entries)
Vector For Example Document: { docid 11 T C Y 5 5 3 3 3 4 0 0 0 4 0 3 2 0 0 0 0 3 2 0 ; }
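To make the entry format concrete, here is a small Python sketch that reads one entry into named fields. It assumes the 20 weights follow the feature order listed on the two preceding slides, that the ninth feature is the double-quote character, and that the literal token `docid` precedes the id as in the example; `parse_entry` and the dictionary keys are invented names.

```python
FEATURES = ["vs", "mln", "cts", ";", "&", "000", "loss", "'", '"', "3",
            "profit", "dlrs", "1", "pct", "is", "s", "that", "net", "lt", "at"]

def parse_entry(entry):
    """Parse one document-vector entry into a dictionary."""
    tokens = entry.strip("{} \n").split()
    if tokens and tokens[-1] == ";":      # drop the entry separator
        tokens = tokens[:-1]
    if tokens and tokens[0] == "docid":   # assumed literal prefix
        tokens = tokens[1:]
    doc_id, split, subset, label = tokens[:4]
    weights = [int(t) for t in tokens[4:]]
    if len(weights) != len(FEATURES):
        raise ValueError("expected %d feature weights" % len(FEATURES))
    return {
        "id": int(doc_id),
        "training": split == "T",   # T = training set, E = evaluation set
        "subset": subset,           # C = core training, V = validation, X = n/a
        "earnings": label == "Y",
        "weights": dict(zip(FEATURES, weights)),
    }

entry = "{ docid 11 T C Y 5 5 3 3 3 4 0 0 0 4 0 3 2 0 0 0 0 3 2 0 ; }"
doc = parse_entry(entry)
```

For the example document, the weights for "vs" and "mln" are both 5 while the weight for "profit" is 0, which matches the article body: it is full of "vs" and "mln" but never uses the word "profit".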
Classification Techniques • Decision Trees • Maximum Entropy Modeling • Perceptrons (Neural Networks) • K-Nearest Neighbor Classification (kNN) • Naïve Bayes • Support Vector Machines
Information • Measure of how much we “know” about an object, document, decision, etc… • At each successive node we have more information about the object’s classification.
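Information • p = Number of objects in a set that belong to a certain category. • n = Number of objects in a set that don’t belong to that category. • I = Measure of the amount of information still needed to classify an object in the set; it is 0 when the set contains only one category, so a lower I means we know more. • $I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$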
Information Gain • The amount of Information we gain from making a decision. • Each decision splits the set into new sets, each with its own Information value. A useful decision leaves these sets closer to a single category than the previous set, so less Information is still needed to classify an object; that reduction is the Information Gain. We build our tree so that the decisions with the highest gain come first.
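Information Gain • Gain(A) = I( p/(p+n) , n/(p+n) ) - Remainder(A) • A is the decision (attribute) being tested; it splits the set into subsets i = 1…v, where subset i has p_i objects in the category and n_i objects outside it. • Remainder(A) is the weighted average of the information in the resulting sets i = 1…v: $\mathrm{Remainder}(A) = \sum_{i=1}^{v} \frac{p_i+n_i}{p+n}\, I\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$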
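A short Python sketch of the gain computation, assuming I is the usual two-class entropy measured in bits and that Remainder(A) weights each resulting set by its share of the objects. The function names and the example counts are illustrative.

```python
from math import log2

def information(p, n):
    """Two-class entropy in bits for p positive and n negative objects."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # the term for an empty class is taken as 0
            fraction = count / total
            result -= fraction * log2(fraction)
    return result

def gain(p, n, subsets):
    """Information gain of a decision that splits (p, n) into `subsets`.

    `subsets` is a list of (p_i, n_i) pairs, one per resulting set.
    """
    remainder = sum((pi + ni) / (p + n) * information(pi, ni)
                    for pi, ni in subsets)
    return information(p, n) - remainder

perfect = gain(6, 6, [(6, 0), (0, 6)])  # split separates the classes fully
useless = gain(6, 6, [(3, 3), (3, 3)])  # split leaves the mix unchanged
partial = gain(6, 6, [(4, 0), (2, 6)])  # somewhere in between
```

A perfect split recovers the full 1 bit of a balanced set and a useless split gains nothing, so ranking decisions by gain puts the most discriminating tests near the root.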
Decision Trees • At the bottom of our trees, we have leaf nodes. At each of these nodes, we compute the percentage of objects belonging to the node that fit into the category we are looking at. If it is greater than a certain percentage (say 50%), we say that all documents that fit into this node are in this category. Hopefully, though, the tree will give us more confidence than 50%.
Pruning • After growing a tree, we want to prune it down to a smaller size. We may want to get rid of nodes/decisions that don’t contribute any significant information (possibly nodes 6 and 7 in our example). We also want to get rid of decisions that are based on possibly insignificant details. These “overfit” the training set. For example, if there is only one document in the set that has both dlrs and pct, and it is in the earnings category, it would probably be a mistake to assume that all such documents are in earnings.
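Bagging / Boosting • There are many different ways to prune, and many criteria besides Information Gain for growing a decision tree. • Bagging and Boosting both generate many decision trees and combine the results of running an object through each of them (by averaging or voting). Bagging trains each tree on a random resample of the training set; Boosting trains trees in sequence, giving more weight to the examples that earlier trees misclassified.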
Maximum Entropy Modeling • Consult Slides at http://www.cs.jhu.edu/~hajic/courses/cs465/cs46520/ppframe.htm for more information about Maximum Entropy.
kNN • Basic idea: • Keep training set in memory. • Define a similarity metric. • At classification time, match unseen example against all examples in memory. • Select the k best matches. • Predict the unseen example's label as the majority label of the k retrieved examples. • Example similarity metric:
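The basic idea fits in a few lines of Python. Cosine similarity between count vectors is used here purely as an assumed stand-in for the similarity metric, since the slide's own metric is not shown; the function names and the toy training set are also invented.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_classify(x, training, k):
    """Label x with the majority label of its k most similar examples."""
    ranked = sorted(training, key=lambda ex: cosine(x, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set kept in memory: (vector, label) pairs.
training = [([5, 5, 0], "earn"), ([4, 3, 0], "earn"),
            ([0, 1, 4], "other"), ([0, 0, 5], "other"), ([1, 0, 3], "other")]

label_a = knn_classify([3, 4, 0], training, k=3)
label_b = knn_classify([0, 0, 2], training, k=3)
```

All the work happens at classification time: every query is compared against every stored example, which is why testing can be slow and memory use large.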
kNN • Many variants on kNN. • Underlying idea is that abstraction (rules, parameters, etc.) is likely to lose information. • No abstraction: case-based reasoning. • Training is fast; testing can be slow. • Potentially large memory requirements.
Naïve Bayes • Assumes that features are independent of each other given the label: $P(A \mid l) = \prod_{i=1}^{n} P(A_i \mid l)$ • Here A is a document consisting of features A1 … An. • l is the document label. • Fast training, fast evaluation. • Good when features are independent.
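A minimal Naïve Bayes sketch in Python, treating each word of a document as a feature. Add-one smoothing, log probabilities, and ignoring words never seen in training are choices made for this example and are not taken from the slides; the function names and toy data are likewise assumptions.

```python
from collections import Counter, defaultdict
from math import log

def train(examples):
    """Count labels and per-label words from (words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(words, model):
    """Return the label maximizing log P(l) + sum of log P(A_i | l)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word in vocab:  # skip words never seen in training
                score += log((word_counts[label][word] + 1)
                             / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [(["profit", "net", "vs"], "earn"), (["loss", "net", "cts"], "earn"),
            (["wheat", "crop", "export"], "other"),
            (["oil", "price", "export"], "other")]
model = train(examples)
label_a = classify(["net", "profit"], model)
label_b = classify(["export", "oil"], model)
```

Training is a single counting pass and classification is one sum per label, which is where the "fast training, fast evaluation" property comes from.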
Support Vector Machines • SVMs are an interesting new classifier: • Similar to kNN. • Similar (ish) to maxent. • Idea: • Transform examples into a new space where they can be linearly separated. • Group examples into regions that all share the same label. • Base grouping in terms of training items (support vectors) that lie on the boundary. • Best grouping found automatically.