Text Categorization
Rong Jin
Text Categorization
• Pre-given categories and labeled document examples (the categories may form a hierarchy)
• Classify new documents
• A standard supervised learning problem
[Diagram: a categorization system assigns incoming documents to categories such as Sports, Business, Education, and Science]
Spam Filtering
• Two categories: spam or ham
• Automatically decide the category of each incoming email
Text Categorization in IR
• Many search engine functions are based on TC:
• Language identification (e.g., English vs. French)
• Detecting spam pages (spam vs. non-spam)
• Detecting sexually explicit content (sexually explicit vs. not)
• Sentiment detection: positive vs. negative review
• Vertical search: restrict search to a “vertical” such as “related to health” (relevant to the vertical vs. not)
Text Categorization (TC)
• Given:
• A fixed set of categories C = {c1, c2, . . . , cJ}; the categories are human-defined for the needs of an application (e.g., spam vs. non-spam)
• A set of labeled documents (i.e., the training data)
• Predict the categories of new documents (i.e., the test documents); a sketch of this setup follows
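A minimal sketch of this setup, assuming scikit-learn is available; the tiny corpus, the category names, and the variable names (train_docs, X_train, X_test) are illustrative, not part of the slides.

```python
# Labeled training documents: (text, category) pairs.
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    ("the team won the championship game", "Sports"),
    ("stocks fell as the market opened", "Business"),
    ("the university announced new courses", "Education"),
]
texts, labels = zip(*train_docs)

# Turn each document into a tf-idf vector (bag-of-words features).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts)

# A new (test) document is mapped into the same feature space;
# the classifier's job is to predict its category.
X_test = vectorizer.transform(["coach praises players after the match"])
```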
Text Categorization (TC)
[Diagram: labeled training documents (“Given”) and category predictions for new documents (“Prediction”)]
K-Nearest Neighbor Classifier
[Illustration: a test document and its nearest training documents, shown for K = 1 and K = 4]
• Keep all training examples
• Find the K training examples most similar to the new document (its “nearest neighbor” documents)
• Assign the category that is most common among these nearest neighbors (the neighbors vote for the category); a sketch of this rule follows
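A minimal sketch of the voting rule above, reusing X_train, labels, and X_test from the previous sketch; knn_predict is a hypothetical helper, and cosine similarity over tf-idf vectors is one common choice of document similarity.

```python
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

def knn_predict(X_train, labels, x_test, k=3):
    # Similarity between the test document and every training document.
    sims = cosine_similarity(x_test, X_train).ravel()
    # Indices of the k most similar ("nearest neighbor") documents.
    nearest = sims.argsort()[::-1][:k]
    # Neighbors vote; the most common category wins.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(X_train, list(labels), X_test, k=3))
```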
K-Nearest Neighbor Classifier
• Implementation issue: searching for the nearest neighbors can be time-consuming when the number of training documents is large
• Efficiency can be improved with a text search engine: index the training documents together with their class labels, issue the test document as a query, and classify by the labels of the top-ranked results
[Diagram: a test document is submitted to a search engine over an index of labeled training documents; the top results, e.g., D1 (C1), D113 (C2), D1001 (C2), yield the prediction C2]
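A sketch of this idea with one substitution: a production system would index the training documents in a text search engine and issue the test document as a query, but here scikit-learn's NearestNeighbors index stands in for the search engine. Names again reuse the earlier sketch.

```python
from sklearn.neighbors import NearestNeighbors

# Build the "index" over the training documents once.
index = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X_train)

# "Search" the index with the test document and get the top-k doc ids.
_, neighbor_ids = index.kneighbors(X_test)
top_categories = [labels[i] for i in neighbor_ids[0]]
print(top_categories)  # a majority vote over these gives the prediction
```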
K-Nearest Neighbor Classifier
• Large K
• Small variance: the prediction is less sensitive to the particular set of training documents
• Large bias: the prediction is less sensitive to the content of the test document
K-Nearest Neighbor Classifier
• Small K
• Large variance: the prediction is sensitive to the particular set of training documents
• Small bias: the prediction is sensitive to the content of the test document
K-Nearest Neighbor Classifier
• Use cross validation to determine K:
• Split the labeled documents into a training set (80%) and a validation set (20%)
• For each K in a given range, predict the categories of the validation documents using the training documents, and compute the classification error (i.e., the percentage of validation documents that are misclassified)
• Choose the K with the smallest classification error
Cross Validation for K
• K = 1: error = 10
• K = 2: error = 5
• K = 3: error = 2
• K = 4: error = 4
• K = 5: error = 7
⇒ Choose K = 3
[Diagram: the labeled documents are split into a training set (80%) and a validation set (20%); the training set is used to predict the validation set]
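A minimal sketch of this selection procedure, assuming scikit-learn; X and y are placeholders for a reasonably sized labeled collection (a tf-idf matrix and its category labels, built as in the first sketch), and the candidate range for K is illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 80% / 20% split of the labeled documents.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

best_K, best_err = None, 1.0
for K in range(1, 6):
    clf = KNeighborsClassifier(n_neighbors=K, metric="cosine")
    clf.fit(X_tr, y_tr)
    # Classification error = fraction of validation docs misclassified.
    err = 1.0 - clf.score(X_val, y_val)
    if err < best_err:
        best_K, best_err = K, err

print("chosen K:", best_K)
```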