A Survey on Text Classification December 10, 2003 20033077 Dongho Kim KAIST
Contents • Introduction • Statistical Properties of Text • Feature Selection • Feature Space Reduction • Classification Methods • Using SVM and TSVM • Hierarchical Text Classification • Summary
Introduction • Text classification • Assign text to predefined categories based on content • Types of text • Documents (typical) • Paragraphs • Sentences • WWW-Sites • Different types of categories • By topic • By function • By author • By style
Computer-Based Text Classification Technologies • Naive word-matching (Chute, Yang, & Buntrock 1994) • Finding shared words between the text and the names of categories • Weakest method • Cannot capture any conceptual relation • Thesaurus-based matching (Lindberg & Humphreys 1990) • Using lexical links • Insensitive to context • High cost and low adaptivity across domains
Computer-Based Text Classification Technologies • Empirical learning of term-category associations • Learning from a training set • Fundamentally different from word-matching • Statistically capturing the semantic association between terms and categories • Context-sensitive mapping from terms to categories • For example, • Decision tree methods • Bayesian belief networks • Neural networks • Nearest neighbor classification methods • Least-squares regression techniques
Statistical Properties of Text • There are stable, language-independent patterns in how people use natural language • A few words occur very frequently; most occur rarely • In general • Top 2 words : 10~15% of all word occurrences • Top 6 words : 20% of all word occurrences • Top 50 words : 50% of all word occurrences • [Table: most common words from Tom Sawyer]
Statistical Properties of Text • The most frequent words in one corpus may be rare words in another corpus • Example : ‘computer’ in CACM vs. National Geographic • Each corpus has a different, fairly small “working vocabulary” These properties hold in a wide range of languages
Statistical Properties of Text • Summary : • Term usage is highly skewed, but in a predictable pattern • Why is it important to know the characteristics of text? • Optimization of data structures • Statistical retrieval algorithms depend on them
Statistical Profiles • Can act as a summarization device • Indicate what a document is about • Indicate what a collection is about
Zipf’s Law • Zipf’s Law relates a term’s frequency to its rank • Frequency ∝ 1 / rank • There is a constant k such that f · r = k • Rank the terms in a vocabulary by frequency, in descending order • Empirical observation : p_r ≈ c / r, where p_r is the relative frequency of the word with rank r • Hence : c ≈ 0.1 for English
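A minimal sketch (not from the survey) of how one might check Zipf's law empirically on any plain-text corpus; the file name is a placeholder:

```python
# Tabulate word frequencies and check that frequency * rank stays roughly constant,
# as Zipf's law predicts (about 0.1 * total occurrences for English).
import re
from collections import Counter

def zipf_table(text, top=10):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    print(f"{'rank':>4} {'word':<12} {'freq':>8} {'freq*rank':>10}")
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>4} {word:<12} {freq:>8} {freq * rank:>10}")

# zipf_table(open("tom_sawyer.txt", encoding="utf-8").read())   # file name is hypothetical
```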
Evaluation Metrics: Precision and Recall • Recall • Percentage of all relevant documents that are found by a search • Recall = |relevant ∩ retrieved| / |relevant| • Precision • Percentage of retrieved documents that are relevant • Precision = |relevant ∩ retrieved| / |retrieved|
Evaluation Metrics: F-measure • Harmonic mean of precision and recall : F = 2PR / (P + R) • Rewards results that keep recall and precision close together • R=40, P=60 : R/P average = 50, F-measure = 48 • R=45, P=55 : R/P average = 50, F-measure = 49.5
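As a quick illustration of the arithmetic above, a tiny helper for the harmonic mean:

```python
# Minimal helper reproducing the F-measure examples on the slide.
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.60, 0.40))   # 0.48  -> F-measure 48 for R=40, P=60
print(f_measure(0.55, 0.45))   # 0.495 -> F-measure 49.5 for R=45, P=55
```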
Evaluation Metrics: Break-Even Point • The point at which recall equals precision • Evaluation metric : the value at this point
Feature Selection: Term Weights, a Brief Introduction • The words of a text are not equally indicative of its meaning • Important: butterflies, monarchs, scientists, direction, compass • Unimportant : most, think, kind, sky, determine, cues, learn • Term weights reflect the (estimated) importance of each term • “Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north. Scientists think that butterflies may use other cues, such as the earth’s magnetic field, but we have a lot to learn about monarchs’ sense of direction.”
Feature Selection: Term Weights • Term frequency (TF) • The more often a word occurs in a document, the better that term describes what the document is about • Often normalized, e.g. by the length of the document • Sometimes biased to the range [0.4..1.0] to reflect that even a single occurrence of a term is a significant event
Feature Selection: Term Weights • Inverse document frequency (IDF) • Terms that occur in many documents in the collection are less useful for discriminating among documents • Document frequency (df) : number of documents containing the term • IDF often calculated as idf(t) = log(N / df(t)), where N is the number of documents in the collection • TF and IDF are used in combination as the product tf · idf
Feature Selection: Vector Space Similarity • Similarity is inversely related to the angle between the vectors • Cosine of the angle between the two vectors : cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖)
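The TF-IDF and cosine formulas above can be sketched directly from their definitions; this is an illustration with invented toy documents, not code from any of the cited papers:

```python
# TF-IDF weighting (normalized tf * log(N/df)) and cosine similarity over sparse vectors.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf*idf} dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}                       # idf = log(N / df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc) or 1
        vectors.append({t: (tf[t] / length) * idf[t] for t in tf})   # normalized tf * idf
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["monarch", "butterfly", "compass"],
        ["butterfly", "magnetic", "field"],
        ["stock", "market", "report"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```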
Feature Space Reduction • Main reasons • Improve accuracy of the algorithm • Decrease the size of the data set • Control the computation time • Avoid overfitting • Feature space reduction techniques • Stopword removal, stemming • Information gain • Natural language processing
Feature Space Reduction: Stopword Removal • Stopwords : words that are discarded from a document representation • Function words : a, an, and, as, for, in, of, the, to, … • About 400 words in English • Collection-specific frequent words : e.g. ‘Lotus’ in a Lotus support collection
Feature Space Reduction: Stemming • Group morphological variants • Plural : ‘streets’ → ‘street’ • Adverbs : ‘fully’ → ‘full’ • Other inflected word forms : ‘goes’ → ‘go’ • The grouping process is called “conflation” • Current stemming algorithms make mistakes • Conflating terms manually is difficult and time-consuming • Automatic conflation using rules • Porter Stemmer • Porter stemming example : ‘police’, ‘policy’ → ‘polic’
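A minimal sketch of stopword removal plus Porter stemming using NLTK (assumes nltk is installed and the stopword list has been downloaded; the exact stems produced depend on the stemmer implementation):

```python
# Stopword removal followed by Porter stemming (conflation) with NLTK.
# Requires: pip install nltk  and  nltk.download("stopwords")
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))   # roughly the function-word list mentioned above
stemmer = PorterStemmer()

def conflate(tokens):
    """Drop stopwords, then map each remaining token to its Porter stem."""
    return [stemmer.stem(t) for t in tokens if t.lower() not in STOP]

print(conflate(["the", "streets", "were", "fully", "policed"]))
```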
Feature Space Reduction: Information Gain • Measures the information obtained by the presence or absence of a term in a document • Feature space reduction by thresholding the information gain of each term • Biased toward common terms → a large reduction in the size of the data set cannot be achieved
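A sketch of information-gain scoring for a binary presence/absence feature, following the usual entropy-based definition (the toy documents and labels below are made up):

```python
# Information gain of a term = class entropy minus class entropy conditioned on
# whether the term is present; terms whose gain exceeds a threshold are kept.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """docs: list of token sets; labels: parallel list of class labels."""
    n = len(docs)
    classes = set(labels)
    prior = entropy([labels.count(c) / n for c in classes])
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            cond += (len(subset) / n) * entropy([subset.count(c) / len(subset) for c in classes])
    return prior - cond

docs = [{"ball", "goal"}, {"ball", "team"}, {"cpu", "ram"}, {"cpu", "disk"}]
labels = ["sports", "sports", "computers", "computers"]
print(information_gain(docs, labels, "ball"), information_gain(docs, labels, "goal"))
```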
Feature Space Reduction: Natural Language Processing • Pick out the important words from a document • For example, nouns, proper nouns, or verbs • Ignore all other parts of speech • Not biased toward common terms → reduction in both feature space and data size • Named entities • The subset of proper nouns consisting of people, locations, and organizations • Effective for news story classification
Robert Cooley, Classification of News Stories Using Support Vector Machines, Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999 Experimental Results • Data set • From six news media sources • Two print sources (New York Times and Associated Press Wire) • Two television sources (ABC World News Tonight and CNN Headline News) • Two radio sources (Public Radio International and Voice of America)
Robert Cooley, Classification of News Stories Using Support Vector Machines, Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999 Experimental Results • Results • NLP → significant loss in recall and precision • SVM >> kNN (using full text or information gain) • Binary weighting → significant loss in recall
Classification Methods: kNN • Stands for k-nearest neighbor classification • Algorithm • Given a test document, • Find the k nearest neighbors among the training documents • Calculate and sort the scores of candidate categories • Threshold these scores • Decision rule : score(d, c) = Σ over d′ ∈ kNN(d) of sim(d, d′) · y(d′, c); assign c if the score exceeds the category threshold
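A compact sketch of this kNN decision rule over sparse term-weight vectors (the similarity function and threshold are illustrative choices, not prescribed by the survey):

```python
# kNN category scoring: each of the k nearest training documents votes for its
# categories with weight equal to its similarity to the test document.
from collections import defaultdict

def dot_sim(u, v):
    """Similarity between two sparse {term: weight} vectors (cosine if pre-normalized)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def knn_classify(test_vec, training, k=5, threshold=0.1):
    """training: list of (doc_vector, set_of_categories); returns categories over threshold."""
    neighbors = sorted(training, key=lambda tr: dot_sim(test_vec, tr[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, cats in neighbors:
        s = dot_sim(test_vec, vec)
        for c in cats:
            scores[c] += s          # score(d, c) = sum of similarities of neighbors labeled c
    return {c for c, s in scores.items() if s >= threshold}

training = [({"ball": 1.0, "goal": 0.5}, {"sports"}),
            ({"cpu": 1.0, "ram": 0.5}, {"computers"})]
print(knn_classify({"ball": 0.8, "cpu": 0.1}, training, k=2))
```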
Classification Methods: LLSF • Stands for Linear Least Squares Fit • Obtain the matrix of word-category regression coefficients by LLSF : FLS = arg min over F of ‖FA − B‖², where A holds the training document (term) vectors and B the corresponding category vectors • FLS maps an arbitrary document vector to a vector of weighted categories • By thresholding, as in kNN, assign categories
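A small numpy illustration of the least-squares idea; the matrices are toy data and the documents-by-terms orientation is a convention chosen here for readability, not Yang's original code:

```python
# Fit a linear map F from term space to category space by least squares,
# then threshold the predicted category weights for a new document.
import numpy as np

# A: n_docs x n_terms matrix of term weights; B: n_docs x n_cats 0/1 category matrix
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])

F, *_ = np.linalg.lstsq(A, B, rcond=None)    # F minimizes ||A F - B||^2

new_doc = np.array([1.0, 0.0, 0.5])
category_scores = new_doc @ F                # weighted category vector for the new document
assigned = category_scores > 0.5             # thresholding, as with kNN
print(category_scores, assigned)
```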
Classification Methods: Naïve Bayes • Assumption • Words are drawn randomly from class-dependent lexicons (with replacement) • Word independence : P(w1, …, wn | Y) = ∏i P(wi | Y) • Result : P(Y | w1, …, wn) ∝ P(Y) ∏i P(wi | Y) • Classification rule : assign the class y that maximizes P(Y = y) ∏i P(wi | Y = y)
Naïve Bayes: Estimating the Parameters • Count frequencies in training data • Estimating P(Y) • Fraction of positive / negative examples in training data • Estimating P(W|Y) • Smoothing with the Laplace estimate : P(w | y) = (count(w, y) + 1) / (Σ over w′ of count(w′, y) + |V|)
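A self-contained multinomial Naive Bayes sketch using exactly these estimates (fraction-of-examples prior, Laplace-smoothed word probabilities); the toy corpus is invented:

```python
# Multinomial Naive Bayes with Laplace smoothing, in log space to avoid underflow.
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; returns (log P(y), log P(w|y), vocabulary)."""
    vocab = {w for d in docs for w in d}
    n = len(docs)
    log_prior, log_cond = {}, defaultdict(dict)
    for y in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == y]
        log_prior[y] = math.log(len(class_docs) / n)            # P(Y): fraction of examples
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        for w in vocab:                                          # Laplace: (count+1)/(total+|V|)
            log_cond[y][w] = math.log((counts[w] + 1) / (total + len(vocab)))
    return log_prior, log_cond, vocab

def classify_nb(doc, log_prior, log_cond, vocab):
    scores = {y: log_prior[y] + sum(log_cond[y][w] for w in doc if w in vocab)
              for y in log_prior}
    return max(scores, key=scores.get)

docs = [["ball", "goal"], ["team", "ball"], ["cpu", "ram"], ["disk", "cpu"]]
labels = ["sports", "sports", "computers", "computers"]
model = train_nb(docs, labels)
print(classify_nb(["ball", "goal", "cpu"], *model))
```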
Yiming Yang and Xin Liu, A re-examination of text categorization methods, Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, 1999. Experiment Results
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Text Classification using SVM • A statistical learning model of text classification with SVMs • The model bounds the expected error via the margin and the training error; the training-error term is 0 if the data are linearly separable
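For reference, a common modern way to reproduce this kind of setup is a TF-IDF plus linear-SVM pipeline in scikit-learn; this is not Joachims' SVM-light, just an illustrative sketch with made-up documents and labels:

```python
# TF-IDF features (high-dimensional, sparse document vectors) fed to a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["interest rates rise", "stocks fall on earnings",
         "midfielder scores twice", "team wins the cup"]
labels = ["finance", "finance", "sports", "sports"]

clf = make_pipeline(TfidfVectorizer(sublinear_tf=True), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["the striker scores", "rates and earnings"]))
```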
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Properties 1+2: Sparse Examples in High Dimension • High-dimensional feature vectors (30,000 features) • Sparse document vectors : only a few words of the whole language occur in each document • SVMs use overfitting protection which does not depend on the dimensionality of the feature space
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Property 3: Heterogeneous Use of Words • No pair of documents shares any words other than stopwords like ‘it’, ‘the’, ‘and’, ‘of’, ‘for’, ‘an’, ‘a’, ‘not’, ‘that’, ‘in’.
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Property 4: High Level of Redundancy • Few features are irrelevant! → feature space reduction causes loss of information
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Property 5: ‘Zipf’s Law’ Most words occur very infrequently!
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. TCat Concepts • Model of real text-classification tasks, used for the preceding analysis • TCat( [20:20:100], # high freq. [4:1:200], [1:4:200], [5:5:600], # medium freq. [9:1:3000], [1:9:3000], [10:10:4000] # low freq. ) • Each block [p:n:f] : a set of f features, from which each positive document draws p occurrences and each negative document draws n occurrences
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. TCat Concepts • Margin of TCat-concepts : TCat-concepts are linearly separable, with a lower bound on the margin γ • By Zipf's law, we can bound R², the squared length of the document vectors • Intuitively, many words with low frequency → relatively short document vectors → small R²/γ², hence linearly separable with a large relative margin
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. TCat Concepts • Bound on Expected Error of SVM : roughly of order R²/γ² divided by the number of training examples, so the small R²/γ² of TCat-concepts implies good generalization
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Text Classification using TSVM • How would you classify the test set? • Training set {D1, D6} • Test set {D2, D3, D4, D5}
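scikit-learn has no TSVM implementation, so the sketch below only imitates the transductive idea with a simple self-training loop over invented stand-in documents; it is not Joachims' TSVM algorithm:

```python
# Self-training stand-in for transduction: fit on the labeled documents, then
# repeatedly pseudo-label the unlabeled test documents and refit on everything,
# letting the test set influence the decision boundary.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

labeled = ["interest rates rise", "midfielder scores twice"]          # stand-ins for D1, D6
unlabeled = ["stocks fall on earnings", "bank cuts rates",
             "team wins the cup", "striker scores a goal"]            # stand-ins for D2..D5
y_l = np.array([0, 1])

vec = TfidfVectorizer()
X_all = vec.fit_transform(labeled + unlabeled)
X_l, X_u = X_all[:len(labeled)], X_all[len(labeled):]

clf = LinearSVC().fit(X_l, y_l)
for _ in range(3):                          # pseudo-label the test set and refit
    pseudo = clf.predict(X_u)
    clf.fit(X_all, np.concatenate([y_l, pseudo]))
print(clf.predict(X_u))
```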
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Why Does Adding Test Examples Reduce Error? • The unlabeled test documents reveal how words co-occur, so the TSVM can choose the hyperplane that separates these co-occurrence clusters with a large margin on training and test data together
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Experiment Results • Data set • Reuters-21578 dataset (ModApte split) • Training : 9,603, test : 3,299 • WebKB collection of WWW pages • Only the classes ‘course’, ‘faculty’, ‘project’, and ‘student’ are used • Stemming and stopword removal are not used • Ohsumed corpus compiled by William Hersh • Training : 10,000, test : 10,000
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Experiment Results • Results P/R-breakeven point for Reuters categories
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Experiment Results • Results Average P/R-breakeven point on WebKB Average P/R-breakeven point on Ohsumed
Hierarchical Text Classification • Real-world classification → complex hierarchical structure • Due to the difficulty of training for many classes or features • [Figure: documents assigned to Level-1 classes (Class 1, 2, 3, …), each split into Level-2 subclasses (Class 1-1, 1-2, 1-3, 2-1, …)]
Hierarchical Text Classification • More accurate, specialized classifiers • [Figure: documents → Computers (Hardware, Software, Chat) and Sports (Soccer, Football)] • ‘computer’ is discriminating at the top level (Computers vs. Sports), but not among the subcategories of Computers
S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263. Experiment Setting • Data set : LookSmart’s web directory • Using short summaries from the search engine • 370,597 unique pages • 17,173 categories • 7-level hierarchy • Focus on 13 top-level and 150 second-level categories
S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263. Experiment Setting • Using SVM • Posterior probabilities by regularized maximum likelihood fitting • Combining probabilities from the first and second level • Boolean scoring function : P(L1) && P(L2), i.e. both levels must exceed their thresholds, or • Multiplicative scoring function : P(L1) * P(L2), compared against a single threshold
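A tiny sketch of the two combination rules; the threshold values are illustrative, not the ones used in the paper:

```python
# Combine first-level and second-level posteriors with either a Boolean rule
# (both levels pass their thresholds) or a multiplicative rule (product passes one threshold).
def boolean_rule(p_l1, p_l2, t1=0.5, t2=0.5):
    return p_l1 >= t1 and p_l2 >= t2

def multiplicative_rule(p_l1, p_l2, t=0.25):
    return p_l1 * p_l2 >= t

print(boolean_rule(0.8, 0.4), multiplicative_rule(0.8, 0.4))
```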
S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263. Experiment Results • Non-hierarchical (baseline) : F1 = 0.476 • Hierarchical • Top-level • Training set : F1 = 0.649 • Test set : F1 = 0.572 • Second-level • Multiplicative : F1 = 0.495 • Boolean : F1 = 0.497 • Assuming top-level classification is correct, • F1 = 0.711
Summary • Feature space reduction can shrink the data set, but risks discarding relevant features • Performance of SVM and TSVM is better than that of the other methods • TSVM has merits in text classification • Hierarchical classification is helpful • Other issues • Sampling strategies • Other kinds of feature selection
Reference • T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998. • T. Joachims, Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the International Conference on Machine Learning (ICML), 1999. • T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. • Robert Cooley, Classification of News Stories Using Support Vector Machines. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence Text Mining Workshop, August 1999. • Yiming Yang and Xin Liu, A re-examination of text categorization methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999. • S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263.