A Survey on Text Classification December 10, 2003 20033077 Dongho Kim KAIST
Contents • Introduction • Statistical Properties of Text • Feature Selection • Feature Space Reduction • Classification Methods • Using SVM and TSVM • Hierarchical Text Classification • Summary
Introduction • Text classification • Assign text to predefined categories based on content • Types of text • Documents (typical) • Paragraphs • Sentences • WWW-Sites • Different types of categories • By topic • By function • By author • By style
Computer-Based Text Classification Technologies • Naive word-matching (Chute, Yang, & Buntrock 1994) • Finding shared words between the text and the names of categories • Weakest method • Cannot capture any conceptual relation • Thesaurus-based matching (Lindberg & Humphreys 1990) • Using lexical links • Insensitive to context • High cost and low adaptivity across domains
Computer-Based Text Classification Technologies • Empirical learning of term-category associations • Learning from a training set • Fundamentally different from word-matching • Statistically capturing the semantic association between terms and categories • Context-sensitive mapping from terms to categories • For example, • Decision tree methods • Bayesian belief networks • Neural networks • Nearest neighbor classification methods • Least-squares regression techniques
Statistical Properties of Text • There are stable, language-independent patterns in how people use natural language • A few words occur very frequently; most occur rarely • In general • Top 2 words : 10~15% of all word occurrences • Top 6 words : 20% of all word occurrences • Top 50 words : 50% of all word occurrences • [Table: most common words from Tom Sawyer]
Statistical Properties of Text • The most frequent words in one corpus may be rare words in another corpus • Example : ‘computer’ in CACM vs. National Geographic • Each corpus has a different, fairly small “working vocabulary” These properties hold in a wide range of languages
Statistical Properties of Text • Summary : • Term usage is highly skewed, but in a predictable pattern • Why is it important to know the characteristics of text? • Optimization of data structures • Statistical retrieval algorithms depend on them
Statistical Profiles • Can act as a summarization device • Indicate what a document is about • Indicate what a collection is about
Zipf’s Law • Zipf’s Law relates a term’s frequency to its rank • Frequency ∝ 1 / rank • There is a constant k such that f · r = k • Rank the terms in a vocabulary by frequency, in descending order • Empirical observation : p_r ≈ c / r, where p_r is the relative frequency of the word with rank r • Hence : c ≈ 0.1 for English
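A minimal sketch (not from the survey) of how one might check Zipf's law empirically on any plain-text corpus; the file name is a placeholder:

```python
# Tabulate word frequencies and check that frequency * rank stays roughly constant,
# as Zipf's law predicts (about 0.1 * total occurrences for English).
import re
from collections import Counter

def zipf_table(text, top=10):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    print(f"{'rank':>4} {'word':<12} {'freq':>8} {'freq*rank':>10}")
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>4} {word:<12} {freq:>8} {freq * rank:>10}")

# zipf_table(open("tom_sawyer.txt", encoding="utf-8").read())   # file name is hypothetical
```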
Evaluation Metrics: Precision and Recall • Recall • Percentage of all relevant documents that are found by a search • Recall = |relevant ∩ retrieved| / |relevant| • Precision • Percentage of retrieved documents that are relevant • Precision = |relevant ∩ retrieved| / |retrieved|
Evaluation Metrics: F-measure • Harmonic mean of precision and recall : F = 2PR / (P + R) • Rewards results that keep recall and precision close together • R=40, P=60 : R/P average = 50, F-measure = 48 • R=45, P=55 : R/P average = 50, F-measure = 49.5
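As a quick illustration of the arithmetic above, a tiny helper for the harmonic mean:

```python
# Minimal helper reproducing the F-measure examples on the slide.
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.60, 0.40))   # 0.48  -> F-measure 48 for R=40, P=60
print(f_measure(0.55, 0.45))   # 0.495 -> F-measure 49.5 for R=45, P=55
```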
Evaluation Metrics: Break-Even Point • The point at which recall equals precision • Evaluation metric : the value at this point
Feature Selection: Term Weights, a Brief Introduction • The words of a text are not equally indicative of its meaning • Important: butterflies, monarchs, scientists, direction, compass • Unimportant : most, think, kind, sky, determine, cues, learn • Term weights reflect the (estimated) importance of each term • “Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north. Scientists think that butterflies may use other cues, such as the earth’s magnetic field, but we have a lot to learn about monarchs’ sense of direction.”
Feature Selection: Term Weights • Term frequency (TF) • The more often a word occurs in a document, the better that term describes what the document is about • Often normalized, e.g. by the length of the document • Sometimes biased to the range [0.4..1.0] to reflect that even a single occurrence of a term is a significant event
Feature Selection: Term Weights • Inverse document frequency (IDF) • Terms that occur in many documents in the collection are less useful for discriminating among documents • Document frequency (df) : number of documents containing the term • IDF often calculated as idf(t) = log(N / df(t)), where N is the number of documents in the collection • TF and IDF are used in combination as the product tf · idf
Feature Selection: Vector Space Similarity • Similarity is inversely related to the angle between the vectors • Cosine of the angle between the two vectors : cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖)
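The TF-IDF and cosine formulas above can be sketched directly from their definitions; this is an illustration with invented toy documents, not code from any of the cited papers:

```python
# TF-IDF weighting (normalized tf * log(N/df)) and cosine similarity over sparse vectors.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf*idf} dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}                       # idf = log(N / df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc) or 1
        vectors.append({t: (tf[t] / length) * idf[t] for t in tf})   # normalized tf * idf
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["monarch", "butterfly", "compass"],
        ["butterfly", "magnetic", "field"],
        ["stock", "market", "report"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```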
Feature Space Reduction • Main reasons • Improve accuracy of the algorithm • Decrease the size of the data set • Control the computation time • Avoid overfitting • Feature space reduction techniques • Stopword removal, stemming • Information gain • Natural language processing
Feature Space Reduction: Stopword Removal • Stopwords : words that are discarded from a document representation • Function words : a, an, and, as, for, in, of, the, to, … • About 400 words in English • Collection-specific frequent words : e.g. ‘Lotus’ in a Lotus support collection
Feature Space Reduction: Stemming • Group morphological variants • Plural : ‘streets’ → ‘street’ • Adverbs : ‘fully’ → ‘full’ • Other inflected word forms : ‘goes’ → ‘go’ • The grouping process is called “conflation” • Current stemming algorithms make mistakes • Conflating terms manually is difficult and time-consuming • Automatic conflation using rules • Porter Stemmer • Porter stemming example : ‘police’, ‘policy’ → ‘polic’
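A minimal sketch of stopword removal plus Porter stemming using NLTK (assumes nltk is installed and the stopword list has been downloaded; the exact stems produced depend on the stemmer implementation):

```python
# Stopword removal followed by Porter stemming (conflation) with NLTK.
# Requires: pip install nltk  and  nltk.download("stopwords")
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))   # roughly the function-word list mentioned above
stemmer = PorterStemmer()

def conflate(tokens):
    """Drop stopwords, then map each remaining token to its Porter stem."""
    return [stemmer.stem(t) for t in tokens if t.lower() not in STOP]

print(conflate(["the", "streets", "were", "fully", "policed"]))
```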
Feature Space Reduction: Information Gain • Measures the information obtained by the presence or absence of a term in a document • Feature space reduction by thresholding the information gain of each term • Biased toward common terms → a large reduction in the size of the data set cannot be achieved
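A sketch of information-gain scoring for a binary presence/absence feature, following the usual entropy-based definition (the toy documents and labels below are made up):

```python
# Information gain of a term = class entropy minus class entropy conditioned on
# whether the term is present; terms whose gain exceeds a threshold are kept.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """docs: list of token sets; labels: parallel list of class labels."""
    n = len(docs)
    classes = set(labels)
    prior = entropy([labels.count(c) / n for c in classes])
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            cond += (len(subset) / n) * entropy([subset.count(c) / len(subset) for c in classes])
    return prior - cond

docs = [{"ball", "goal"}, {"ball", "team"}, {"cpu", "ram"}, {"cpu", "disk"}]
labels = ["sports", "sports", "computers", "computers"]
print(information_gain(docs, labels, "ball"), information_gain(docs, labels, "goal"))
```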
Feature Space Reduction: Natural Language Processing • Pick out the important words from a document • For example, nouns, proper nouns, or verbs • Ignore all other parts of speech • Not biased toward common terms → reduction in both feature space and data size • Named entities • The subset of proper nouns consisting of people, locations, and organizations • Effective for news story classification
Robert Cooley, Classification of News Stories Using Support Vector Machines, Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999 Experimental Results • Data set • From six news media sources • Two print sources (New York Times and Associated Press Wire) • Two television sources (ABC World News Tonight and CNN Headline News) • Two radio sources (Public Radio International and Voice of America)
Robert Cooley, Classification of News Stories Using Support Vector Machines, Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999 Experimental Results • Results • NLP → significant loss in recall and precision • SVM >> kNN (using full text or information gain) • Binary weighting → significant loss in recall
Classification Methods: kNN • Stands for k-nearest neighbor classification • Algorithm • Given a test document, • Find the k nearest neighbors among the training documents • Calculate and sort the scores of candidate categories • Threshold these scores • Decision rule : score(d, c) = Σ over d′ ∈ kNN(d) of sim(d, d′) · y(d′, c); assign c if the score exceeds the category threshold
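A compact sketch of this kNN decision rule over sparse term-weight vectors (the similarity function and threshold are illustrative choices, not prescribed by the survey):

```python
# kNN category scoring: each of the k nearest training documents votes for its
# categories with weight equal to its similarity to the test document.
from collections import defaultdict

def dot_sim(u, v):
    """Similarity between two sparse {term: weight} vectors (cosine if pre-normalized)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def knn_classify(test_vec, training, k=5, threshold=0.1):
    """training: list of (doc_vector, set_of_categories); returns categories over threshold."""
    neighbors = sorted(training, key=lambda tr: dot_sim(test_vec, tr[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, cats in neighbors:
        s = dot_sim(test_vec, vec)
        for c in cats:
            scores[c] += s          # score(d, c) = sum of similarities of neighbors labeled c
    return {c for c, s in scores.items() if s >= threshold}

training = [({"ball": 1.0, "goal": 0.5}, {"sports"}),
            ({"cpu": 1.0, "ram": 0.5}, {"computers"})]
print(knn_classify({"ball": 0.8, "cpu": 0.1}, training, k=2))
```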
Classification Methods: LLSF • Stands for Linear Least Squares Fit • Obtain the matrix of word-category regression coefficients by LLSF : FLS = arg min over F of ‖FA − B‖², where A holds the training document (term) vectors and B the corresponding category vectors • FLS maps an arbitrary document vector to a vector of weighted categories • By thresholding, as in kNN, assign categories
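A small numpy illustration of the least-squares idea; the matrices are toy data and the documents-by-terms orientation is a convention chosen here for readability, not Yang's original code:

```python
# Fit a linear map F from term space to category space by least squares,
# then threshold the predicted category weights for a new document.
import numpy as np

# A: n_docs x n_terms matrix of term weights; B: n_docs x n_cats 0/1 category matrix
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])

F, *_ = np.linalg.lstsq(A, B, rcond=None)    # F minimizes ||A F - B||^2

new_doc = np.array([1.0, 0.0, 0.5])
category_scores = new_doc @ F                # weighted category vector for the new document
assigned = category_scores > 0.5             # thresholding, as with kNN
print(category_scores, assigned)
```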
Classification Methods: Naïve Bayes • Assumption • Words are drawn randomly from class-dependent lexicons (with replacement) • Word independence : P(w1, …, wn | Y) = ∏i P(wi | Y) • Result : P(Y | w1, …, wn) ∝ P(Y) ∏i P(wi | Y) • Classification rule : assign the class y that maximizes P(Y = y) ∏i P(wi | Y = y)
Naïve Bayes: Estimating the Parameters • Count frequencies in training data • Estimating P(Y) • Fraction of positive / negative examples in training data • Estimating P(W|Y) • Smoothing with the Laplace estimate : P(w | y) = (count(w, y) + 1) / (Σ over w′ of count(w′, y) + |V|)
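A self-contained multinomial Naive Bayes sketch using exactly these estimates (fraction-of-examples prior, Laplace-smoothed word probabilities); the toy corpus is invented:

```python
# Multinomial Naive Bayes with Laplace smoothing, in log space to avoid underflow.
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; returns (log P(y), log P(w|y), vocabulary)."""
    vocab = {w for d in docs for w in d}
    n = len(docs)
    log_prior, log_cond = {}, defaultdict(dict)
    for y in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == y]
        log_prior[y] = math.log(len(class_docs) / n)            # P(Y): fraction of examples
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        for w in vocab:                                          # Laplace: (count+1)/(total+|V|)
            log_cond[y][w] = math.log((counts[w] + 1) / (total + len(vocab)))
    return log_prior, log_cond, vocab

def classify_nb(doc, log_prior, log_cond, vocab):
    scores = {y: log_prior[y] + sum(log_cond[y][w] for w in doc if w in vocab)
              for y in log_prior}
    return max(scores, key=scores.get)

docs = [["ball", "goal"], ["team", "ball"], ["cpu", "ram"], ["disk", "cpu"]]
labels = ["sports", "sports", "computers", "computers"]
model = train_nb(docs, labels)
print(classify_nb(["ball", "goal", "cpu"], *model))
```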
Yiming Yang and Xin Liu, A re-examination of text categorization methods, Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, 1999. Experiment Results
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Text Classification using SVM • A statistical learning model of text classification with SVMs • The model bounds the expected error via the margin and the training error; the training-error term is 0 if the data are linearly separable
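For reference, a common modern way to reproduce this kind of setup is a TF-IDF plus linear-SVM pipeline in scikit-learn; this is not Joachims' SVM-light, just an illustrative sketch with made-up documents and labels:

```python
# TF-IDF features (high-dimensional, sparse document vectors) fed to a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["interest rates rise", "stocks fall on earnings",
         "midfielder scores twice", "team wins the cup"]
labels = ["finance", "finance", "sports", "sports"]

clf = make_pipeline(TfidfVectorizer(sublinear_tf=True), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["the striker scores", "rates and earnings"]))
```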
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Properties 1+2: Sparse Examples in High Dimension • High-dimensional feature vectors (30,000 features) • Sparse document vectors : only a few words of the whole language occur in each document • SVMs use overfitting protection which does not depend on the dimensionality of the feature space
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Property 3: Heterogeneous Use of Words • No pair of documents shares any words other than stopwords like ‘it’, ‘the’, ‘and’, ‘of’, ‘for’, ‘an’, ‘a’, ‘not’, ‘that’, ‘in’.
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Property 4: High Level of Redundancy • Few features are irrelevant! → feature space reduction causes loss of information
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Property 5: ‘Zipf’s Law’ Most words occur very infrequently!
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. TCat Concepts • Model of real text-classification tasks, used for the preceding analysis • TCat( [20:20:100], # high freq. [4:1:200], [1:4:200], [5:5:600], # medium freq. [9:1:3000], [1:9:3000], [10:10:4000] # low freq. ) • Each block [p:n:f] : a set of f features, from which each positive document draws p occurrences and each negative document draws n occurrences
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. TCat Concepts • Margin of TCat-concepts : TCat-concepts are linearly separable, with a lower bound on the margin γ • By Zipf's law, we can bound R², the squared length of the document vectors • Intuitively, many words with low frequency → relatively short document vectors → small R²/γ², hence linearly separable with a large relative margin
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. TCat Concepts • Bound on Expected Error of SVM : roughly of order R²/γ² divided by the number of training examples, so the small R²/γ² of TCat-concepts implies good generalization
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Text Classification using TSVM • How would you classify the test set? • Training set {D1, D6} • Test set {D2, D3, D4, D5}
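scikit-learn has no TSVM implementation, so the sketch below only imitates the transductive idea with a simple self-training loop over invented stand-in documents; it is not Joachims' TSVM algorithm:

```python
# Self-training stand-in for transduction: fit on the labeled documents, then
# repeatedly pseudo-label the unlabeled test documents and refit on everything,
# letting the test set influence the decision boundary.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

labeled = ["interest rates rise", "midfielder scores twice"]          # stand-ins for D1, D6
unlabeled = ["stocks fall on earnings", "bank cuts rates",
             "team wins the cup", "striker scores a goal"]            # stand-ins for D2..D5
y_l = np.array([0, 1])

vec = TfidfVectorizer()
X_all = vec.fit_transform(labeled + unlabeled)
X_l, X_u = X_all[:len(labeled)], X_all[len(labeled):]

clf = LinearSVC().fit(X_l, y_l)
for _ in range(3):                          # pseudo-label the test set and refit
    pseudo = clf.predict(X_u)
    clf.fit(X_all, np.concatenate([y_l, pseudo]))
print(clf.predict(X_u))
```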
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Why Does Adding Test Examples Reduce Error? • The unlabeled test documents reveal how words co-occur, so the TSVM can choose the hyperplane that separates these co-occurrence clusters with a large margin on training and test data together
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Experiment Results • Data set • Reuters-21578 dataset (ModApte split) • Training : 9,603, test : 3,299 • WebKB collection of WWW pages • Only the classes ‘course’, ‘faculty’, ‘project’, and ‘student’ are used • Stemming and stopword removal are not used • Ohsumed corpus compiled by William Hersh • Training : 10,000, test : 10,000
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Experiment Results • Results P/R-breakeven point for Reuters categories
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999. Experiment Results • Results Average P/R-breakeven point on WebKB Average P/R-breakeven point on Ohsumed
Hierarchical Text Classification • Real-world classification → complex hierarchical structure • Due to the difficulty of training for many classes or features • [Figure: documents assigned to Level-1 classes (Class 1, 2, 3, …), each split into Level-2 subclasses (Class 1-1, 1-2, 1-3, 2-1, …)]
Hierarchical Text Classification • More accurate, specialized classifiers • [Figure: documents → Computers (Hardware, Software, Chat) and Sports (Soccer, Football)] • ‘computer’ is discriminating at the top level (Computers vs. Sports), but not among the subcategories of Computers
S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263. Experiment Setting • Data set : LookSmart’s web directory • Using short summaries from the search engine • 370,597 unique pages • 17,173 categories • 7-level hierarchy • Focus on 13 top-level and 150 second-level categories
S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263. Experiment Setting • Using SVM • Posterior probabilities by regularized maximum likelihood fitting • Combining probabilities from the first and second level • Boolean scoring function : P(L1) && P(L2), i.e. both levels must exceed their thresholds, or • Multiplicative scoring function : P(L1) * P(L2), compared against a single threshold
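A tiny sketch of the two combination rules; the threshold values are illustrative, not the ones used in the paper:

```python
# Combine first-level and second-level posteriors with either a Boolean rule
# (both levels pass their thresholds) or a multiplicative rule (product passes one threshold).
def boolean_rule(p_l1, p_l2, t1=0.5, t2=0.5):
    return p_l1 >= t1 and p_l2 >= t2

def multiplicative_rule(p_l1, p_l2, t=0.25):
    return p_l1 * p_l2 >= t

print(boolean_rule(0.8, 0.4), multiplicative_rule(0.8, 0.4))
```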
S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263. Experiment Results • Non-hierarchical (baseline) : F1 = 0.476 • Hierarchical • Top-level • Training set : F1 = 0.649 • Test set : F1 = 0.572 • Second-level • Multiplicative : F1 = 0.495 • Boolean : F1 = 0.497 • Assuming top-level classification is correct, • F1 = 0.711
Summary • Feature space reduction can shrink the data set, but risks discarding relevant features • Performance of SVM and TSVM is better than that of the other methods • TSVM has merits in text classification • Hierarchical classification is helpful • Other issues • Sampling strategies • Other kinds of feature selection
Reference • T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998. • T. Joachims, Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the International Conference on Machine Learning (ICML), 1999. • T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. • Robert Cooley, Classification of News Stories Using Support Vector Machines. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence Text Mining Workshop, August 1999. • Yiming Yang and Xin Liu, A re-examination of text categorization methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999. • S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263.