A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito
Introduction: Text Categorization • Many digital texts are available • E-mail, online news, blogs … • The need for automatic text categorization is growing • No human labor required • Saves time and cost
Introduction: Text Categorization • Applications • Spam filtering • Topic categorization
Introduction: Machine Learning • Builds categorization rules automatically from features of the text • Types of machine learning (ML) • Supervised learning: labeling • Unsupervised learning: clustering
Introduction: Flow of ML • Prepare labeled training texts • Extract features from the texts • Learn a model • Categorize new texts (Figure: an unlabeled text is assigned to Label1 or Label2)
Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion
Number of labels • Binary-label • True or false (e.g., spam or not) • Other settings can be reduced to this case • Multi-label • Many labels, but each text has exactly one label • Overlapping-label • One text may carry several labels (Figure: yes/no vs. one-of-many vs. several-of-many label assignments)
Types of labels • Topic categorization • The basic task • Compares individual words • Author categorization • Sentiment categorization • E.g., product reviews • Needs more linguistic information
Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion
Feature of Text • How can a text's features be expressed? • "Bag of Words" • Ignores word order (structure) • E.g., "I like this car." vs. "I don't like this car." • The bags are nearly identical, so "Bag of Words" will not work well here (see the sketch below) • Notation: d = document (text), t = term (word)
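For illustration, a minimal Python sketch of the bag-of-words idea; the tokenizer here is deliberately crude and all data is invented:

```python
import string
from collections import Counter

def bag_of_words(document):
    """Represent a document as term counts, ignoring word order."""
    # Crude tokenization: lowercase and strip punctuation (so "don't" -> "dont").
    tokens = document.lower().translate(
        str.maketrans("", "", string.punctuation)).split()
    return Counter(tokens)

d1 = bag_of_words("I like this car.")
d2 = bag_of_words("I don't like this car.")
# The two bags are nearly identical; word order and structure are lost.
print(d1)
print(d2)
```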
Preprocessing • Remove stop words • "the", "a", "for", … • Stemming • relational -> relate, truly -> true
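A toy preprocessing sketch; the stop-word list and suffix rules below are invented for illustration (a real stemmer such as Porter's uses far richer rules and produces stems like "relat" rather than dictionary words):

```python
STOP_WORDS = {"the", "a", "for", "of", "and", "to"}  # tiny illustrative list

def crude_stem(word):
    # Very rough suffix stripping, only to illustrate the idea of stemming.
    for suffix in ("ional", "ly", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(tokens):
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess(["the", "relational", "model", "truly", "works"]))
# -> ['relat', 'model', 'tru', 'work']
```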
Term Weighting • Term frequency (tf) • The number of times a term occurs in a document • Terms frequent within a document seem important for categorization • tf·idf • Terms appearing in many documents are not useful for categorization, so tf is discounted by document frequency: tfidf(t, d) = tf(t, d) × log(N / df(t))
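A small sketch of tf·idf under the standard definition above (data invented):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()                       # in how many documents each term appears
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["good", "car"], ["good", "movie"], ["bad", "movie"]]
# "good" appears in two of three documents, so its idf (and weight) is low.
print(tf_idf(docs))
```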
Sentiment Weighting • For sentiment classification, weight a word as positive or negative • Constructing a sentiment dictionary • WordNet [04 Kamps et al.], a synonym database • Use the graph distance from 'good' and 'bad': d(good, happy) = 2, d(bad, happy) = 4
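A sketch of the distance idea on a toy synonym graph; the edges below are invented so that the distances match the slide's example (the real method of [04 Kamps et al.] walks WordNet's synonym links):

```python
from collections import deque

SYNONYMS = {                     # invented toy graph, not real WordNet data
    "good": ["fine"],
    "fine": ["good", "happy", "glad"],
    "happy": ["fine"],
    "glad": ["fine", "sad"],
    "sad": ["glad", "bad"],
    "bad": ["sad"],
}

def distance(a, b):
    """Shortest-path distance between two words in the synonym graph (BFS)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        word, d = queue.popleft()
        if word == b:
            return d
        for nxt in SYNONYMS.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

# 'happy' is closer to 'good' than to 'bad', so weight it as positive.
print(distance("good", "happy"), distance("bad", "happy"))   # 2 4
```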
Dimension Reduction • The feature matrix has size (#terms) × (#documents) • #terms ≈ size of the dictionary • High computation cost • Risk of overfitting • Best for training data ≠ best for real data • Choosing effective features improves accuracy and computation cost
Dimension Reduction • df-threshold • Terms appearing in very few documents (e.g., only one) are not important • Term-category score, e.g., (pointwise) mutual information: Score(t, c_j) = log( P(t, c_j) / (P(t) P(c_j)) ) • If t and c_j are independent, the score is zero
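A sketch of the independence property, using pointwise mutual information as the (assumed) scoring function, since the slide does not name the exact one:

```python
import math

def pmi(p_t_c, p_t, p_c):
    """Pointwise mutual information of term t and category c.
    Zero exactly when t and c are independent, i.e. P(t, c) = P(t) P(c)."""
    return math.log(p_t_c / (p_t * p_c))

print(pmi(0.06, 0.2, 0.3))   # 0.06 == 0.2 * 0.3, independent -> 0.0
print(pmi(0.15, 0.2, 0.3))   # term over-represented in c -> positive score
```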
Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion
Learning Algorithm • Many (almost all?) ML algorithms have been applied to text categorization • Simple approaches • Naïve Bayes • k-Nearest Neighbor • High-performance approaches • Boosting • Support Vector Machine • Hierarchical learning
Naïve Bayes • Bayes' rule: P(c|d) = P(c) P(d|c) / P(d) • P(d|c) is hard to calculate directly • Assumption: each term occurs independently, so P(d|c) = ∏_i P(t_i|c)
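A minimal multinomial Naïve Bayes sketch with add-one smoothing (data and class names invented):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.term_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.term_counts[c].update(doc)
            self.vocab.update(doc)

    def predict(self, doc):
        def log_posterior(c):
            total = sum(self.term_counts[c].values())
            # log P(c) + sum_i log P(t_i | c), assuming term independence
            return math.log(self.prior[c]) + sum(
                math.log((self.term_counts[c][t] + 1) / (total + len(self.vocab)))
                for t in doc)
        return max(self.classes, key=log_posterior)

nb = NaiveBayes()
nb.fit([["buy", "cheap", "pills"], ["meeting", "agenda"]], ["spam", "ham"])
print(nb.predict(["cheap", "meeting"]))
```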
k-Nearest Neighbor • Define a "distance" between two texts • E.g., Sim(d1, d2) = d1·d2 / (|d1||d2|) = cos θ • Check the k most similar texts and categorize by majority vote • As the stored data grows, memory and search costs grow (Figure: a query point classified by its k = 3 nearest neighbors)
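A compact k-NN sketch over bag-of-words vectors with cosine similarity (data invented):

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(d1[t] * d2[t] for t in d1)
    norm = math.sqrt(sum(v * v for v in d1.values())) * \
           math.sqrt(sum(v * v for v in d2.values()))
    return dot / norm if norm else 0.0

def knn_classify(query, training, k=3):
    """training: list of (Counter, label) pairs; majority vote of k nearest."""
    nearest = sorted(training, key=lambda x: cosine(query, x[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(Counter("good fun movie".split()), "pos"),
         (Counter("great fun film".split()), "pos"),
         (Counter("boring bad movie".split()), "neg")]
print(knn_classify(Counter("fun movie".split()), train, k=3))   # -> pos
```

Note that every query scans the whole training set, which is why memory and search costs grow with the data.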
Boosting • BoosTexter [00 Schapire et al.] • AdaBoost • Builds many "weak learners" with different parameters • The K-th weak learner checks the performance of learners 1..K-1 and tries to classify correctly the training data they scored worst on • BoosTexter uses decision stumps as weak learners
Simple example of Boosting (Figure: three rounds; misclassified + and - points gain weight between rounds)
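A small AdaBoost sketch with "is the term present?" decision stumps, in the spirit of BoosTexter (the real system handles multi-label data and richer stumps; data invented):

```python
import math

def train_adaboost(docs, labels, vocab, rounds=3):
    """docs: list of term sets; labels: +1/-1. Returns [(alpha, term, polarity)]."""
    n = len(docs)
    w = [1.0 / n] * n                          # example weights
    ensemble = []
    for _ in range(rounds):
        best = None                            # stump with lowest weighted error
        for term in vocab:
            for pol in (1, -1):
                pred = [pol if term in d else -pol for d in docs]
                err = sum(wi for wi, p, y in zip(w, pred, labels) if p != y)
                if best is None or err < best[0]:
                    best = (err, term, pol, pred)
        err, term, pol, pred = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, term, pol))
        # Reweight: misclassified examples get more weight in the next round.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, labels, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, doc):
    score = sum(a * (p if t in doc else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

docs = [{"cheap", "pills"}, {"cheap", "meeting"}, {"agenda", "meeting"}]
labels = [1, 1, -1]                            # +1 = spam, -1 = ham
model = train_adaboost(docs, labels, {"cheap", "pills", "meeting", "agenda"})
print(predict(model, {"cheap", "offer"}))      # -> 1 (spam)
```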
Support Vector Machine • Text categorization with SVM [98 Joachims] • Maximize the margin between classes
Text Categorization with SVM • SVM works well for text categorization • Robust to high dimensionality • Robust to overfitting • Most text categorization problems are linearly separable • All of OHSUMED (MEDLINE collection) • Most of Reuters-21578 (news collection)
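A sketch of the pipeline using scikit-learn, assumed available here purely for illustration (Joachims' original experiments used his own implementation, SVM-light; data invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["cheap pills buy now", "meeting agenda attached",
         "buy cheap offer", "project meeting notes"]   # invented toy data
labels = ["spam", "ham", "spam", "ham"]

# A linear kernel is the usual choice: text features are high-dimensional
# and the classes are often linearly separable.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["cheap meeting pills"]))
```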
Comparison of these methods • [02 Sebastiani] • Reuters-21578 (two versions) • Difference: the number of categories
Hierarchical Learning • TreeBoost [06 Esuli et al.] • A boosting algorithm for hierarchical labels • Training data: a label hierarchy plus labeled texts • Applies AdaBoost recursively (sketch after the figure below) • A better classifier than "flat" AdaBoost • Accuracy: up 2-3% • Time: both training and categorization time go down • Hierarchical SVM [04 Cai et al.]
TreeBoost (Figure: label hierarchy; root → L1…L4, L1 → L11, L12; L4 → L41, L42, L43; L42 → L421, L422)
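A simplified recursive sketch of the TreeBoost idea: train one classifier per internal node of the label tree, each distinguishing only among that node's children. The per-node learner below is a trivial stand-in (a real TreeBoost node runs AdaBoost there), and the data format is invented:

```python
from collections import Counter

def train_stub(examples):
    # Stand-in for the per-node boosted classifier: remembers the majority
    # child label. TreeBoost would run AdaBoost on these examples instead.
    return Counter(lbl for _, lbl in examples).most_common(1)[0][0]

def train_treeboost(node, docs):
    """node: {"label": str, "children": [subtrees]};
    docs: list of (text, path-of-labels-from-root-to-leaf)."""
    if not node["children"]:
        return None
    child_names = {c["label"] for c in node["children"]}
    # Keep documents whose label path passes through a child of this node,
    # relabeled by which child they descend into.
    local = [(text, lbl) for text, path in docs
             for lbl in path if lbl in child_names]
    return {"classifier": train_stub(local),
            "children": {c["label"]: train_treeboost(c, docs)
                         for c in node["children"]}}

tree = {"label": "root", "children": [
    {"label": "L1", "children": []},
    {"label": "L4", "children": [{"label": "L41", "children": []},
                                 {"label": "L42", "children": []}]}]}
docs = [("text a", ["L1"]), ("text b", ["L4", "L41"]), ("text c", ["L4", "L42"])]
print(train_treeboost(tree, docs))
```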
Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion
Conclusion • Overview of text categorization with machine learning • Feature of text • Learning algorithm • Future work • Natural language processing with machine learning, especially in Japanese • Computation cost