A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito
Introduction: Text Categorization • Many digital texts are available • E-mail, online news, blogs … • The need for automatic text categorization is growing • No human labor required • Saves time and cost
Introduction: Text Categorization • Applications • Spam filtering • Topic categorization
Introduction: Machine Learning • Builds categorization rules automatically from features of the text • Types of machine learning (ML) • Supervised learning: labeling • Unsupervised learning: clustering
Introduction: Flow of ML • Prepare labeled training texts • Extract features from the texts • Learn a model • Categorize new texts (Figure: an unlabeled text is assigned to Label1 or Label2)
Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion
Number of labels • Binary-label • True or false (e.g., spam or not) • Other settings can be reduced to this case • Multi-label • Many labels, but each text has exactly one label • Overlapping-label • One text may carry several labels (Figure: yes/no vs. one-of-many vs. several-of-many label assignments)
Types of labels • Topic categorization • The basic task • Compares individual words • Author categorization • Sentiment categorization • E.g., product reviews • Needs more linguistic information
Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion
Feature of Text • How can a text's features be expressed? • "Bag of Words" • Ignores word order (structure) • E.g., "I like this car." vs. "I don't like this car." • The bags are nearly identical, so "Bag of Words" will not work well here (see the sketch below) • Notation: d = document (text), t = term (word)
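For illustration, a minimal Python sketch of the bag-of-words idea; the tokenizer here is deliberately crude and all data is invented:

```python
import string
from collections import Counter

def bag_of_words(document):
    """Represent a document as term counts, ignoring word order."""
    # Crude tokenization: lowercase and strip punctuation (so "don't" -> "dont").
    tokens = document.lower().translate(
        str.maketrans("", "", string.punctuation)).split()
    return Counter(tokens)

d1 = bag_of_words("I like this car.")
d2 = bag_of_words("I don't like this car.")
# The two bags are nearly identical; word order and structure are lost.
print(d1)
print(d2)
```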
Preprocessing • Remove stop words • "the", "a", "for", … • Stemming • relational -> relate, truly -> true
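A toy preprocessing sketch; the stop-word list and suffix rules below are invented for illustration (a real stemmer such as Porter's uses far richer rules and produces stems like "relat" rather than dictionary words):

```python
STOP_WORDS = {"the", "a", "for", "of", "and", "to"}  # tiny illustrative list

def crude_stem(word):
    # Very rough suffix stripping, only to illustrate the idea of stemming.
    for suffix in ("ional", "ly", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(tokens):
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess(["the", "relational", "model", "truly", "works"]))
# -> ['relat', 'model', 'tru', 'work']
```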
Term Weighting • Term frequency (tf) • The number of times a term occurs in a document • Terms frequent within a document seem important for categorization • tf·idf • Terms appearing in many documents are not useful for categorization, so tf is discounted by document frequency: tfidf(t, d) = tf(t, d) × log(N / df(t))
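A small sketch of tf·idf under the standard definition above (data invented):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()                       # in how many documents each term appears
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["good", "car"], ["good", "movie"], ["bad", "movie"]]
# "good" appears in two of three documents, so its idf (and weight) is low.
print(tf_idf(docs))
```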
Sentiment Weighting • For sentiment classification, weight a word as positive or negative • Constructing a sentiment dictionary • WordNet [04 Kamps et al.], a synonym database • Use the graph distance from 'good' and 'bad': d(good, happy) = 2, d(bad, happy) = 4
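A sketch of the distance idea on a toy synonym graph; the edges below are invented so that the distances match the slide's example (the real method of [04 Kamps et al.] walks WordNet's synonym links):

```python
from collections import deque

SYNONYMS = {                     # invented toy graph, not real WordNet data
    "good": ["fine"],
    "fine": ["good", "happy", "glad"],
    "happy": ["fine"],
    "glad": ["fine", "sad"],
    "sad": ["glad", "bad"],
    "bad": ["sad"],
}

def distance(a, b):
    """Shortest-path distance between two words in the synonym graph (BFS)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        word, d = queue.popleft()
        if word == b:
            return d
        for nxt in SYNONYMS.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

# 'happy' is closer to 'good' than to 'bad', so weight it as positive.
print(distance("good", "happy"), distance("bad", "happy"))   # 2 4
```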
Dimension Reduction • The feature matrix has size (#terms) × (#documents) • #terms ≈ size of the dictionary • High computation cost • Risk of overfitting • Best for training data ≠ best for real data • Choosing effective features improves accuracy and computation cost
Dimension Reduction • df-threshold • Terms appearing in very few documents (e.g., only one) are not important • Term-category score, e.g., (pointwise) mutual information: Score(t, c_j) = log( P(t, c_j) / (P(t) P(c_j)) ) • If t and c_j are independent, the score is zero
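A sketch of the independence property, using pointwise mutual information as the (assumed) scoring function, since the slide does not name the exact one:

```python
import math

def pmi(p_t_c, p_t, p_c):
    """Pointwise mutual information of term t and category c.
    Zero exactly when t and c are independent, i.e. P(t, c) = P(t) P(c)."""
    return math.log(p_t_c / (p_t * p_c))

print(pmi(0.06, 0.2, 0.3))   # 0.06 == 0.2 * 0.3, independent -> 0.0
print(pmi(0.15, 0.2, 0.3))   # term over-represented in c -> positive score
```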
Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion
Learning Algorithm • Many (almost all?) ML algorithms have been applied to text categorization • Simple approaches • Naïve Bayes • k-Nearest Neighbor • High-performance approaches • Boosting • Support Vector Machine • Hierarchical learning
Naïve Bayes • Bayes' rule: P(c|d) = P(c) P(d|c) / P(d) • P(d|c) is hard to calculate directly • Assumption: each term occurs independently, so P(d|c) = ∏_i P(t_i|c)
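A minimal multinomial Naïve Bayes sketch with add-one smoothing (data and class names invented):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.term_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.term_counts[c].update(doc)
            self.vocab.update(doc)

    def predict(self, doc):
        def log_posterior(c):
            total = sum(self.term_counts[c].values())
            # log P(c) + sum_i log P(t_i | c), assuming term independence
            return math.log(self.prior[c]) + sum(
                math.log((self.term_counts[c][t] + 1) / (total + len(self.vocab)))
                for t in doc)
        return max(self.classes, key=log_posterior)

nb = NaiveBayes()
nb.fit([["buy", "cheap", "pills"], ["meeting", "agenda"]], ["spam", "ham"])
print(nb.predict(["cheap", "meeting"]))
```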
k-Nearest Neighbor • Define a "distance" between two texts • E.g., Sim(d1, d2) = d1·d2 / (|d1||d2|) = cos θ • Check the k most similar texts and categorize by majority vote • As the stored data grows, memory and search costs grow (Figure: a query point classified by its k = 3 nearest neighbors)
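A compact k-NN sketch over bag-of-words vectors with cosine similarity (data invented):

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(d1[t] * d2[t] for t in d1)
    norm = math.sqrt(sum(v * v for v in d1.values())) * \
           math.sqrt(sum(v * v for v in d2.values()))
    return dot / norm if norm else 0.0

def knn_classify(query, training, k=3):
    """training: list of (Counter, label) pairs; majority vote of k nearest."""
    nearest = sorted(training, key=lambda x: cosine(query, x[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(Counter("good fun movie".split()), "pos"),
         (Counter("great fun film".split()), "pos"),
         (Counter("boring bad movie".split()), "neg")]
print(knn_classify(Counter("fun movie".split()), train, k=3))   # -> pos
```

Note that every query scans the whole training set, which is why memory and search costs grow with the data.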
Boosting • BoosTexter [00 Schapire et al.] • AdaBoost • Builds many "weak learners" with different parameters • The K-th weak learner checks the performance of learners 1..K-1 and tries to classify correctly the training data they scored worst on • BoosTexter uses decision stumps as weak learners
Simple example of Boosting (Figure: three rounds; misclassified + and - points gain weight between rounds)
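A small AdaBoost sketch with "is the term present?" decision stumps, in the spirit of BoosTexter (the real system handles multi-label data and richer stumps; data invented):

```python
import math

def train_adaboost(docs, labels, vocab, rounds=3):
    """docs: list of term sets; labels: +1/-1. Returns [(alpha, term, polarity)]."""
    n = len(docs)
    w = [1.0 / n] * n                          # example weights
    ensemble = []
    for _ in range(rounds):
        best = None                            # stump with lowest weighted error
        for term in vocab:
            for pol in (1, -1):
                pred = [pol if term in d else -pol for d in docs]
                err = sum(wi for wi, p, y in zip(w, pred, labels) if p != y)
                if best is None or err < best[0]:
                    best = (err, term, pol, pred)
        err, term, pol, pred = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, term, pol))
        # Reweight: misclassified examples get more weight in the next round.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, labels, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, doc):
    score = sum(a * (p if t in doc else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

docs = [{"cheap", "pills"}, {"cheap", "meeting"}, {"agenda", "meeting"}]
labels = [1, 1, -1]                            # +1 = spam, -1 = ham
model = train_adaboost(docs, labels, {"cheap", "pills", "meeting", "agenda"})
print(predict(model, {"cheap", "offer"}))      # -> 1 (spam)
```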
Support Vector Machine • Text categorization with SVM [98 Joachims] • Maximize the margin between classes
Text Categorization with SVM • SVM works well for text categorization • Robust to high dimensionality • Robust to overfitting • Most text categorization problems are linearly separable • All of OHSUMED (MEDLINE collection) • Most of Reuters-21578 (news collection)
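A sketch of the pipeline using scikit-learn, assumed available here purely for illustration (Joachims' original experiments used his own implementation, SVM-light; data invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["cheap pills buy now", "meeting agenda attached",
         "buy cheap offer", "project meeting notes"]   # invented toy data
labels = ["spam", "ham", "spam", "ham"]

# A linear kernel is the usual choice: text features are high-dimensional
# and the classes are often linearly separable.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["cheap meeting pills"]))
```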
Comparison of these methods • [02 Sebastiani] • Reuters-21578 (two versions) • Difference: the number of categories
Hierarchical Learning • TreeBoost [06 Esuli et al.] • A boosting algorithm for hierarchical labels • Training data: a label hierarchy plus labeled texts • Applies AdaBoost recursively (sketch after the figure below) • A better classifier than "flat" AdaBoost • Accuracy: up 2-3% • Time: both training and categorization time go down • Hierarchical SVM [04 Cai et al.]
TreeBoost (Figure: label hierarchy; root → L1…L4, L1 → L11, L12; L4 → L41, L42, L43; L42 → L421, L422)
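A simplified recursive sketch of the TreeBoost idea: train one classifier per internal node of the label tree, each distinguishing only among that node's children. The per-node learner below is a trivial stand-in (a real TreeBoost node runs AdaBoost there), and the data format is invented:

```python
from collections import Counter

def train_stub(examples):
    # Stand-in for the per-node boosted classifier: remembers the majority
    # child label. TreeBoost would run AdaBoost on these examples instead.
    return Counter(lbl for _, lbl in examples).most_common(1)[0][0]

def train_treeboost(node, docs):
    """node: {"label": str, "children": [subtrees]};
    docs: list of (text, path-of-labels-from-root-to-leaf)."""
    if not node["children"]:
        return None
    child_names = {c["label"] for c in node["children"]}
    # Keep documents whose label path passes through a child of this node,
    # relabeled by which child they descend into.
    local = [(text, lbl) for text, path in docs
             for lbl in path if lbl in child_names]
    return {"classifier": train_stub(local),
            "children": {c["label"]: train_treeboost(c, docs)
                         for c in node["children"]}}

tree = {"label": "root", "children": [
    {"label": "L1", "children": []},
    {"label": "L4", "children": [{"label": "L41", "children": []},
                                 {"label": "L42", "children": []}]}]}
docs = [("text a", ["L1"]), ("text b", ["L4", "L41"]), ("text c", ["L4", "L42"])]
print(train_treeboost(tree, docs))
```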
Outline • Introduction • Text Categorization • Feature of Text • Learning Algorithm • Conclusion
Conclusion • Overview of text categorization with machine learning • Feature of text • Learning algorithm • Future work • Natural language processing with machine learning, especially in Japanese • Computation cost