Integrated Instance- and Class-based Generative Modeling for Text Classification
Antti Puurula, University of Waikato
Sung-Hyon Myaeng, KAIST
5/12/2013, Australasian Document Computing Symposium
Instance vs. Class-based Text Classification
• Class-based learning
  • Multinomial Naive Bayes, Logistic Regression, Support Vector Machines, …
  • Pros: compact models, efficient inference, accurate with text data
  • Cons: document-level information discarded
• Instance-based learning
  • K-Nearest Neighbors, Kernel Density Classifiers, …
  • Pros: document-level information preserved, efficient learning
  • Cons: data sparsity reduces accuracy
Instance vs. Class-based Text Classification 2
• Proposal: Tied Document Mixture
  • integrated instance- and class-based model
  • retains benefits from both types of modeling
  • exact linear-time algorithms for estimation and inference
• Main ideas:
  • replace the Multinomial class-conditional in MNB with a mixture over documents
  • smooth document models hierarchically with class and background models
Multinomial Naive Bayes
• Standard generative model for text classification
• Result of simple generative assumptions:
  • Bayes
  • Naive
  • Multinomial
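The equations that accompanied these three assumptions did not survive extraction; the standard MNB decision rule they lead to is, in the usual notation (word-count vector $W$, class prior $p(c)$, class-conditional unigram model $p(w \mid c)$):

```latex
% Bayes:        p(c \mid W) \propto p(c)\, p(W \mid c)
% Naive:        words are conditionally independent given the class
% Multinomial:  p(W \mid c) \propto \prod_w p(w \mid c)^{W_w}
\hat{c} = \arg\max_{c} \Big[ \log p(c) + \sum_{w} W_w \log p(w \mid c) \Big]
```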
Tied Document Mixture
• Replace the Multinomial in MNB by a mixture over all documents
• Document models are smoothed hierarchically with class and background models
• Class models are estimated by averaging the documents
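The slide's formulas were lost in extraction; the following is a hedged reconstruction of the model as the bullets describe it, assuming the mixture for class $c$ ranges over that class's training documents $D_c$ with uniform weights, and writing the (unspecified) interpolation weights as $\alpha_d$ and $\alpha_c$:

```latex
% Class-conditional replaced by a mixture over the class's documents (uniform weights assumed)
p(W \mid c) = \frac{1}{|D_c|} \sum_{d \in D_c} \prod_{w} p(w \mid d)^{W_w}

% Document models smoothed hierarchically toward class and background models
p(w \mid d) = (1-\alpha_d)\,\hat{p}(w \mid d) + \alpha_d \big[(1-\alpha_c)\,\hat{p}(w \mid c) + \alpha_c\, p(w)\big]

% Class models estimated by averaging the class's document models
\hat{p}(w \mid c) = \frac{1}{|D_c|} \sum_{d \in D_c} \hat{p}(w \mid d)
```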
Tied Document Mixture 3
• Can be described as constraints on a two-level mixture
• Document-level mixture:
  • Number of components = number of training documents
  • Components assigned to instances
  • Component weights tied (uniform over the class's documents)
• Word-level mixture:
  • Number of components = hierarchy depth (document, class, background)
  • Components assigned to hierarchy levels
  • Component weights = the hierarchical smoothing weights
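To make the two-level mixture concrete, here is a minimal dense (pre-sparse-inference) scoring sketch in Python; the uniform document weights, the two interpolation weights, and the dict-based model layout are illustrative assumptions, not the authors' implementation:

```python
import math

def tdm_log_score(test_counts, class_docs, class_model, background, a_doc=0.5, a_cls=0.5):
    """Log p(W | c) for one class: a uniform mixture over the class's document models,
    each smoothed toward the class model and then the background model.
    Models are dicts mapping word -> probability; test_counts maps word -> count."""
    doc_log_likes = []
    for doc_model in class_docs:                      # one mixture component per training document
        log_like = 0.0
        for word, count in test_counts.items():
            p_cls = (1 - a_cls) * class_model.get(word, 0.0) + a_cls * background.get(word, 1e-9)
            p = (1 - a_doc) * doc_model.get(word, 0.0) + a_doc * p_cls
            log_like += count * math.log(p)
        doc_log_likes.append(log_like)
    # log of the uniform mixture: log-sum-exp over components minus log |D_c|
    m = max(doc_log_likes)
    return m + math.log(sum(math.exp(x - m) for x in doc_log_likes)) - math.log(len(doc_log_likes))
```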
Tied Document Mixture 4
• Can be described as a class-smoothed Kernel Density Classifier
• The document mixture is equivalent to a Multinomial kernel density
• Hierarchical smoothing corresponds to mean shift or data sharpening with class centroids
Hierarchical Sparse Inference
• Reduces the complexity of exact inference from dense evaluation over all documents and features to sparse evaluation over inverted indices
• Same complexity as K-Nearest Neighbors based on inverted indices (Yang, 1994)
Hierarchical Sparse Inference 2
• Precompile values so that the smoothed log-likelihood decomposes into shared and document-specific terms
• Store the class- and document-specific terms in inverted indices
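The precompiled quantities themselves were lost with the slide's equations; the following is an assumed reconstruction of the standard decomposition such a scheme relies on, with $p_s(w \mid c)$ denoting the smoothed class/background model:

```latex
% For words w not occurring in training document d, the smoothed model collapses to the
% class-level term p_s(w \mid c), so the log-likelihood of a test document W decomposes:
\sum_{w} W_w \log p(w \mid d)
  = \underbrace{\sum_{w} W_w \log p_s(w \mid c)}_{\text{shared by all } d \in D_c}
  + \underbrace{\sum_{w \in d} W_w \big[\log p(w \mid d) - \log p_s(w \mid c)\big]}_{\text{sparse corrections stored per } (w, d)}
```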
Hierarchical Sparse Inference 3
• Compute the shared background term first
• Update with the class-specific terms from the inverted index
• Update with the document-specific terms to get the document scores
• Compute the joint likelihood for each class
• Apply Bayes' rule to obtain the class posteriors
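A minimal Python sketch of this inverted-index update order (background term first, then class-level updates, then document-level updates); the index layout, the precomputed correction terms and all names are assumptions for illustration:

```python
import math

def sparse_doc_scores(test_counts, background, class_index, doc_index, doc_class):
    """Score training documents against one test document via inverted indices.
    background     : dict word -> background probability
    class_index[w] : list of (class_id, log p_s(w|c) - log p(w)) postings
    doc_index[w]   : list of (doc_id, log p(w|d) - log p_s(w|c)) postings
    doc_class[d]   : class of training document d
    Only postings of words occurring in the test document are touched."""
    # 1) shared background score
    base = sum(n * math.log(background.get(w, 1e-9)) for w, n in test_counts.items())
    # 2) class-level sparse updates
    cls_scores = {}
    for w, n in test_counts.items():
        for c, corr in class_index.get(w, ()):
            cls_scores[c] = cls_scores.get(c, base) + n * corr
    # 3) document-level sparse updates, each document starting from its class score
    doc_scores = {}
    for w, n in test_counts.items():
        for d, corr in doc_index.get(w, ()):
            start = doc_scores.get(d, cls_scores.get(doc_class[d], base))
            doc_scores[d] = start + n * corr
    return doc_scores
```

A class posterior would then follow by a log-sum-exp over each class's document scores (documents with no postings hit keep their class-level score), adding the class prior, and normalizing with Bayes' rule.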
Experimental Setup
• 14 classification datasets used:
  • 3 spam classification
  • 3 sentiment analysis
  • 5 multi-class classification
  • 3 multi-label classification
• Scripts and datasets in LIBSVM format: http://sourceforge.net/projects/sgmweka/
Experimental Setup 2
• Classifiers compared:
  • Multinomial Naive Bayes (MNB)
  • Tied Document Mixture (TDM)
  • K-Nearest Neighbors (KNN) (Multinomial distance, distance-weighted vote)
  • Kernel Density Classifier (KDC) (smoothed Multinomial kernel)
  • Logistic Regression (LR, LR+) (L2-regularized)
  • Support Vector Machine (SVM, SVM+) (L2-regularized, L2-loss)
• LR+ and SVM+ weighted feature vectors by TFIDF
• Smoothing parameters optimized for micro-averaged F-score on held-out development sets using Gaussian random searches
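As a point of reference only, the TFIDF-weighted linear baseline (SVM+) could be approximated in scikit-learn roughly as below; the vectorizer settings and the regularization constant are assumptions, not the tuned configuration used in the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# L2-regularized, L2-loss (squared hinge) linear SVM over TFIDF-weighted feature vectors
svm_plus = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),   # weighting scheme assumed, not the paper's exact one
    LinearSVC(penalty="l2", loss="squared_hinge", C=1.0),
)
# usage (hypothetical variable names):
# svm_plus.fit(train_texts, train_labels)
# predictions = svm_plus.predict(test_texts)
```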
Results
• Training times for MNB, TDM, KNN and KDC are linear
  • At most 70 s for MNB on OHSU-TREC, 170 s for the others
• SVM and LR require iterative algorithms
  • At most 936 s, for LR on Amazon12
  • Did not scale to the multi-label datasets in practical time
• Classification times for the instance-based classifiers are higher
  • At most 226 ms mean for TDM on OHSU-TREC, compared to 70 ms for MNB (with 290k terms, 196k labels, 197k documents)
Results 2
• TDM significantly improves on MNB, KNN and KDC
• Across comparable datasets, TDM is on par with SVM+
  • SVM+ is significantly better on multi-class datasets
  • TDM is significantly better on spam classification
Results 3
• TDM reduces classification errors compared to MNB by:
  • >65% in spam classification
  • >26% in sentiment analysis
• Some correlation between error reduction and the number of instances per class
• Task types form clearly separate clusters
Conclusion
• Tied Document Mixture
  • Integrated instance- and class-based model for text classification
  • Exact linear-time algorithms, with the same complexities as KNN and KDC
  • Accuracy substantially improved over MNB, KNN and KDC
  • Competitive with optimized SVM, depending on task type
• Many improvements to the basic model possible
• Sparse inference scales to hierarchical mixtures of >340k components
• Toolkit, datasets and scripts available: http://sourceforge.net/projects/sgmweka/
Sparse Inference
• Sparse Inference (Puurula, 2012)
• Use inverted indices to reduce the complexity of computing the joint likelihood for a given test document
• Instead of computing the scores as dense dot products, compute a shared base score and update it for each non-zero test word using the inverted index
• Reduces joint inference time complexity from dense to sparse in the non-zero features
Sparse Inference 2
• Dense representation: time complexity proportional to (number of classes) × (number of features)
Sparse Inference 3
• Sparse representation: time complexity proportional to the test document's non-zero words and the lengths of their inverted-index posting lists
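The slide's complexity formulas were lost; a hedged summary of the intended comparison, with assumed symbols ($M$ models, $N$ features, $I_w$ the posting list of word $w$):

```latex
% Dense: every class/document model scored against every feature
T_{\text{dense}} = O(M \cdot N), \quad M = \text{number of models},\; N = \text{number of features}

% Sparse: only the test document's non-zero words and their posting lists I_w are touched
T_{\text{sparse}} = O\Big(M + \sum_{w : W_w > 0} |I_w|\Big)
```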