Explore the importance of vocabulary selection in text classification using a variational approach to identify the minimum vocabulary needed for accuracy. Learn about different algorithms, optimization methods, and experimental results.
How Large A Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection Wenhu Chen, Yu Su, Xifeng Yan, William Wang UC Santa Barbara
Background • Deep learning approaches require a predefined vocabulary to vectorize text into continuous representations via an embedding. • The most widely adopted selection method is frequency-based: keep the most frequent words. • It is simple but agnostic to the end task. • It can yield an under-sized or over-sized vocabulary. • Is this naïve algorithm the optimal solution?
Questions • Q1: How important a role does the vocabulary selection algorithm play in text classification? • Q2: How to select the minimum vocabulary needed to maintain a specific accuracy in text classification?
Q1: How important is vocabulary selection? • Fix a vocabulary budget smaller than the full vocabulary. • Randomly sample different vocabulary combinations of that size and evaluate their classification accuracy. • Observe the range of variation (the gap) in classification accuracy.
Q1: How important is vocabulary selection? • At each vocabulary budget, simulate 100 different combinations. • At a vocabulary size of 30, accuracy across combinations ranges from 33.2% to 80.1%. • At a vocabulary size of 5000, it ranges from 89.5% to 90.1%.
Q1: How important is vocabulary selection? • With a large vocabulary budget, different selection algorithms do not make a dramatic difference. • With a restricted vocabulary budget, the accuracy gap between selection algorithms is significant. • Vocabulary selection is an important research problem in memory-constrained cases.
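The sampling experiment described on these slides can be sketched as follows. This is a minimal illustration, not the authors' code: `accuracy_range`, `toy_evaluate`, and the toy vocabulary are hypothetical, and `evaluate` stands in for training and scoring a real classifier restricted to the sampled vocabulary.

```python
import random
from typing import Callable, Sequence, Tuple


def accuracy_range(
    full_vocab: Sequence[str],
    budget: int,
    evaluate: Callable[[Sequence[str]], float],
    n_trials: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (min, max) accuracy over random vocabularies of size `budget`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        # Draw `budget` distinct words from the full vocabulary.
        subset = rng.sample(list(full_vocab), budget)
        scores.append(evaluate(subset))
    return min(scores), max(scores)


def toy_evaluate(subset: Sequence[str]) -> float:
    """Hypothetical scorer: accuracy grows with the number of informative words."""
    informative = {"good", "bad", "great", "awful"}
    return 0.5 + 0.1 * len(informative & set(subset))


toy_vocab = ["good", "bad", "great", "awful", "the", "of", "a", "and", "to", "in"]
low, high = accuracy_range(toy_vocab, budget=3, evaluate=toy_evaluate)
print(f"accuracy range at budget 3: {low:.2f} to {high:.2f}")
```

The gap between `low` and `high` at a given budget is the quantity the slides use to argue that selection matters most when the budget is small.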
Q2: How to select the vocabulary subset? • Setup: a classification network with a full embedding matrix. • Formulate selection as a constrained optimization problem. • Goal: find the smallest subset of the embedding matrix whose accuracy drop stays within a tolerable bound.
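One way to write this constrained problem is given below. The slide's own formula did not survive extraction, so the notation is an assumption: $W \in \mathbb{R}^{|V| \times D}$ is the full embedding matrix, $\hat{W}$ is a subset of its rows, $f_\theta$ is the classification network, $\mathrm{Acc}$ is accuracy on labeled data $(x, y)$, and $\epsilon$ is the tolerable accuracy drop.

```latex
\begin{aligned}
\hat{W}^{*} = \arg\min_{\hat{W} \subseteq W} \;& \#\mathrm{Row}(\hat{W}) \\
\text{s.t.}\;\; & \mathrm{Acc}\big(f_\theta(x; W), y\big) - \mathrm{Acc}\big(f_\theta(x; \hat{W}), y\big) \le \epsilon
\end{aligned}
```

Both the row count and the accuracy are discrete quantities, which is why a relaxation is needed before gradient-based training can be used.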
Q2: How to evaluate the selection algorithm? • Draw Vocabulary-Accuracy Curve. • Area Under the Curve (larger is better). • Vocabulary Under X% Accuracy Drop (smaller is better).
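The two metrics above could be computed from a list of (vocabulary size, accuracy) points roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the area uses a linear size axis normalized by the size range (the original evaluation may scale the axis differently), and the numbers in the demo are made up.

```python
from typing import Optional, Sequence


def area_under_curve(vocab_sizes: Sequence[int], accuracies: Sequence[float]) -> float:
    """Trapezoidal area under the accuracy curve, divided by the size range."""
    pts = sorted(zip(vocab_sizes, accuracies))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area / (pts[-1][0] - pts[0][0])


def vocab_under_drop(
    vocab_sizes: Sequence[int],
    accuracies: Sequence[float],
    full_accuracy: float,
    max_drop: float,
) -> Optional[int]:
    """Smallest evaluated size whose accuracy is within `max_drop` of the full accuracy."""
    ok = [v for v, a in zip(vocab_sizes, accuracies) if full_accuracy - a <= max_drop]
    return min(ok) if ok else None


# Made-up curve for illustration only.
sizes = [100, 1000, 10000]
accs = [0.80, 0.88, 0.90]
auc = area_under_curve(sizes, accs)
smallest = vocab_under_drop(sizes, accs, full_accuracy=0.90, max_drop=0.03)
print(auc, smallest)
```

A larger area means accuracy is retained across more budgets; a smaller vocabulary under a fixed drop means the method compresses further for the same cost in accuracy.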
Re-interpret the optimization problem • Reinterpretation: we associate a dropout probability with each row of the embedding matrix W. • Neither the objective nor the constraints are differentiable, so standard gradient-based optimization does not apply directly. • The dropout probability reflects the importance of the given word in the classification task.
Bernoulli Dropout • Problem definition: a dropout mask over the rows of the embedding matrix. • Training objective: the marginal log-likelihood. • Computing it requires enumerating all mask combinations, which is intractable. • Bernoulli Monte-Carlo sampling has large variance.
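The marginal log-likelihood can be written out as below. The slide's formula was lost in extraction, so this notation is an assumption: $b$ is a binary mask with one entry per word, $p_i$ is the dropout probability of word $i$, $W$ is the embedding matrix, and $b \odot W$ scales row $i$ of $W$ by $b_i$.

```latex
\log p(y \mid x) = \log \sum_{b \in \{0,1\}^{|V|}} p(b)\, p\big(y \mid x;\, b \odot W, \theta\big),
\qquad b_i \sim \mathrm{Bernoulli}(1 - p_i)
```

The sum ranges over $2^{|V|}$ masks, which is what makes exact enumeration intractable for any realistic vocabulary.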
Gaussian Approximation (Bayesian Neural Network) • Gaussian approximation (Wang et al.). • Lower bound of the marginal likelihood: a reconstruction objective and a KL-divergence term. • Gaussian reparameterization decreases variance in training. Reference: Wang et al. Fast Dropout Training. In ICML 2013.
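A common way to realize such an approximation is to replace the scaled Bernoulli mask with a Gaussian of matching mean and variance, sampled through reparameterization. The sketch below checks that moment matching numerically; the function names are hypothetical, and whether the talk uses exactly this parameterization is an assumption.

```python
import math
import random
import statistics


def bernoulli_mask(p_drop: float, rng: random.Random) -> float:
    """Scaled dropout mask: 0 with probability p_drop, else 1 / (1 - p_drop)."""
    return 0.0 if rng.random() < p_drop else 1.0 / (1.0 - p_drop)


def gaussian_mask(p_drop: float, rng: random.Random) -> float:
    """Reparameterized Gaussian mask with mean 1 and variance p / (1 - p)."""
    alpha = p_drop / (1.0 - p_drop)
    eps = rng.gauss(0.0, 1.0)  # noise drawn independently of p_drop
    return 1.0 + math.sqrt(alpha) * eps


rng = random.Random(0)
p = 0.2  # variance p / (1 - p) = 0.25
bern = [bernoulli_mask(p, rng) for _ in range(50_000)]
gaus = [gaussian_mask(p, rng) for _ in range(50_000)]
bern_mean, bern_var = statistics.fmean(bern), statistics.pvariance(bern)
gaus_mean, gaus_var = statistics.fmean(gaus), statistics.pvariance(gaus)
print(bern_mean, bern_var, gaus_mean, gaus_var)
```

Because the Gaussian sample is a deterministic function of `p_drop` and independent noise, gradients can flow to the dropout parameter, which is the usual motivation for reparameterization.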
Variational Dropout: Sparsity • After training, we obtain the dropout probability associated with each word. • We set a threshold on this probability to retain a vocabulary subset and evaluate its accuracy. • By varying the threshold, we can draw the vocabulary-accuracy curve and evaluate it with the proposed metrics.
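The thresholding step can be sketched as follows. The function name and the dropout probabilities are hypothetical and chosen only for illustration; in practice the probabilities would come from the trained model.

```python
from typing import Dict, List


def select_vocabulary(drop_prob: Dict[str, float], threshold: float) -> List[str]:
    """Keep words whose learned dropout probability is below `threshold`."""
    return sorted(w for w, p in drop_prob.items() if p < threshold)


# Hypothetical learned dropout probabilities (not real results).
drop_prob = {"schedule": 0.05, "movie": 0.10, "theatre": 0.30, "the": 0.95, "of": 0.97}

# Sweeping the threshold yields vocabularies of different sizes,
# one point on the vocabulary-accuracy curve per threshold.
sizes_by_threshold = {t: len(select_vocabulary(drop_prob, t)) for t in (0.2, 0.5, 0.99)}
print(sizes_by_threshold)
```

Each retained subset would then be evaluated to obtain its accuracy, giving the curve used by the metrics.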
Datasets & Baselines • Tasks: • Document Classification (AG-news, Yelp-review, DBpedia, …) • Spoken Language Understanding (ATIS, Snips) • Natural Language Inference (SNLI, MNLI) • Baselines: • Frequency Cutoff • TF-IDF • Group Lasso
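For reference, the frequency cutoff baseline amounts to keeping the most frequent tokens up to the budget. A minimal sketch, assuming lowercased whitespace tokenization and a hypothetical `frequency_cutoff` helper (the experiments' actual preprocessing is not given on the slides):

```python
from collections import Counter
from typing import Iterable, List


def frequency_cutoff(docs: Iterable[str], budget: int) -> List[str]:
    """Return the `budget` most frequent tokens across all documents."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return [word for word, _ in counts.most_common(budget)]


# Toy corpus for illustration only.
docs = ["the movie was great", "the movie was bad", "book the theatre"]
top3 = frequency_cutoff(docs, budget=3)
print(top3)
```

Note that the selection never looks at the labels, which is what the slides mean by the method being agnostic to the end task.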
Experimental Results • Frequency-based cut-off can shrink the vocabulary from 50k to 1k with a 3% accuracy drop. • Variational Dropout can shrink it even further, to 400.
Vocabulary-Accuracy Curve • Variational Dropout achieves better accuracy than the baselines across different vocabulary budgets. (Figure panels: AG-News, Yelp-Review.)
Variational Dropout: Visualization • The selection correlates strongly with the frequency-based method but does not entirely overlap with it. (Figure: most frequent words on the left, rarest words on the right.)
Variational Dropout: Visualization • The selected words on a voice assistant SLU dataset. • Important words: schedule, movie, neighborhood, theatre.
Comparison with Subword/Character-based • Pros: • Our method applies to a broader range of languages, including those that cannot be decomposed into characters/subwords. • Our method accelerates inference, while character-based methods reduce speed by increasing the input length. • Cons: • Our method suffers loss of information, which makes it not applicable to machine translation or summarization.
Takeaway Message • Frequency-based cut-off is a very strong selection mechanism under a reasonable vocabulary budget. • Vocabulary Dropout can achieve much better accuracy under a very low vocabulary budget. • Vocabulary Dropout can provide strong interpretability for the decision making by exposing the selected words.
Thanks! Code and Data: https://github.com/wenhuchen/Variational-Vocabulary-Selection