
Wenhu Chen, Yu Su, Xifeng Yan, William Wang (UC Santa Barbara)

Explore the importance of vocabulary selection in text classification using a variational approach to identify the minimum vocabulary needed to maintain accuracy. Learn about different selection algorithms, optimization methods, and experimental results.



Presentation Transcript


  1. How Large A Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection  Wenhu Chen, Yu Su, Xifeng Yan, William Wang (UC Santa Barbara)

  2. Background • In deep learning based approaches, we need to predefine a vocabulary to vectorize the text into continuous representations using an embedding. • The widely adopted method is the frequency-based cutoff (a minimal sketch follows below). • Simple, but agnostic to the end task. • It can produce an under-sized or over-sized vocabulary. • Is this naïve algorithm the optimal solution?
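For concreteness, here is a minimal sketch of the frequency-based cutoff described above; the function name and special tokens are illustrative assumptions, not the authors' code.

```python
from collections import Counter

def frequency_cutoff_vocab(tokenized_docs, budget, specials=("<pad>", "<unk>")):
    """Keep the `budget` most frequent tokens; everything else maps to <unk>."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [tok for tok, _ in counts.most_common(budget)]
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}

# e.g. vocab = frequency_cutoff_vocab(train_docs, budget=10000)
```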

  3. Questions • Q1: How important a role does the vocabulary selection algorithm play in text classification? • Q2: How to select the minimum vocabulary needed to maintain a specific accuracy in text classification?

  4. Q1: How important is vocabulary selection? • Fix a vocabulary budget drawn from the full vocabulary. • Randomly sample different vocabulary combinations of that budget size and evaluate their classification accuracy (see the sketch below). • Observe the variation range (gap) in classification accuracy.
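A hedged sketch of this sampling experiment; `train_and_evaluate` is an assumed helper standing in for whatever classifier and training loop the authors actually used.

```python
import random

def accuracy_gap_at_budget(full_vocab, budget, train_and_evaluate, n_trials=100):
    """Sample `n_trials` random vocabularies of size `budget` and report the
    spread of classification accuracy they produce."""
    accuracies = []
    for _ in range(n_trials):
        subset = random.sample(full_vocab, budget)
        accuracies.append(train_and_evaluate(subset))  # accuracy of a model restricted to `subset`
    return min(accuracies), max(accuracies)
```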

  5. Q1: How important is vocabulary selection? • At each vocabulary budget, simulate 100 different combinations. • At a vocabulary budget of 30, accuracy across combinations ranges from 33.2% to 80.1%. • At a vocabulary budget of 5000, accuracy ranges from 89.5% to 90.1%.

  6. Q1: How important is vocabulary selection? • With a large vocabulary budget, different selection algorithms do not make a dramatic difference. • With a restricted vocabulary budget, the accuracy gap between selection algorithms is significant. • Vocabulary selection is an important research problem in memory-constrained settings.

  7. Q2: How to select the vocabulary subset? • Assume a classification network with a full embedding matrix W. • Constrained Optimization (a possible formulation is sketched below). • Goal: find the smallest subset embedding whose accuracy drop stays within a tolerable threshold.
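One plausible way to write the constrained objective; the symbols V, V', W_{V'} and the tolerance ε are assumed notation, not copied from the slides.

```latex
\min_{V' \subseteq V} \; |V'|
\quad \text{s.t.} \quad
\operatorname{Acc}\!\left(f_{\theta},\, W_{V'}\right) \;\ge\; \operatorname{Acc}\!\left(f_{\theta},\, W_{V}\right) - \epsilon
```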

  8. Q2: How to evaluate the selection algorithm? • Draw Vocabulary-Accuracy Curve. • Area Under the Curve (larger is better). • Vocabulary Under X% Accuracy Drop (smaller is better).
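A small sketch of how the two metrics might be computed from a measured curve; normalising the vocabulary axis before integrating is an assumption about how the AUC is made comparable across datasets.

```python
import numpy as np

def vocab_accuracy_metrics(vocab_sizes, accuracies, full_accuracy, drop_pct=3.0):
    """Area under the vocabulary-accuracy curve and the smallest vocabulary
    that stays within `drop_pct` accuracy of the full-vocabulary model."""
    sizes = np.asarray(vocab_sizes, dtype=float)   # ascending vocabulary sizes
    accs = np.asarray(accuracies, dtype=float)     # accuracy at each size
    auc = np.trapz(accs, sizes / sizes.max())      # larger is better
    ok = accs >= full_accuracy - drop_pct
    vocab_at_drop = int(sizes[ok].min()) if ok.any() else None  # smaller is better
    return auc, vocab_at_drop
```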

  9. Re-interpret the optimization problem • Reinterpretation: we associate a dropout probability with each row of the embedding matrix W. • Neither the objective nor the constraints are differentiable, so the standard gradient-based approach does not apply. • The dropout probability reflects the importance of the given word for the classification task.

  10. Bernoulli Dropout • Problem Definition: a Bernoulli dropout mask over the vocabulary rows. • Training Objective (Marginal Log-Likelihood): marginalize over the mask (see the reconstruction below). • Enumerating all mask combinations is intractable. • Bernoulli Monte-Carlo sampling has large variance.
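A hedged reconstruction of the marginal log-likelihood; the mask z, the per-word drop probability p_i, and the notation diag(z) W are assumptions filling in equations that did not survive extraction.

```latex
\log p(y \mid x)
\;=\;
\log \sum_{z \in \{0,1\}^{|V|}}
\Big( \prod_{i=1}^{|V|} \operatorname{Bern}\!\left(z_i \mid 1 - p_i\right) \Big)\,
p\!\left(y \mid x,\, \operatorname{diag}(z)\, W\right)
```

The sum runs over all 2^|V| mask configurations, which is why direct enumeration is intractable.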

  11. Gaussian Approximation (Bayesian Neural Network) • Gaussian Approximation (Wang et al.): approximate the Bernoulli mask with multiplicative Gaussian noise. • Lower Bound of the Marginal Likelihood: a reconstruction objective plus a KL-divergence term. • Gaussian Reparameterization decreases variance in training. Wang et al. Fast Dropout Training. In ICML 2013.
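A hedged PyTorch sketch of variational dropout over embedding rows with Gaussian reparameterisation. The class name and initialisation are assumptions, and the KL approximation to the log-uniform prior follows Molchanov et al. (2017) rather than anything shown on the slides.

```python
import torch
import torch.nn as nn

class VariationalVocabEmbedding(nn.Module):
    """Embedding layer with multiplicative Gaussian dropout noise per vocabulary row."""

    def __init__(self, vocab_size, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, dim) * 0.01)
        self.log_alpha = nn.Parameter(torch.full((vocab_size, 1), -3.0))  # per-word noise variance

    def forward(self, token_ids):
        emb = self.weight[token_ids]
        if self.training:
            alpha = self.log_alpha[token_ids].exp()
            # Reparameterisation: one multiplicative noise sample per word occurrence.
            noise = 1.0 + alpha.sqrt() * torch.randn_like(alpha)
            emb = emb * noise
        return emb

    def kl(self):
        # Approximate KL divergence to the log-uniform prior (Molchanov et al., 2017).
        la = self.log_alpha
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * nn.functional.softplus(-la) - k1
        return -neg_kl.sum()

    def dropout_prob(self):
        alpha = self.log_alpha.exp().squeeze(-1)
        return alpha / (1.0 + alpha)  # p_i near 1 means the word can be dropped
```

The training loss would then be the classification loss plus a (possibly annealed) weight on kl(), matching the reconstruction-plus-KL lower bound on the slide.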

  12. Variational Dropout: Sparsity • After training, we obtain the dropout probability associated with each word. • We threshold these probabilities to retain a vocabulary subset and evaluate its accuracy. • By sweeping the threshold, we can draw the vocabulary-accuracy curve and evaluate performance using the proposed metrics.
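A minimal sketch of the thresholding step; the threshold value and helper names are assumptions.

```python
def select_vocabulary(dropout_probs, id_to_word, threshold=0.95):
    """Keep the words whose learned dropout probability stays below `threshold`."""
    return [id_to_word[i] for i, p in enumerate(dropout_probs) if p < threshold]

# Sweeping the threshold yields vocabularies of different sizes, giving one
# point on the vocabulary-accuracy curve per threshold value.
```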

  13. Datasets & Baselines • Tasks: • Document Classification (AG-News, Yelp-Review, DBpedia, …) • Spoken Language Understanding (ATIS, Snips) • Natural Language Inference (SNLI, MNLI) • Baselines: • Frequency Cutoff • TF-IDF • Group Lasso

  14. Experimental Results • Frequency-based cut-off can shrink a 50k vocabulary to 1k with a 3% accuracy drop. • Variational Dropout can shrink it even further, to 400.

  15. Vocabulary-Accuracy Curve • Variational Dropout achieves better accuracy than the baselines across different vocabulary budgets. (Figure panels: AG-News, Yelp-Review.)

  16. Variational Dropout: Visualization • The selection has a strong correlation with the frequency-based method, but does not entirely overlap with it. (In the visualization, the most frequent words appear on the left and the rarest on the right.)

  17. Variational Dropout: Visualization • The selected words on a voice-assistant SLU dataset. • Important words: schedule, movie, neighborhood, theatre.

  18. Comparison with Subword/Character-based Methods • Pros: • Our method can be applied to a broader range of languages whose words cannot be decomposed into characters/subwords. • Our method accelerates inference, while character-based methods slow it down by increasing the input length. • Cons: • Our method loses information, which makes it unsuitable for machine translation or summarization.

  19. Takeaway Message • The frequency-based cut-off method is a very strong selection mechanism under a reasonable vocabulary budget. • Variational Dropout can achieve much better accuracy under a very low vocabulary budget. • Variational Dropout provides strong interpretability for the decision making by exposing the selected words.

  20. Thanks! Code and Data: https://github.com/wenhuchen/Variational-Vocabulary-Selection
