Wenhu Chen, Yu Su, Xifeng Yan, William Wang (UC Santa Barbara)

How Large A Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection Wenhu Chen, Yu Su, Xifeng Yan, William Wang UC Santa Barbara

Background • Deep-learning-based approaches require a predefined vocabulary to vectorize text into continuous representations through embeddings. • The most widely adopted method is frequency-based selection. • It is simple but agnostic to the end task. • It can yield an under-sized or over-sized vocabulary. • Is this naïve algorithm the optimal solution?

Questions • Q1: How important a role does the vocabulary selection algorithm play in text classification? • Q2: How to select the minimum vocabulary needed to maintain a specific accuracy in text classification?

Q1: How important is vocabulary selection? • Fix a vocabulary budget smaller than the full vocabulary. • Randomly sample different vocabulary combinations of that size and evaluate their classification accuracy. • Observe the range of variation (the gap) in classification accuracy.

Q1: How important is vocabulary selection? • At each vocabulary budget, simulate 100 different combinations. • At a vocabulary of 30, accuracy across combinations ranges from 33.2% to 80.1%. • At a vocabulary of 5000, accuracy across combinations ranges from 89.5% to 90.1%.
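The sampling experiment above can be sketched on toy data. This is a minimal illustration, not the authors' setup: the synthetic bag-of-words data, the nearest-centroid classifier standing in for the neural network, and the names `make_toy_data` and `accuracy_range` are all assumptions made for the example.

```python
import numpy as np

def make_toy_data(n_docs, n_words=200, n_informative=10, seed=0):
    """Synthetic word counts (assumption): only the first few words carry label signal."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_docs)
    rates = np.ones((2, n_words))
    rates[0, :n_informative] = 3.0  # class 0 uses the informative words more often
    X = rng.poisson(rates[y]).astype(float)
    return X, y

def accuracy_range(X_train, y_train, X_test, y_test, budget, n_trials=100, seed=0):
    """Sample `n_trials` random vocabularies of size `budget`; return (min, max) test accuracy."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    accs = []
    for _ in range(n_trials):
        keep = rng.choice(X_train.shape[1], size=budget, replace=False)
        # Nearest-centroid classifier restricted to the sampled words
        # (a stand-in for the classification network).
        centroids = np.stack([X_train[y_train == c][:, keep].mean(axis=0) for c in classes])
        dists = ((X_test[:, keep][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        pred = classes[dists.argmin(axis=1)]
        accs.append(float((pred == y_test).mean()))
    return min(accs), max(accs)

X_tr, y_tr = make_toy_data(400, seed=1)
X_te, y_te = make_toy_data(200, seed=2)
for budget in (5, 150):
    lo, hi = accuracy_range(X_tr, y_tr, X_te, y_te, budget)
    print(f"budget={budget}: min acc={lo:.3f}, max acc={hi:.3f}")
```

Comparing the printed ranges for a small and a large budget mirrors the comparison on the slide: the question is how much the minimum and maximum differ at each budget.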

Q1: How important is vocabulary selection? • With a large vocabulary budget, different selection algorithms make little difference. • With a restricted vocabulary budget, the accuracy gap between selection algorithms is significant. • Vocabulary selection is an important research problem in memory-constrained settings.

Q2: How to select the vocabulary subset? • Assume a classification network and a full embedding matrix W. • Formulate selection as a constrained optimization problem. • Goal: find the smallest subset of the embedding whose accuracy drop stays within a tolerable limit.
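The formula on this slide did not survive extraction. One plausible way to write the stated goal is given below; the symbols are assumptions introduced here, not taken from the slide: $f_\theta$ for the classification network, $\hat{W}$ for the row subset of $W$, $\#\mathrm{Row}(\cdot)$ for the number of rows kept, $\mathrm{Acc}$ for accuracy, and $\epsilon$ for the tolerable drop.

```latex
\min_{\hat{W} \subseteq W} \; \#\mathrm{Row}(\hat{W})
\quad \text{s.t.} \quad
\mathrm{Acc}\big(f_\theta(x; W),\, y\big) - \mathrm{Acc}\big(f_\theta(x; \hat{W}),\, y\big) \le \epsilon
```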

Q2: How to evaluate the selection algorithm? • Draw Vocabulary-Accuracy Curve. • Area Under the Curve (larger is better). • Vocabulary Under X% Accuracy Drop (smaller is better).
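The two metrics can be sketched as follows. This is a minimal sketch under stated assumptions: the vocabulary axis is normalized linearly to [0, 1] before integrating (the slides do not say which scaling is used), the function names are hypothetical, and the curve points in the usage lines are made-up illustrative values, not results from the talk.

```python
import numpy as np

def area_under_curve(vocab_sizes, accuracies):
    """Trapezoidal area under the vocabulary-accuracy curve, vocabulary axis scaled to [0, 1]."""
    v = np.asarray(vocab_sizes, dtype=float)
    a = np.asarray(accuracies, dtype=float)
    order = np.argsort(v)
    v, a = v[order], a[order]
    x = (v - v[0]) / (v[-1] - v[0])  # linear normalization (assumption)
    return float(np.sum((x[1:] - x[:-1]) * (a[1:] + a[:-1]) / 2.0))

def vocab_at_drop(vocab_sizes, accuracies, full_accuracy, max_drop):
    """Smallest vocabulary size whose accuracy is within `max_drop` of the full-vocabulary accuracy."""
    ok = [v for v, a in zip(vocab_sizes, accuracies) if full_accuracy - a <= max_drop]
    return min(ok) if ok else None

# Illustrative, made-up curve points.
sizes = [100, 1000, 10000]
accs = [0.80, 0.88, 0.90]
print(area_under_curve(sizes, accs))
print(vocab_at_drop(sizes, accs, full_accuracy=0.90, max_drop=0.03))
```

A larger area means the method keeps accuracy high across budgets; a smaller value from `vocab_at_drop` means fewer words are needed to stay within the allowed drop.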

Re-interpret the optimization problem • Reinterpretation: associate a dropout probability with each row of the embedding matrix W. • Neither the objective nor the constraint of the original problem is differentiable, so standard gradient-based optimization does not apply to it. • The dropout probability reflects the importance of the given word in the classification task.

Bernoulli Dropout • Problem definition: a Bernoulli dropout mask over the rows of the embedding matrix. • Training objective: the marginal log-likelihood. • It requires enumerating all mask combinations, which is intractable. • Bernoulli Monte-Carlo sampling has large variance.
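The objective itself is missing from the extracted slide. A standard way to write a marginal log-likelihood over Bernoulli masks is sketched below; the notation is an assumption made here: $V$ is the vocabulary size, $b \in \{0,1\}^V$ is the row mask, $p_i$ is the dropout probability of word $i$, and $\theta$ denotes the remaining network parameters.

```latex
\log P(y \mid x)
  = \log \sum_{b \in \{0,1\}^{V}} P(b)\, P\big(y \mid x;\; b \odot W,\; \theta\big),
\qquad
P(b) = \prod_{i=1}^{V} p_i^{\,1 - b_i}\,(1 - p_i)^{\,b_i}
```

Written this way, the sum ranges over $2^{V}$ masks, which is why exact enumeration is intractable for any realistic vocabulary.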

Gaussian Approximation • Bayesian neural network. • Gaussian approximation of the dropout noise (Wang et al.). • Lower bound of the marginal likelihood: a reconstruction objective and a KL-divergence term. • Gaussian reparameterization decreases variance in training. Reference: Wang et al. Fast Dropout Training. In ICML 2013.
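The idea behind replacing Bernoulli noise with Gaussian noise can be illustrated by moment matching. This is a sketch of the common textbook form, in which a rescaled Bernoulli multiplier with drop probability p is approximated by a Gaussian with mean 1 and variance alpha = p / (1 - p); whether the talk uses exactly this parameterization is an assumption, and the function names are hypothetical.

```python
import numpy as np

def bernoulli_noise(p, size, rng):
    """Rescaled dropout multiplier: 0 with probability p, otherwise 1 / (1 - p)."""
    keep = rng.random(size) >= p
    return keep / (1.0 - p)

def gaussian_noise(p, size, rng):
    """Reparameterized Gaussian multiplier 1 + sqrt(alpha) * eps, with alpha = p / (1 - p)."""
    alpha = p / (1.0 - p)
    eps = rng.standard_normal(size)  # noise is sampled independently of the parameter p
    return 1.0 + np.sqrt(alpha) * eps

rng = np.random.default_rng(0)
p = 0.3
b = bernoulli_noise(p, 200_000, rng)
g = gaussian_noise(p, 200_000, rng)
print("means:", b.mean(), g.mean())
print("variances:", b.var(), g.var(), "target:", p / (1 - p))
```

Because the Gaussian sample is a deterministic, differentiable function of p and an independent noise draw, gradients with respect to p can flow through it, which is not possible with a discrete Bernoulli sample.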

Variational Dropout: Sparsity • After training, we obtain the dropout probability associated with each word. • We adjust a threshold on this probability to retain a vocabulary subset and evaluate its accuracy. • By changing the threshold, we can draw the vocabulary-accuracy curve and evaluate performance using the proposed metrics.
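The thresholding step can be sketched in a few lines. The assumptions here are that a word is retained when its dropout probability is below the threshold, that `select_vocabulary` is a hypothetical helper name, and that the word list and probabilities are made-up illustrative values rather than learned ones.

```python
import numpy as np

def select_vocabulary(words, dropout_probs, threshold):
    """Keep the words whose learned dropout probability is below `threshold`."""
    probs = np.asarray(dropout_probs, dtype=float)
    return [w for w, p in zip(words, probs) if p < threshold]

# Hypothetical words and dropout probabilities, for illustration only.
words = ["schedule", "movie", "the", "of", "theatre"]
probs = [0.05, 0.10, 0.95, 0.99, 0.20]

# Sweeping the threshold yields vocabularies of different sizes; evaluating the
# classifier on each one gives the points of a vocabulary-accuracy curve.
for t in (0.1, 0.5, 1.0):
    kept = select_vocabulary(words, probs, t)
    print(t, len(kept), kept)
```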

Datasets & Baselines • Tasks: • Document Classification (AG-News, Yelp-Review, DBpedia, …) • Spoken Language Understanding (ATIS, Snips) • Natural Language Inference (SNLI, MNLI) • Baselines: • Frequency Cutoff • TF-IDF • Group Lasso

Experimental Results • Frequency-based cutoff can shrink the vocabulary from 50k to 1k words with a 3% accuracy drop. • Variational Dropout can shrink it even further, to 400 words.

Vocabulary-Accuracy Curve • Variational Dropout achieves better accuracy than the baselines across different vocabulary budgets. [Figure panels: AG-News, Yelp-Review]

Variational Dropout: Visualization • The selection correlates strongly with the frequency-based method but does not entirely overlap with it. • In the figure, the most frequent words are on the left and the rarest words are on the right.

Variational Dropout: Visualization • The selected words on a voice-assistant SLU dataset. • Important words: schedule, movie, neighborhood, theatre.

Comparison with Subword/Character-based Methods • Pros: • Our method applies to a broader range of languages, including those that cannot be decomposed into characters/subwords. • Our method accelerates inference, while character-based methods reduce speed by increasing the input length. • Cons: • Our method suffers a loss of information, which makes it inapplicable to machine translation or summarization.

Takeaway Message • Frequency-based cutoff is a very strong selection mechanism under a reasonable vocabulary budget. • Variational Dropout can achieve much better accuracy under a very low vocabulary budget. • Variational Dropout can provide strong interpretability for the decision making by exposing the selected words.

Thanks! Code and Data: https://github.com/wenhuchen/Variational-Vocabulary-Selection
