Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars
Overview • Unambiguity Regularization • A novel approach for unsupervised natural language grammar learning • Based on the observation that natural language is remarkably unambiguous • Includes standard EM, Viterbi EM, and a new algorithm, softmax-EM, as special cases
Outline • Background • Motivation • Formulation and algorithms • Experimental results
Background • Unsupervised learning of probabilistic grammars • Learning a probabilistic grammar from unannotated sentences • Training corpus: A square is above the triangle. A triangle rolls. The square rolls. A triangle is above the square. A circle touches a square. …… • Probabilistic grammar induction yields, e.g.: S → NP VP; NP → Det N; VP → Vt NP (0.3) | Vi PP (0.2) | rolls (0.2) | bounces (0.1); ……
Background • Unsupervised learning of probabilistic grammars • Typically done by assuming a fixed set of grammar rules and optimizing the rule probabilities • Various kinds of prior information can be incorporated into the objective function to improve learning • e.g., rule sparsity, symbol correlation, etc. • Our approach: Unambiguity regularization • Utilizes a novel type of prior information: the unambiguity of natural languages
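For concreteness, a minimal sketch (not from the slides) of the object being learned in this setting: a fixed rule set whose per-left-hand-side probabilities are the parameters to optimize. The dictionary below simply mirrors the toy grammar in the Background slide; the representation itself is an illustrative assumption.

```python
# Toy rule-probability table mirroring the Background illustration.
# The rule set is fixed; only the probabilities are adjusted during
# unsupervised learning, and for each left-hand side they form a
# distribution over its expansions ("……" on the slide hides the rest).
toy_grammar = {
    "S":  {("NP", "VP"): 1.0},
    "NP": {("Det", "N"): 1.0},
    "VP": {("Vt", "NP"): 0.3, ("Vi", "PP"): 0.2, ("rolls",): 0.2, ("bounces",): 0.1},  # plus elided rules
}

def is_normalized(grammar, tol=1e-9):
    """Check that each left-hand side's probabilities sum to at most 1
    (exactly 1 once the elided rules are included)."""
    return all(sum(expansions.values()) <= 1.0 + tol for expansions in grammar.values())
```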
The Ambiguity of Natural Language • Ambiguities are ubiquitous in natural languages • NL sentences can often be parsed in more than one way • Example [Manning and Schütze (1999)]: "The post office will hold out discounts and service concessions as incentives." (Noun or verb? Does the phrase modify "hold out" or "concessions"?) • Given a complete CNF grammar with 26 nonterminals, the total number of possible parses of this sentence is astronomically large
The Unambiguity of Natural Language • Although each NL sentence has a large number of possible parses, the probability mass is concentrated on a very small number of parses
Comparison with non-NL grammars • [Figure comparing an NL grammar, a max-likelihood grammar learned by EM, and a random grammar]
Incorporate Unambiguity Bias into Learning • How to measure the ambiguity • Entropy of the parse given the sentence and the grammar • How to add it into the objective function • Use a prior distribution that prefers low ambiguity → but this makes learning intractable
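As a concrete reference for the measure just named, a minimal sketch in assumed notation (x a sentence, z its parse, θ the grammar parameters): the ambiguity of x is the conditional entropy of its parse,

\[
H(z \mid x, \theta) \;=\; -\sum_{z} p(z \mid x, \theta)\,\log p(z \mid x, \theta),
\]

which is zero when a single parse carries all of the probability mass and grows as the mass spreads over many parses.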
Incorporate Unambiguity Bias into Learning • How to measure the ambiguity • Entropy of the parse given the sentence and the grammar • How to add it into the objective function • Use posterior regularization [Ganchev et al. (2010)] • The regularized objective (displayed on the slide) combines: the log posterior of the grammar given the training sentences; the KL-divergence between an auxiliary distribution q and the posterior distribution of the parses; and the entropy of the parses based on q, weighted by a constant that controls the strength of regularization
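A hedged reconstruction of the regularized objective in the posterior-regularization form just described (notation assumed: X the training sentences, Z their parses, θ the grammar, q the auxiliary distribution over parses, σ ≥ 0 the regularization constant):

\[
F(q, \theta) \;=\; \log p(\theta \mid X) \;-\; \mathrm{KL}\!\big(q(Z)\,\big\|\,p(Z \mid X, \theta)\big) \;-\; \sigma\, H\big(q(Z)\big),
\]

maximized jointly over q and θ; the last term penalizes ambiguous (high-entropy) parse distributions.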
Optimization • Coordinate ascent over the grammar parameters and the auxiliary distribution q • Fix q and optimize the grammar: exactly the M-step of EM • Fix the grammar and optimize q: the solution depends on the value of σ; this is where standard EM, Viterbi EM, and softmax-EM arise as special cases (see the sketch below)
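A sketch of how the q-update is usually written for an objective of the reconstructed form above; the three regimes are what make standard EM, softmax-EM, and Viterbi EM special cases (the case boundaries follow from that reconstruction and are stated here as assumptions):

\[
q(z) \;\propto\;
\begin{cases}
p(z \mid x, \theta) & \sigma = 0 \quad \text{(standard E-step)}\\[4pt]
p(z \mid x, \theta)^{\,1/(1-\sigma)} & 0 < \sigma < 1 \quad \text{(softmax-EM)}\\[4pt]
\mathbb{1}\big[z = \arg\max_{z'} p(z' \mid x, \theta)\big] & \sigma \ge 1 \quad \text{(Viterbi EM)}
\end{cases}
\]

The exponent 1/(1−σ) > 1 sharpens the posterior toward its mode, which is where the name softmax-EM comes from.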
Softmax-EM • Implementation • Simply exponentiate all the grammar rule probabilities before the E-step of EM • Does not increase the computational complexity of the E-step
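A minimal Python sketch of the trick just described, under the softmax-EM reading that rule probabilities are raised to a power γ = 1/(1−σ) > 1; because a parse's probability is a product of rule probabilities, this exponentiates every parse score without changing the E-step itself. The grammar representation and the helper names in the usage comment are hypothetical.

```python
def exponentiate_rules(rule_probs, sigma):
    """Raise every rule probability to gamma = 1/(1 - sigma) before the E-step.

    rule_probs: {lhs: {rhs: probability}}, as in the toy grammar sketch above.
    A parse's probability is a product of rule probabilities, so exponentiating
    the rules exponentiates every parse score by the same power; the E-step
    (e.g., inside-outside) then runs unchanged on the sharpened scores.
    """
    assert 0.0 <= sigma < 1.0, "softmax-EM regime (sigma >= 1 degenerates to Viterbi EM)"
    gamma = 1.0 / (1.0 - sigma)
    return {
        lhs: {rhs: prob ** gamma for rhs, prob in rhs_probs.items()}
        for lhs, rhs_probs in rule_probs.items()
    }

# Hypothetical usage inside one EM iteration:
#   sharpened = exponentiate_rules(grammar_probs, sigma=0.5)
#   counts = e_step(corpus, sharpened)   # ordinary E-step on sharpened rules
#   grammar_probs = m_step(counts)       # ordinary M-step (renormalization)
```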
The value of σ • Choosing a fixed value of σ • Too small: not enough to induce unambiguity • Too large: the learned grammar might be excessively unambiguous • Annealing • Start with a large value of σ • Strongly pushes the learner away from the highly ambiguous initial grammar • Gradually reduce the value of σ • Avoids inducing excessive unambiguity
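A toy sketch of an annealing schedule matching the description above; the decay from 1 to 0 over 100 iterations appears later in the experiments, but the linear form is an illustrative assumption, not necessarily the slides' exact schedule.

```python
def annealed_sigma(iteration, total_iterations=100, start=1.0, end=0.0):
    """Linearly decay the regularization strength sigma over EM iterations.

    A large initial sigma pushes the learner away from the highly ambiguous
    initial grammar; decaying toward zero avoids forcing excessive unambiguity.
    (While sigma >= 1 the update behaves like Viterbi EM; below 1, softmax-EM.)
    """
    t = min(iteration, total_iterations) / float(total_iterations)
    return start + t * (end - start)

# e.g. sigma = annealed_sigma(it) at the start of EM iteration `it`
```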
Mean-field Variational Inference • So far: maximum a posteriori estimation (MAP) • Variational inference approximates the posterior of the grammar • Leads to more accurate predictions than MAP • Can accommodate prior distributions that MAP cannot • We have also derived a mean-field variational inference version of unambiguity regularization • Very similar to the derivation of the MAP version
Experiments • Unsupervised learning of the dependency model with valence (DMV) [Klein and Manning, 2004] • Data: WSJ (sect 2-21 for training, sect 23 for testing) • Trained on the gold-standard POS tags of the sentences of length ≤ 10 with punctuation stripped off
Experiments with Different Values of σ • Viterbi EM leads to high accuracy on short sentences • Softmax-EM leads to the best accuracy over all sentences
Experiments with Annealing and Prior • Annealing the value of σ from 1 to 0 over 100 iterations • Adding Dirichlet priors over rule probabilities using variational inference • Compared with the best results previously published for learning DMV
Experiments on Extended Models • Applying unambiguity regularization on E-DMV, an extension of DMV [Gillenwater et al., 2010] • Compared with the best results previously published for learning extended dependency models
Experiments on More Languages • Examining the effect of unambiguity regularization with the DMV model on corpora of eight additional languages • Unambiguity regularization improves learning on eight of the nine languages, though with different optimal values of σ • Annealing the value of σ leads to better average performance than using any fixed value
Related Work • Some previous work also manipulates the entropy of hidden variables • Deterministic annealing [Rose, 1998; Smith and Eisner, 2004] • Minimum entropy regularization [Grandvalet and Bengio, 2005; Smith and Eisner, 2007] • Unambiguity regularization differs from them in • Motivation: the unambiguity of NL grammars • Algorithm: a simple extension of EM, with an exponent > 1 in the E-step and an exponent that decreases during annealing
Conclusion • Unambiguity regularization • Motivation • The unambiguity of natural languages • Formulation • Regularize the entropy of the parses of training sentences • Algorithms • Standard EM, Viterbi EM, softmax-EM • Annealing the value of σ • Experiments • Unambiguity regularization is beneficial to learning • By incorporating annealing, it outperforms the current state-of-the-art
Thank you! Q&A