Self-training with Products of Latent Variable Grammars Zhongqiang Huang, Mary Harper, and Slav Petrov
Overview • Motivation and Prior Related Research • Experimental Setup • Results • Analysis • Conclusions
PCFG-LA Parser [Matsuzaki et al. ’05] [Petrov et al. ’06] [Petrov & Klein ’07]
[Figure: latent-variable grammar pipeline relating the sentence, its parse tree, the grammar parameters, and the latent derivations.]
PCFG-LA Parser • Hierarchical splitting (& merging) • Typical learning curve • Grammar order selection: use the development set
[Figure: learning curve; parsing accuracy vs. increased model complexity. The n-th grammar is the one trained after the n-th split-merge round.]
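In code, grammar order selection is just a sweep over split-merge rounds, keeping the grammar that scores best on the dev set. A minimal sketch, assuming hypothetical `train_split_merge` and `f_score` helpers in place of a real PCFG-LA trainer and evaluator:

```python
# Hypothetical sketch of grammar order selection; train_split_merge and
# f_score stand in for a real PCFG-LA trainer and an F-score evaluator.

def select_grammar_order(treebank, dev_set, max_rounds=7):
    """Pick the split-merge round whose grammar scores best on dev."""
    best_f, best_grammar = float("-inf"), None
    grammar = None
    for n in range(1, max_rounds + 1):
        # Each round splits every symbol's latent subcategories in two,
        # retrains with EM, then merges back the least useful splits.
        grammar = train_split_merge(treebank, rounds=1, init=grammar)
        f = f_score(grammar, dev_set)
        if f > best_f:
            best_f, best_grammar = f, grammar
    return best_grammar
```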
Max-Rule Decoding (Single Grammar) [Goodman ’98, Matsuzaki et al. ’05, Petrov & Klein ’07]
[Figure: example parse with S, NP, and VP nodes; the posterior of each coarse rule is obtained by summing over its latent annotations.]
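The core computation sums inside and outside scores over all latent subcategory assignments of a rule. A schematic sketch, assuming hypothetical `subcats`, chart tables `inside`/`outside`, a `rule_prob` table, and the sentence likelihood `Z` from a latent-variable chart parser:

```python
# Posterior of one coarse rule A -> B C over span (i, k, j), obtained by
# marginalizing out the latent subcategories x, y, z. All tables
# (inside, outside, rule_prob) and subcats() are assumed helpers.

def rule_posterior(A, B, C, i, k, j, inside, outside, rule_prob, Z):
    total = 0.0
    for x in subcats(A):
        for y in subcats(B):
            for z in subcats(C):
                total += (outside[A, x, i, j]
                          * rule_prob[(A, x), (B, y), (C, z)]
                          * inside[B, y, i, k]
                          * inside[C, z, k, j])
    return total / Z  # posterior probability of the unannotated rule
```

Max-rule decoding then searches, with a Viterbi-style dynamic program, for the tree that maximizes the product of these rule posteriors.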
Variability [Petrov, ’10]
[Figure: F scores of individual grammars trained with different random seeds vary noticeably from run to run.]
Max-Rule Decoding (Multiple Grammars) [Petrov, ’10]
[Figure: several grammars are trained from the treebank with different random seeds; their rule posteriors are combined during decoding.]
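In a product model, each candidate rule is scored by the product of its posteriors under the individual grammars, i.e. the sum of their log posteriors. A sketch, reusing the hypothetical single-grammar machinery from above:

```python
import math

# Product-model scoring sketch: combine one rule's posteriors across
# independently trained grammars. rule_posterior_under is an assumed
# helper that runs the single-grammar computation for grammar g.

def product_rule_score(rule, span, grammars):
    log_score = 0.0
    for g in grammars:
        p = rule_posterior_under(g, rule, span)
        log_score += math.log(max(p, 1e-300))  # guard against underflow
    return log_score  # maximized over trees by the same dynamic program
```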
Product Model Results [Petrov, ’10]
[Figure: F scores of the product model compared with the individual grammars.]
Self-training (ST)
[Figure: a model is trained on hand-labeled data (with the grammar order selected on the dev set), used to label unlabeled data, and a new model is then trained on the automatically labeled data.]
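The loop itself is simple. A minimal sketch, with `train`, `select_with_dev`, and `parse` as hypothetical stand-ins for the trainer, dev-set model selection, and the parser used to label raw text:

```python
# Minimal self-training sketch; train, select_with_dev, and parse are
# hypothetical stand-ins, not the actual training pipeline.

def self_train(hand_labeled, unlabeled, dev_set):
    # 1. Train on the hand-labeled treebank; pick the grammar order on dev.
    base = select_with_dev(train(hand_labeled), dev_set)
    # 2. Label the raw text with the base model.
    auto_labeled = [parse(base, sentence) for sentence in unlabeled]
    # 3. Retrain on hand-labeled plus automatically labeled trees.
    return train(hand_labeled + auto_labeled)
```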
WSJ Self-Training Results [Huang & Harper, ’09]
[Figure: F scores of self-trained grammars on WSJ across split-merge rounds.]
Self-Trained Grammar Variability
[Figure: two panels, self-trained round 6 and self-trained round 7, showing the spread across individual self-trained grammars.]
Summary • Two issues: Variability & Over-fitting • Product model • Makes use of variability • Over-fitting remains in individual grammars • Self-training • Alleviates over-fitting • Variability remains in individual grammars • Next step: combine self-training with product models
Experimental Setup • Two genres: • WSJ: Sections 2-21 for training, Section 22 for dev, Section 23 for test; 176.9K sentences per self-trained grammar • Broadcast News: WSJ + 80% of BN for training, 10% for dev, 10% for test (see paper) • Training scenarios: each trains 10 models with different seeds and combines them using max-rule decoding (see the sketch below) • Regular: treebank training with up to 7 split-merge iterations • Self-training: three methods with up to 7 split-merge iterations
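A schematic of the train-and-combine scenario, assuming hypothetical `train_grammar` and `max_rule_decode` helpers:

```python
import random

# Sketch of the training scenario: 10 grammars trained with different
# random seeds, then combined by max-rule decoding at parse time.
# train_grammar and max_rule_decode are assumed helpers.

def train_models(treebank, n_models=10, rounds=7):
    grammars = []
    for seed in range(n_models):
        random.seed(seed)  # different EM initialization for each model
        grammars.append(train_grammar(treebank, split_merge_rounds=rounds))
    return grammars

def parse_with_product(grammars, sentence):
    return max_rule_decode(grammars, sentence)
```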
ST-Reg • Multiple grammars?
[Figure: hand-labeled data plus a single automatically labeled set (labeled by the round-6 product, selected with the dev set) are used to train multiple grammars, which are combined in a product.]
ST-Prod • Use more data?
[Figure: multiple grammars trained on the hand-labeled data are combined in a product, which labels the unlabeled data; the resulting single automatically labeled set (from the round-6 product) then trains multiple grammars that are again combined in a product.]
ST-Prod-Mult
[Figure: the round-6 product produces 10 different automatically labeled sets; each set trains one grammar, and the 10 grammars are combined in a product.]
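The three variants differ only in which model labels the unlabeled data and how many automatically labeled sets are produced. A schematic contrast, under the assumption that each `train` call uses its own random seed and that all helpers (`train`, `label_with`, `product_of`) are hypothetical:

```python
# Schematic contrast of the three self-training setups; every helper
# here is hypothetical, and each train() call gets its own random seed.

def st_reg(hand, unlabeled, base):
    auto = label_with(base, unlabeled)               # one auto-labeled set
    return [train(hand + auto) for _ in range(10)]   # 10 grammars -> product

def st_prod(hand, unlabeled):
    product = product_of([train(hand) for _ in range(10)])
    auto = label_with(product, unlabeled)            # labeled by a product
    return [train(hand + auto) for _ in range(10)]

def st_prod_mult(hand, unlabeled_sets):
    product = product_of([train(hand) for _ in range(10)])
    # 10 different auto-labeled sets -> 10 more diverse grammars
    return [train(hand + label_with(product, u)) for u in unlabeled_sets]
```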
Analysis of Rule Variance • To quantify the diversity among the learned grammars, we measure the average empirical variance of the rules' log posterior probabilities across the grammars, over a held-out set S (see the formula below).
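A reconstruction of the formula in LaTeX, transcribing the prose directly: with $n$ grammars, rules $r$ in the held-out set $S$, and $p_i(r)$ the posterior probability of rule $r$ under grammar $i$ (the exact normalization in the paper may differ):

```latex
\mathrm{Var} \;=\; \frac{1}{|S|} \sum_{r \in S} \frac{1}{n-1}
  \sum_{i=1}^{n} \Bigl( \log p_i(r) - \overline{\log p(r)} \Bigr)^{2},
\qquad
\overline{\log p(r)} \;=\; \frac{1}{n} \sum_{j=1}^{n} \log p_j(r)
```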
English Test Set Results (WSJ 23)
[Table: F scores on WSJ Section 23 grouped by system type. Single parsers: [Charniak ’00], [Petrov et al. ’06], [Carreras et al. ’08]; rerankers: [Charniak & Johnson ’05], [Huang ’08], [McClosky et al. ’06]; parser combinations: [Sagae & Lavie ’06], [Fossum & Knight ’09], [Zhang et al. ’09]; products: [Petrov ’10] and This Work; This Work also appears as a single self-trained grammar alongside [Huang & Harper ’08].]
Conclusions • Very high parse accuracies can be achieved by combining self-training and product models on newswire and broadcast news parsing tasks. • Two important factors: • Accuracy of the model used to parse the unlabeled data • Diversity of the individual grammars