Improved Inference for Unlexicalized Parsing

Slav Petrov and Dan Klein

Unlexicalized Parsing [Petrov et al. ‘06]
Hierarchical, adaptive refinement:
[Figure: DT split into DT1, DT2; then DT1 … DT4; then DT1 … DT8]
91.2 F1 score on Dev Set (1600 sentences)

Parsing time: 1621 min

Coarse-to-Fine Parsing [Goodman ‘97, Charniak & Johnson ‘05]
[Figure: Treebank → coarse grammar (NP … VP) → parse → prune → refined grammar → parse. Two refined grammars are shown: NP-apple, NP-dog, NP-cat, NP-eat, VP-run, … and NP-1, NP-12, NP-17, VP-6, VP-31, …]

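Prune?
For each chart item X[i,j], compute the posterior probability and prune the item if it falls below a threshold:
$\frac{\mathrm{IN}(X,i,j) \cdot \mathrm{OUT}(X,i,j)}{\mathrm{IN}(\mathrm{root},0,n)} < \text{threshold}$
E.g. consider the span 5 to 12:
[Figure: chart items for the span under the coarse grammar and under the refined grammar]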
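The pruning test can be sketched with a standard inside-outside pass. This is a minimal illustration, not the Berkeley Parser's code: the toy grammar, the sentence, the function names, and the threshold are all assumptions, and the grammar is restricted to binary and lexical rules.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (not taken from the slides).
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 1.0, ("VP", "VB", "NP"): 1.0}
LEXICAL = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "cat"): 0.4,
           ("NN", "saw"): 0.1, ("VB", "dog"): 0.1, ("VB", "saw"): 0.9}

def inside(words):
    """IN[X, i, j]: probability that X yields words[i:j]."""
    n = len(words)
    IN = defaultdict(float)
    for i, w in enumerate(words):
        for (tag, word), p in LEXICAL.items():
            if word == w:
                IN[tag, i, i + 1] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for (a, b, c), p in BINARY.items():
                for k in range(i + 1, j):
                    IN[a, i, j] += p * IN[b, i, k] * IN[c, k, j]
    return IN

def outside(words, IN, root="S"):
    """OUT[X, i, j]: probability of everything outside X[i, j], widest spans first."""
    n = len(words)
    OUT = defaultdict(float)
    OUT[root, 0, n] = 1.0
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            j = i + width
            for (a, b, c), p in BINARY.items():
                out_a = OUT[a, i, j]
                if out_a == 0.0:
                    continue
                for k in range(i + 1, j):
                    OUT[b, i, k] += p * out_a * IN[c, k, j]
                    OUT[c, k, j] += p * out_a * IN[b, i, k]
    return OUT

def posteriors(words, root="S"):
    """Posterior IN * OUT / IN(root, 0, n) for every item with nonzero mass."""
    IN = inside(words)
    OUT = outside(words, IN, root)
    z = IN[root, 0, len(words)]
    return {item: IN[item] * OUT.get(item, 0.0) / z
            for item in list(IN)
            if IN[item] > 0.0 and OUT.get(item, 0.0) > 0.0}

def prune(post, threshold):
    """Chart items that survive; everything else is skipped by the finer grammar."""
    return {item for item, p in post.items() if p >= threshold}

post = posteriors("the dog saw the cat".split())
kept = prune(post, threshold=1e-4)  # threshold value is an arbitrary choice here
```

In a coarse-to-fine setup, `kept` would then restrict which spans and symbols the refined grammar is allowed to build.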

Parsing time: 1621 min → 111 min (no search error)

Multilevel Coarse-to-Fine Parsing [Charniak et al. ‘06]
• Add more rounds of pre-parsing
• Grammars coarser than X-bar
[Figure: X → A, B, … → NP … VP → ??? → refined grammar (NP-apple, NP-dog, NP-cat, NP-eat, VP-run, …)]

Hierarchical Pruning
Consider again the span 5 to 12:
[Figure: chart items for the span under the coarse grammar and under grammars split in two, split in four, and split in eight]
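The control flow of hierarchical pruning can be sketched as a loop over increasingly refined grammars, where only the subsymbols of surviving items are considered at the next level. The sketch assumes every symbol is split in two at every level (adaptive refinement would leave some symbols unsplit), and `posterior_fn` is a hypothetical stand-in for a real parsing pass with the grammar of that level.

```python
def children(symbol):
    """Subsymbols of (base, index) at the next level, assuming binary splits."""
    base, idx = symbol
    return [(base, 2 * idx), (base, 2 * idx + 1)]

def refine(kept_items):
    """Candidate items for the finer grammar: children of surviving items only."""
    return {(child, i, j) for (sym, i, j) in kept_items for child in children(sym)}

def hierarchical_prune(initial_items, posterior_fn, levels, threshold):
    """Run passes 0..levels, pruning after each pass and refining the survivors."""
    candidates = set(initial_items)
    kept = set()
    for level in range(levels + 1):
        # posterior_fn(level, candidates) -> {item: posterior}; hypothetical interface.
        post = posterior_fn(level, candidates)
        kept = {item for item in candidates if post.get(item, 0.0) >= threshold}
        if level < levels:
            candidates = refine(kept)
    return kept

def toy_posteriors(level, candidates):
    """Stand-in for a parser pass: pretends every VP item is very unlikely."""
    return {item: (0.0 if item[0][0] == "VP" else 1.0) for item in candidates}

start = {(("NP", 0), 5, 12), (("VP", 0), 5, 12)}
final = hierarchical_prune(start, toy_posteriors, levels=3, threshold=1e-4)
```

Because VP is pruned for the span at the coarsest level, none of its subsymbols are ever instantiated for that span in later passes.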

Intermediate Grammars
[Figure: learning produces the sequence X-Bar = G0, G1, G2, G3, G4, G5, G6 = G; e.g. DT is split into DT1, DT2, then DT1 … DT4, then DT1 … DT8]

Parsing time: 1621 min → 111 min → 35 min (no search error)

EM State Drift (DT tag)
[Figure: words associated with the DT subcategories at successive split levels, e.g. the, that, this, these, some, That, This, …]

0(G) 1(G) 2(G) 3(G) 4(G) 5(G) G1 G2 G3 G4 G5 G6 G1 G2 G3 G4 G5 G6 Learning Learning Projection i G Projected Grammars X-Bar=G0 G=

Estimating Projected Grammars
Nonterminals? Easy:
Nonterminals in G: S0, S1, NP0, NP1, VP0, VP1
Projection π
Nonterminals in π(G): S, NP, VP

Estimating Projected Grammars
Rules?
Rules in G:
S1 → NP1 VP1  0.20
S1 → NP1 VP2  0.12
S1 → NP2 VP1  0.02
S1 → NP2 VP2  0.03
S2 → NP1 VP1  0.11
S2 → NP1 VP2  0.05
S2 → NP2 VP1  0.08
S2 → NP2 VP2  0.12
Rules in π(G):
S → NP VP  ???

Estimating Projected Grammars [Corazza & Satta ‘06]
[Figure: estimating grammars uses a treebank; estimating projected grammars uses an infinite tree distribution]
Rules in G:
S1 → NP1 VP1  0.20
S1 → NP1 VP2  0.12
S1 → NP2 VP1  0.02
S1 → NP2 VP2  0.03
S2 → NP1 VP1  0.11
S2 → NP1 VP2  0.05
S2 → NP2 VP1  0.08
S2 → NP2 VP2  0.12
…
Rules in π(G):
S → NP VP  0.56
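One way to read the rule projection is as a weighted average: each refined rule contributes its probability weighted by the expected count of its parent subsymbol, normalized by the total expected count of all subsymbols that project to the same coarse parent. The sketch below uses the rule probabilities listed on the slide, but the expected counts for S1 and S2 are invented for illustration, and the slide elides further rules with "…", so the result is not meant to reproduce any number on the slide. The `project` helper is a hypothetical projection that simply strips the subcategory index.

```python
from collections import defaultdict

def project(symbol):
    """Hypothetical projection pi: strip the subcategory digits, e.g. 'NP2' -> 'NP'."""
    return symbol.rstrip("0123456789")

def project_rules(rules, counts):
    """rules: {(lhs, rhs_tuple): prob}; counts: {refined symbol: expected count}."""
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for symbol, c in counts.items():
        denominator[project(symbol)] += c
    for (lhs, rhs), p in rules.items():
        coarse_rule = (project(lhs), tuple(project(s) for s in rhs))
        numerator[coarse_rule] += counts[lhs] * p
    return {rule: mass / denominator[rule[0]] for rule, mass in numerator.items()}

# Rule probabilities as listed on the slide.
rules = {
    ("S1", ("NP1", "VP1")): 0.20, ("S1", ("NP1", "VP2")): 0.12,
    ("S1", ("NP2", "VP1")): 0.02, ("S1", ("NP2", "VP2")): 0.03,
    ("S2", ("NP1", "VP1")): 0.11, ("S2", ("NP1", "VP2")): 0.05,
    ("S2", ("NP2", "VP1")): 0.08, ("S2", ("NP2", "VP2")): 0.12,
}
# Assumed expected counts of the parent subsymbols (not from the slides).
counts = {"S1": 0.6, "S2": 0.4}

projected = project_rules(rules, counts)
print(projected[("S", ("NP", "VP"))])
```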

Calculating Expectations
• Nonterminals:
  • c_k(X): expected counts up to depth k
  • Converges within 25 iterations (few seconds)
• Rules:
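The slide's formulas did not survive extraction, so the following is a sketch of one standard way to compute such expectations, not necessarily the authors' exact recursion. It assumes c_0 places count 1 on the root, and that each iteration adds the children generated by all nodes counted so far; rule expectations are then taken as the parent's expected count times the rule probability. The toy grammar and all names are hypothetical.

```python
def expected_counts(rules, root, iterations=25):
    """Expected number of occurrences of each nonterminal in trees of depth <= iterations.

    rules: {(lhs, rhs_tuple): prob}. Symbols that never occur as a left-hand side
    are treated as terminals and ignored.
    """
    nonterminals = {lhs for lhs, _ in rules}
    counts = {x: (1.0 if x == root else 0.0) for x in nonterminals}
    for _ in range(iterations):
        new = {x: (1.0 if x == root else 0.0) for x in nonterminals}
        for (lhs, rhs), p in rules.items():
            for child in rhs:
                if child in nonterminals:
                    new[child] += counts[lhs] * p
        counts = new
    return counts

def expected_rule_counts(rules, counts):
    """Expected count of a rule = expected count of its parent * rule probability."""
    return {(lhs, rhs): counts[lhs] * p for (lhs, rhs), p in rules.items()}

# Hypothetical toy grammar; lowercase symbols are terminals.
toy_rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("n",)): 0.8,
    ("VP", ("v", "NP")): 1.0,
    ("PP", ("p", "NP")): 1.0,
}
c = expected_counts(toy_rules, root="S")
rc = expected_rule_counts(toy_rules, c)
```

The iteration only converges when the grammar generates finite trees with probability one; for the toy grammar above the fixed point can be checked by hand (c(NP) = 2 + 0.4 c(NP)).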

Parsing time: 1621 min → 111 min → 35 min → 15 min (no search error)

Parsing times
[Figure: share of parsing time per grammar: X-Bar = G0: 60 %, G1: 12 %, G2: 7 %, G3: 6 %, G4: 6 %, G5: 5 %, G6 = G: 4 %]

Bracket Posteriors (after G0)

Bracket Posteriors (after G1)

Bracket Posteriors (Movie) (Final Chart)

Bracket Posteriors (Best Tree)

Parse Selection
[Figure: one parse corresponds to several derivations over split symbols (-1, -2)]
Computing most likely unsplit tree is NP-hard:
• Settle for best derivation.
• Rerank n-best list.
• Use alternative objective function.
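The gap between the best derivation and the best parse can be illustrated on a tiny, explicitly enumerated set of derivations. The trees and probabilities below are invented; in a real grammar the derivations cannot be enumerated like this, which is why the slide's alternatives are needed.

```python
from collections import defaultdict

def unsplit(tree):
    """Map a derivation over split symbols (e.g. 'NP-2') to its unsplit parse."""
    if isinstance(tree, str):  # a word
        return tree
    label, *kids = tree
    return (label.split("-")[0], *(unsplit(k) for k in kids))

def best_derivation_parse(derivations):
    """Parse of the single most probable derivation."""
    tree, _ = max(derivations, key=lambda d: d[1])
    return unsplit(tree)

def best_parse(derivations):
    """Parse with the largest total probability, summed over its derivations."""
    mass = defaultdict(float)
    for tree, p in derivations:
        mass[unsplit(tree)] += p
    return max(mass, key=mass.get)

# Hypothetical derivations: three share one parse, one has a different parse.
derivations = [
    (("S-1", ("X-1", "a"), ("Y-1", "b")), 0.25),
    (("S-1", ("X-2", "a"), ("Y-1", "b")), 0.20),
    (("S-2", ("X-1", "a"), ("Y-2", "b")), 0.15),
    (("S-1", ("Z-1", "a"), ("Y-1", "b")), 0.40),
]
print(best_derivation_parse(derivations))
print(best_parse(derivations))
```

Here the single best derivation uses Z, while the parse with X collects more total probability across its three derivations.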

Parse Risk Minimization [Titov & Henderson ‘06]
• Expected loss according to our beliefs: $T_P^* = \arg\min_{T_P} \sum_{T_T} P(T_T)\, L(T_P, T_T)$
• T_T: true tree
• T_P: predicted tree
• L: loss function (0/1, precision, recall, F1)
• Use n-best candidate list and approximate expectation with samples.
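A minimal sketch of risk minimization over an n-best list follows. As a simplification, the candidates themselves stand in for samples of the true tree, weighted by their renormalized probabilities; the loss is 1 − F1 over labeled brackets. The candidate trees and probabilities are invented.

```python
def f1_loss(predicted, true):
    """1 - F1 between two sets of labeled brackets (label, start, end)."""
    if not predicted and not true:
        return 0.0
    overlap = len(predicted & true)
    if overlap == 0:
        return 1.0
    precision = overlap / len(predicted)
    recall = overlap / len(true)
    return 1.0 - 2 * precision * recall / (precision + recall)

def min_risk_parse(nbest, loss=f1_loss):
    """nbest: list of (bracket_set, probability). Returns the minimum expected loss candidate."""
    total = sum(p for _, p in nbest)

    def risk(candidate):
        # Expected loss if each list entry were the true tree with its normalized probability.
        return sum(p / total * loss(candidate, other) for other, p in nbest)

    return min((brackets for brackets, _ in nbest), key=risk)

# Hypothetical 3-best list: t1 is most probable, but t2 and t3 largely agree with each other.
t1 = frozenset({("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5)})
t2 = frozenset({("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)})
t3 = frozenset({("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)})
nbest = [(t1, 0.40), (t2, 0.35), (t3, 0.25)]

chosen = min_risk_parse(nbest)
```

In this example the minimum-risk choice differs from the most probable candidate, because most of the probability mass agrees on the brackets shared by t2 and t3.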

Reranking Results

Dynamic Programming
• [Matsuzaki et al. ‘05]: approximate posterior parse distribution
• à la [Goodman ‘98]: maximize number of expected correct rules
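The "maximize expected correct rules" objective can be sketched as a CKY-style dynamic program over rule posteriors. The sketch assumes the posteriors of unsplit rules at each span have already been computed (for example by summing inside-outside scores over subsymbols), covers only binary and lexical rules, and uses invented posterior values; the data layout and function name are hypothetical.

```python
from functools import lru_cache

def max_rule_parse(n, binary_post, lexical_post, root="S"):
    """Tree maximizing the sum of rule posteriors.

    binary_post: {(A, B, C, i, k, j): posterior of A -> B C over spans [i,k), [k,j)}
    lexical_post: {(A, i): posterior of tag A at position i}
    Returns (score, tree); unary rules are not handled in this sketch.
    """
    by_item = {}
    for (a, b, c, i, k, j), q in binary_post.items():
        by_item.setdefault((a, i, j), []).append((b, c, k, q))

    @lru_cache(maxsize=None)
    def best(a, i, j):
        if j == i + 1:
            q = lexical_post.get((a, i))
            return (q, (a, i)) if q is not None else (float("-inf"), None)
        result = (float("-inf"), None)
        for b, c, k, q in by_item.get((a, i, j), []):
            left_score, left = best(b, i, k)
            right_score, right = best(c, k, j)
            score = q + left_score + right_score
            if score > result[0]:
                result = (score, (a, left, right))
        return result

    return best(root, 0, n)

# Hypothetical posteriors for a 3-word sentence with two competing bracketings.
binary_post = {
    ("S", "X", "C", 0, 2, 3): 0.7, ("X", "A", "B", 0, 1, 2): 0.7,
    ("S", "A", "Y", 0, 1, 3): 0.3, ("Y", "B", "C", 1, 2, 3): 0.3,
}
lexical_post = {("A", 0): 1.0, ("B", 1): 1.0, ("C", 2): 1.0}

score, tree = max_rule_parse(3, binary_post, lexical_post)
```

Unlike the enumeration of derivations, this search is polynomial in sentence length, since it operates on unsplit symbols only.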

Dynamic Programming Results

Final Results (Efficiency)
• Berkeley Parser:
  • 15 min
  • 91.2 F-score
  • Implemented in Java
• Charniak & Johnson ‘05 Parser
  • 19 min
  • 90.7 F-score
  • Implemented in C

Final Results (Accuracy)

Conclusions
• Hierarchical coarse-to-fine inference
• Projections
• Marginalization
• Multi-lingual unlexicalized parsing

Thank You! Parser available at http://nlp.cs.berkeley.edu
