Beam-Width Prediction for Efficient Context-Free Parsing
Nathan Bodenstab, Aaron Dunlop, Keith Hall, Brian Roark
June 2011
OHSU Beam-Search Parser (BUBS) • Standard bottom-up CYK • Beam-search per chart cell • Only the “best” entries in each cell are retained
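The per-cell beam search on this slide can be sketched as a toy bottom-up CYK loop. The grammar, lexicon, and log scores below are invented for illustration and are not the BUBS parser's actual grammar:

```python
from collections import defaultdict
import heapq

# Toy binarized grammar: (left child, right child) -> [(parent, log prob)]
GRAMMAR = {
    ("NP", "VP"): [("S", -0.5)],
    ("DT", "NN"): [("NP", -0.7)],
    ("V", "NP"): [("VP", -0.9)],
}
# Toy lexicon: word -> [(POS tag, log prob)]
LEXICON = {"the": [("DT", -0.1)], "dog": [("NN", -0.2)],
           "saw": [("V", -0.3)], "a": [("DT", -0.1)], "cat": [("NN", -0.2)]}

def cyk_beam(words, beam=2):
    """Bottom-up CYK where each chart cell retains only its `beam` best entries."""
    n = len(words)
    chart = defaultdict(dict)  # (i, j) -> {label: best log score}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = dict(LEXICON[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):  # all binary splits of the span
                for lc, ls in chart[(i, k)].items():
                    for rc, rs in chart[(k, j)].items():
                        for parent, rule_score in GRAMMAR.get((lc, rc), []):
                            score = ls + rs + rule_score
                            if score > cell.get(parent, float("-inf")):
                                cell[parent] = score
            # Beam pruning: keep only the `beam` best entries in this cell.
            best = heapq.nlargest(beam, cell.items(), key=lambda kv: kv[1])
            chart[(i, j)] = dict(best)
    return chart[(0, n)]

print(cyk_beam("the dog saw a cat".split()))  # root cell: S with log score ~ -3.7
```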
Ranking, Prioritization, and FOMs • f() = g() + h() • Figure of Merit • Caraballo and Charniak (1997) • A* search • Klein and Manning (2003) • Pauls and Klein (2010) • Other • Turian (2007) • Huang (2008) • Apply to beam-search
Beam-Width Prediction • Traditional beam-search uses a constant beam-width • Two definitions of beam-width: • Number of local competitors to retain (n-best) • Score difference from the best entry • Advantages • Heavy pruning compared to CYK • Minimal sorting compared to a global agenda • Disadvantages • No global pruning – all chart cells are treated equally • The beam must be set conservatively to keep outlier gold edges within it
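The two beam-width definitions above can be contrasted in a few lines; the log scores here are made up for illustration:

```python
def prune_nbest(scores, n):
    """Definition 1: keep the n highest-scoring entries (n-best)."""
    return sorted(scores, reverse=True)[:n]

def prune_score_delta(scores, delta):
    """Definition 2: keep entries whose score is within `delta` of the best."""
    best = max(scores)
    return [s for s in scores if best - s <= delta]

cell = [-1.0, -1.5, -4.0, -9.0]      # log scores of one cell's candidates
print(prune_nbest(cell, 2))          # [-1.0, -1.5]
print(prune_score_delta(cell, 1.0))  # [-1.0, -1.5]
```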
Beam-Width Prediction • How often is the gold edge ranked in the top N per chart cell? • Exhaustively parse Section 22 with the Berkeley latent-variable grammar [Chart: cumulative gold edges vs. gold rank ≤ N]
Beam-Width Prediction • Beam-search + C&C Boundary ranking: how often is the gold edge ranked in the top N per chart cell? • To maintain baseline accuracy, the beam-width must be set to 15 with C&C Boundary ranking (and 50 using only the inside score) • Over 70% of gold edges are already ranked first in the local agenda • 14 of the 15 beam entries in these cells are unnecessary • We can do much better than a constant beam-width [Chart: cumulative gold edges vs. gold rank ≤ N]
Beam-Width Prediction • Method: Train an averaged perceptron (Collins, 2002) to predict the optimal beam-width per chart cell • Map each chart cell in sentence S spanning words wi … wj to a feature vector representation: • x: lexical and POS unigrams and bigrams, relative and absolute span • y = 1 if gold rank > k, 0 otherwise (cells with no gold edge are assigned rank −1) • Minimize the loss Σᵢ L(H(θ · xᵢ), yᵢ), where H is the unit step function
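A minimal averaged perceptron in the style of Collins (2002) for one such binary classifier, sketched from scratch; the feature names and toy examples below are invented, and the paper's actual feature set (lexical/POS n-grams, span features) is far richer:

```python
def predict(w, x):
    """1 if the cell is predicted to need a beam wider than cutoff k, else 0."""
    return 1 if sum(w.get(f, 0.0) * v for f, v in x.items()) > 0 else 0

def train_averaged_perceptron(examples, epochs=10):
    """Averaged perceptron for one binary beam-width classifier.

    examples: (feature dict, y) pairs with y = 1 if gold rank > k.
    Uses the standard running-sum trick to average the weights without
    storing every intermediate weight vector.
    """
    w, u, t = {}, {}, 1          # weights, weighted update cache, timestep
    for _ in range(epochs):
        for x, y in examples:
            if predict(w, x) != y:
                d = 1 if y == 1 else -1
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + d * v
                    u[f] = u.get(f, 0.0) + t * d * v
            t += 1
    # Averaged weights: w_avg = w - u / t
    return {f: w[f] - u.get(f, 0.0) / t for f in w}

# Toy cells: these feature dicts are invented stand-ins for the paper's features.
data = [({"bias": 1.0, "word=the": 1.0}, 0),
        ({"bias": 1.0, "word=saw": 1.0}, 1)]
model = train_averaged_perceptron(data)
```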
Beam-Width Prediction • Method: Use a discriminative classifier to predict the optimal beam-width per chart cell • Minimize the loss Σᵢ L(H(θ · xᵢ), yᵢ) • L is an asymmetric loss function: • If the beam-width is too large: tolerable efficiency loss • If the beam-width is too small: high risk to accuracy • λ set to 10² in all experiments
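One plausible shape for such an asymmetric loss; the slide only states the asymmetry and the value of λ, so the linear form here is an assumption:

```python
LAMBDA = 10 ** 2  # the slide's lambda = 10^2

def asymmetric_loss(predicted, needed, lam=LAMBDA):
    """Asymmetric penalty: an over-wide beam only wastes work, while an
    over-narrow beam risks pruning the gold edge entirely.
    (Functional form assumed for illustration.)"""
    if predicted >= needed:
        return predicted - needed            # tolerable efficiency loss
    return lam * (needed - predicted)        # high risk to accuracy
```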
Beam-Width Prediction Special case: Predict whether a chart cell is open or closed to multi-word constituents
Beam-Width Prediction • A “closed” chart cell may still need to be partially open • Binarized or dotted-rule parsing creates new “factored” productions
Beam-Width Prediction Method 1: Constituent Closure
Beam-Width Prediction • Constituent Closure is a per-cell generalization of Roark & Hollingshead (2008) • O(n²) classifications instead of O(n)
Beam-Width Prediction Method 2: Complete Closure
Beam-Width Prediction Method 3: Beam-Width Prediction
Beam-Width Prediction Method 3: Beam-Width Prediction • Use multiple binary classifiers instead of regression (better performance) • The local beam-width is taken from the classifier with the smallest beam-width prediction • Best performance with four binary classifiers: 0, 1, 2, 4 • 97% of positive examples have beam-width ≤ 4 • No need for a classifier at every possible beam-width between 0 and the global maximum (15 in our case)
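The "smallest prediction wins" combination of the four binary classifiers might look like this; the decision encoding is an assumption, since the slide only names the cutoffs and the rule:

```python
CUTOFFS = [0, 1, 2, 4]   # the four classifier cutoffs from the slide
GLOBAL_MAX = 15          # fall-back global maximum beam-width

def predict_beam_width(needs_wider):
    """needs_wider[i] is classifier i's decision: 'gold rank > CUTOFFS[i]'.

    The cell receives the smallest cutoff whose classifier votes 'no';
    if every classifier votes 'yes', fall back to the global maximum.
    """
    for cutoff, wider in zip(CUTOFFS, needs_wider):
        if not wider:
            return cutoff
    return GLOBAL_MAX
```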
Beam-Width Prediction [Chart: y-axis scale 0.0–1.0]
Beam-Width Prediction • Section 22 development set results • Decoding time is reported in seconds per sentence, averaged over all sentences in Section 22 • Parsing with the Berkeley latent-variable grammar (4.3 million productions)
Beam-Width Prediction • Section 23 test results • Only MaxRule marginalizes over latent variables and performs non-Viterbi decoding
FOM Details • C&C FOM Details • FOM(NT) = Outsideleft * Inside * Outsideright • Inside = accumulated grammar score • Outsideleft = MaxPOS [ POS forward prob * POS-to-NT transition prob ] • Outsideright = MaxPOS [ NT-to-POS transition prob * POS backward prob ]
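In log space the products above become sums; a hedged sketch of the boundary FOM with hypothetical log-probability tables:

```python
def cc_boundary_fom(inside, fwd, pos_to_nt, nt_to_pos, bwd):
    """C&C boundary FOM in log space: FOM = Outside_left + Inside + Outside_right.

    fwd / bwd: log forward/backward probabilities of the boundary POS tags;
    pos_to_nt / nt_to_pos: log transition estimates. All inputs are
    hypothetical stand-ins for the estimates the parser would precompute.
    """
    outside_left = max(fwd[p] + pos_to_nt[p] for p in fwd)
    outside_right = max(nt_to_pos[p] + bwd[p] for p in nt_to_pos)
    return outside_left + inside + outside_right

fom = cc_boundary_fom(
    inside=-2.0,
    fwd={"DT": -1.0, "NN": -2.0}, pos_to_nt={"DT": -0.5, "NN": -0.2},
    nt_to_pos={"V": -0.3}, bwd={"V": -1.0},
)
print(fom)  # max(-1.5, -2.2) + -2.0 + -1.3 = -4.8
```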