A Study of Some Factors Impacting SuperARV Language Modeling
Wen Wang (1), Andreas Stolcke (1), Mary P. Harper (2)
1. Speech Technology & Research Laboratory, SRI International
2. School of Electrical and Computer Engineering, Purdue University
EARS STT Workshop
Motivation
• RT-03 SuperARV gave excellent results using a backoff N-gram approximation [ICASSP’04 paper]
• N-gram backoff approximation of RT-04 SuperARV did not generalize to the RT-04 evaluation test set
  • Dev04: achieved 1.0% absolute WER reduction over baseline LM
  • Eval04: no gain in WER (in fact, a small loss)
• RT-04 SARV LM was developed under considerable time pressure
  • Training procedure is very time consuming (weeks to months), due to syntactic parsing of training data
  • Did not have time to examine all design choices in combination
• Reexamine all design decisions in detail
What Changed?
RT-04 SARV training differed from RT-03 in two respects:
• Retrained the Charniak parser using a combination of the Switchboard Penn Treebank and the Wall Street Journal Penn Treebank. The 2003 parser was trained on the WSJ Treebank only.
• Built a SuperARV LM with additional modifiee lexical feature constraints (Standard+ model). The 2003 LM was a SuperARV LM without these additional constraints (Standard model).
These changes had given improvements at various points, but weren’t tested in complete systems on new Fisher data.
Plan of Attack
• Revisit changes to training procedure
  • Check effect on old and new data sets and systems
• Revisit the backoff N-gram approximation
  • Did we just get lucky in 2003?
  • Evaluate full SuperARV LM in N-best rescoring
  • Find better approximations
• Start investigation by going back to 2003 LM, then move to current system.
• Validate training software (and document and release)
• Work in progress
  • Holding out on eval04 testing (avoid implicit tuning)
Perplexity of RT-03 LMs
• RT-03 LM training data
• LM types tested:
  • “Word”: Word backoff 4-gram, KN smoothed
  • “SARV N-gram”: N-gram approximation to standard SuperARV LM
  • “SARV Standard”: full SuperARV (without additional constraints)
• Full model gains smaller on dev04
• N-gram approximation breaks down
N-best Rescoring with Full SuperARV LM
• Evaluated full Standard SARV LM in final N-best rescoring
• Based on PLP subsystem of RT-03 CTS system
• Full SARV rescoring is expensive, so tried increasingly longer N-best lists
  • Top-50
  • Top-500
  • Top-2000 (max used in eval system)
• Early passes (including MLLR) use baseline LM, so gains will be limited
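The rescoring setup above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the slides: the `Hypothesis` type, the `rescore_nbest` function, the log-linear score combination, the `lm_weight` and `word_penalty` values, and the toy LM that stands in for the full SuperARV LM.

```python
from typing import Callable, List, NamedTuple, Tuple


class Hypothesis(NamedTuple):
    """One entry of an N-best list (hypothetical layout)."""
    words: Tuple[str, ...]
    acoustic_logprob: float  # score carried over from the earlier decoding pass


def rescore_nbest(
    nbest: List[Hypothesis],
    lm_logprob: Callable[[Tuple[str, ...]], float],
    top_n: int,
    lm_weight: float = 8.0,      # assumed value, would be tuned on held-out data
    word_penalty: float = 0.0,   # assumed value
) -> Hypothesis:
    """Return the best of the first top_n hypotheses under the new LM."""
    if top_n < 1 or not nbest:
        raise ValueError("need a non-empty N-best list and top_n >= 1")

    def total(hyp: Hypothesis) -> float:
        # Log-linear combination of acoustic score, LM score and length penalty.
        return (hyp.acoustic_logprob
                + lm_weight * lm_logprob(hyp.words)
                + word_penalty * len(hyp.words))

    return max(nbest[:top_n], key=total)


def toy_lm(words: Tuple[str, ...]) -> float:
    """Stand-in for an expensive LM: strongly prefers one word sequence."""
    return 0.0 if words == ("x", "y") else -1.0


# N-best list sorted by the first-pass score, as a recognizer would emit it.
nbest = [
    Hypothesis(("a", "b"), -10.0),
    Hypothesis(("a", "c"), -11.0),
    Hypothesis(("x", "y"), -12.0),
]

best_short = rescore_nbest(nbest, toy_lm, top_n=2)
best_long = rescore_nbest(nbest, toy_lm, top_n=3)
```

In this toy list the hypothesis preferred by the new LM sits at rank 3, so it can only win when the list is long enough to include it. The cost grows with `top_n` because `lm_logprob` is called once per hypothesis.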
RT-03 LM N-best Rescoring Results
• Standard SuperARV reduces WER on eval02, eval03
• No gain on dev04
• Identical gains on eval03-SWB and eval03-Fisher
• SuperARV gain increases with larger hypothesis space
Adding Modifiee Constraints
• Constraints enforced by a Constraint Dependency Grammar (on which SuperARV is based) can be enhanced by utilizing modifiee information in unary and binary constraints
• Expected that this information can improve the SuperARV LM.
• In RT-04 development, explored using only the modifiees’ lexical categories in the LM, adding them to the SuperARV tag structure.
• This reduced perplexity and WER in early experiments.
• But: additional tag constraints could have hurt LM generalization!
Perplexity with Modifiee Constraints
• Trained a SuperARV LM augmented with modifiee lexical features on RT-03 LM data (“Standard+” model)
• Standard+ model reduces perplexity on the eval02 and eval03 test sets (relative to Standard)
• But not on Fisher (dev04) test set!
N-best Rescoring with Modifiee Constraints
• WER reductions consistent with perplexity results
• No improvement on dev04.
In-domain vs. Out-of-domain Parser Training
• SuperARVs are collected from CDG parses that are obtained by transforming CFG parses
• CFG parses are generated using existing state-of-the-art parsers.
• In 2003: CTS data parsed with parser trained on Wall Street Journal Treebank (out-of-domain parser)
• In 2004: Obtained trainable version of Charniak parser
  • Retrained parser on a combination of Switchboard Treebank and WSJ Treebank (in-domain parser)
  • Expected improved consistency and accuracy of parse structures
  • However, there were bugs in that retraining; fixed for the current experiment.
Rescoring Results with In-domain Parser
• Reparsed the RT-03 LM training data with in-domain parser
• Retrained Standard SuperARV model (“Standard-retrained”)
• N-best rescoring system as before
• In-domain parsing helps
• Also: number of distinct SuperARV tags reduced in retraining (improved parser consistency)
Summary So Far
• Prior design decisions have been validated
  • Adding modifiee constraints helps LM on matched data
  • Reparsing with retrained in-domain parser improves LM quality
• Now: reexamine approximation used in decoding
  • (work in progress)
N-best Rescoring with RT-04 Full Standard+ Model
• RT-04 model is “Standard+” model (includes modifiee constraints)
• RT-04 had been built with in-domain parser
• Caveat: old parser runs were affected by some (not catastrophic) bugs; still need to reparse RT-04 LM training data (significantly more than RT-03 data)
• Improved WER, but smaller gains than on older test sets
• Gains improve with more hypotheses
• Suggests need for better approximation to enable use of SuperARV in search
Original N-gram Approximation Algorithm
• Algorithm description:
  • For each N-gram observed in the training data (whose SuperARV tag information is known), calculate its probability using the Standard or Standard+ SuperARV LM, generating a new LM after renormalization;
  • For each of these N-grams, w1…wn (with tags t1…tn):
    • Extract the short-SuperARV (a subset of components of a SuperARV) sequence from t1…tn, denoted st1…stn;
    • Find the list of word sequences sharing the same short-SuperARV sequence st1…stn, using the lexicon constructed after training;
    • From this list, select N-grams that do not exist in the training data and that, when added, either reduce the perplexity on a held-out test set or increase it by less than a threshold;
  • The resulting LM could be pruned to make its size comparable to a word-based LM.
• If the held-out set is small, the algorithm will overfit
• If the held-out set is large, the algorithm will be slow.
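The selection step of the algorithm above can be sketched as a simple loop. This is a sketch under stated assumptions, not the authors' implementation: the function name `select_ngrams`, the `heldout_perplexity` callback, the one-candidate-at-a-time greedy order, and the toy perplexity function are all hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Ngram = Tuple[str, ...]


def select_ngrams(
    observed: Iterable[Ngram],
    candidates: Iterable[Ngram],
    heldout_perplexity: Callable[[FrozenSet[Ngram]], float],
    threshold: float,
) -> Set[Ngram]:
    """Greedily add unseen candidate N-grams to the observed set.

    A candidate is kept when held-out perplexity drops, or rises by less
    than `threshold`. `heldout_perplexity` is assumed to build and evaluate
    an LM containing exactly the given N-grams.
    """
    selected: Set[Ngram] = set(observed)
    current = heldout_perplexity(frozenset(selected))
    for ngram in candidates:
        if ngram in selected:
            continue  # only N-grams absent from the training data are considered
        trial = heldout_perplexity(frozenset(selected | {ngram}))
        if trial < current + threshold:
            selected.add(ngram)
            current = trial
    return selected


# Toy stand-in for a real held-out evaluation: each N-gram shifts perplexity
# by a fixed amount (invented numbers, for illustration only).
EFFECT = {("a", "c"): -5.0, ("z", "z"): 1.0, ("q", "q"): 10.0}


def toy_perplexity(ngrams: FrozenSet[Ngram]) -> float:
    return 100.0 + sum(EFFECT.get(ng, 0.0) for ng in ngrams)


observed = [("a", "b")]
candidates = [("a", "c"), ("z", "z"), ("q", "q")]
loose = select_ngrams(observed, candidates, toy_perplexity, threshold=2.0)
strict = select_ngrams(observed, candidates, toy_perplexity, threshold=0.0)
```

In this sketch every candidate triggers one held-out evaluation, which makes the cost of a large held-out set visible. Accepting N-grams on the basis of a small held-out set ties the selection to that set, which is how the overfitting mentioned above can arise.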
Revised N-gram Approximation for SuperARV LMs
• Idea: build a testset-specific N-gram LM that approximates the SuperARV LM [suggested by Dimitra Vergyri]
  • Include all N-grams that “matter” to the decoder
• Method:
  Step 1: perform the first-pass decoding using a word-based language model on a test set, and generate HTK lattices
  Step 2: extract N-grams from the HTK lattices; prune based on posterior counts
  Step 3: compute conditional probabilities for these N-grams using a standard SuperARV language model
  Step 4: compute backoff weights based on the conditional probabilities
  Step 5: apply the resulting N-gram LM in all subsequent decoding passes (using standard tools)
• Some approximations left:
  • Due to pruning in Step 2
  • From using only N-gram context, not full sentence prefix
• Drawback: Step 3 takes significant compute time
  • currently 10xRT, but not optimized for speed yet
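Step 4 can be illustrated with the usual backoff normalization, shown here for bigrams backing off to unigrams. The slides do not give the formula, so this is an assumption: the weight for a context h is taken as (1 - sum of p(w|h) over listed bigrams) / (1 - sum of p(w) over the same words), and the function names and toy probabilities are hypothetical.

```python
from typing import Dict, Tuple

Bigram = Tuple[str, str]


def backoff_weights(
    bigram_probs: Dict[Bigram, float],
    unigram_probs: Dict[str, float],
) -> Dict[str, float]:
    """Compute one backoff weight per context so each distribution sums to 1."""
    high: Dict[str, float] = {}  # explicit bigram mass per context
    low: Dict[str, float] = {}   # unigram mass of the same words
    for (context, word), prob in bigram_probs.items():
        high[context] = high.get(context, 0.0) + prob
        low[context] = low.get(context, 0.0) + unigram_probs[word]

    weights: Dict[str, float] = {}
    for context in high:
        denominator = 1.0 - low[context]
        # If every word is listed explicitly, nothing is left to back off to.
        weights[context] = (1.0 - high[context]) / denominator if denominator > 0 else 0.0
    return weights


def bigram_prob(
    context: str,
    word: str,
    bigram_probs: Dict[Bigram, float],
    unigram_probs: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Look up p(word | context), backing off to the unigram when unlisted."""
    if (context, word) in bigram_probs:
        return bigram_probs[(context, word)]
    return weights.get(context, 1.0) * unigram_probs[word]


# Toy numbers; in the method above the bigram value would come from Step 3.
unigrams = {"a": 0.5, "b": 0.3, "c": 0.2}
bigrams = {("a", "b"): 0.6}
bow = backoff_weights(bigrams, unigrams)
total = sum(bigram_prob("a", w, bigrams, unigrams, bow) for w in unigrams)
```

With these numbers the weight for context "a" is 0.4 / 0.7, and the probabilities conditioned on "a" sum to one, so the result can be written out as an ordinary backoff N-gram LM.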
Lattice N-gram Approximation Experiment
• Based on RT-03 Standard SuperARV LM
• Extracted N-grams from first-pass HTK lattices
• Pruned N-grams with total posterior count < 10^-3
• Left with 3.6M N-grams on a 6h test set
• RT-02/03 experiment
  • Uses 2003 acoustic models
  • 2000-best rescoring (1st pass)
• Dev-04 experiment
  • Uses 2004 acoustic models
  • Lattice rescoring (1st pass)
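The pruning rule above can be sketched as follows. The input layout (one `(ngram, posterior)` pair per N-gram occurrence in a lattice) and the function name `prune_by_posterior` are assumptions for illustration; only the 10^-3 threshold comes from the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Ngram = Tuple[str, ...]


def prune_by_posterior(
    occurrences: Iterable[Tuple[Ngram, float]],
    min_count: float = 1e-3,
) -> Dict[Ngram, float]:
    """Sum posterior counts per N-gram and drop those below min_count."""
    totals: Dict[Ngram, float] = defaultdict(float)
    for ngram, posterior in occurrences:
        totals[ngram] += posterior
    return {ngram: count for ngram, count in totals.items() if count >= min_count}


# Toy occurrences pooled over a whole test set (invented numbers).
occurrences = [
    (("a", "b"), 0.9),
    (("a", "b"), 0.4),
    (("x", "y"), 0.0004),
    (("x", "y"), 0.0004),
    (("q",), 0.0002),
]
kept = prune_by_posterior(occurrences)
```

Only `("a", "b")` survives here: the other totals stay below the threshold even after summing over occurrences.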
Lattice N-gram Approximation Results
• 1.2% absolute gain on old (matched) test sets
• Small 0.2% gain on Fisher (mismatched) test set
• Recall: no Fisher gain previously with N-best rescoring
• Better exploitation of full hypothesis space yields results
Conclusions and Future Work
• There is a tradeoff between the generality and selectivity of a SuperARV model, much as was observed in our past CDG grammar induction experiments.
  • When making a model more constrained, its generality may be reduced.
  • Modifiee lexical features are helpful for strengthening constraints for word prediction, but they might need more or better matched data
  • We need a better understanding of the interaction between this knowledge source and characteristics of the training data, e.g., the Fisher domain.
• For a structured model like the SuperARV model, it is beneficial to improve the quality of training syntactic structures, e.g., making them less errorful or more consistent.
  • Observed LM win from better parses (using retrained parser)
  • Can expect further gains from advances in parse accuracy
Conclusions and Future Work (Cont.)
• Old N-gram approximation was flawed
• New N-gram approximation looks promising, but also needs more work
  • Tests using full system
  • Rescoring algorithm needs speeding up
• Still to do: reparse current CTS LM training set.
• Longer term: plan to investigate how conversational speech phenomena (sentence fragments, disfluencies) can be modeled better in the SuperARV framework.