Random Forests for Language Modeling Peng Xu, Frederick Jelinek
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
Basic Language Modeling • Estimate the source probability P(W) = P(w_1 … w_N) from a training corpus: a large amount of text chosen for similarity to the expected sentences • Parametric conditional models: P(W) = Π_i P(w_i | h_i), where the history h_i = w_1 … w_{i−1}
Basic Language Modeling • Smooth models: P(w | h) > 0 for every word w and history h • Perplexity (PPL): PPL = exp( −(1/N) Σ_i log P(w_i | h_i) ), the average per-word branching on test data (lower is better) • n-gram models: approximate the history by its last n−1 words, P(w_i | h_i) ≈ P(w_i | w_{i−n+1} … w_{i−1})
Estimate n-gram Parameters • Maximum Likelihood (ML) estimate: P(w_i | w_{i−2} w_{i−1}) = C(w_{i−2} w_{i−1} w_i) / C(w_{i−2} w_{i−1}) • Best on training data: lowest PPL • Data sparseness problem: n = 3, |V| = 10k gives 10^12 possible trigrams, so about a trillion words of training data would be needed • Zero probability for almost all test data!
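A minimal Python sketch (with an illustrative toy corpus) of the ML trigram estimate and the perplexity computation above; the zero probabilities on unseen trigrams are exactly the sparseness problem the following slides address.

```python
from collections import defaultdict
import math

def ml_trigram(corpus):
    """Maximum-likelihood trigram estimate: P(w | u, v) = C(u, v, w) / C(u, v)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in corpus:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for u, v, w in zip(words, words[1:], words[2:]):
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1
    return lambda u, v, w: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

def perplexity(prob, corpus):
    """PPL = exp(-(1/N) * sum log P(w_i | history)); any zero probability makes it infinite."""
    logp, n = 0.0, 0
    for sent in corpus:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for u, v, w in zip(words, words[1:], words[2:]):
            p = prob(u, v, w)
            if p == 0.0:
                return float("inf")
            logp += math.log(p)
            n += 1
    return math.exp(-logp / n)

train = [["a", "b", "a"], ["a", "c", "a"]]
p = ml_trigram(train)
print(perplexity(p, train))               # lowest PPL: the ML estimate fits the training data
print(perplexity(p, [["a", "b", "c"]]))   # unseen trigram (a, b, c) -> infinite PPL
```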
Dealing with Sparsity • Smoothing: use lower-order statistics • Word clustering: reduce the size of V • History clustering: reduce the number of histories • Maximum entropy: use exponential models • Neural network: represent words in real space, use an exponential model
Smoothing Techniques Good smoothing techniques: Deleted Interpolation, Katz, Absolute Discounting, Kneser-Ney (KN) • Kneser-Ney: consistently the best [Chen & Goodman, 1998]
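The slides do not spell out the Kneser-Ney formula, so here is a minimal bigram sketch of interpolated Kneser-Ney smoothing in its usual textbook form; the fixed discount of 0.75 and all helper names are illustrative, not taken from the slides.

```python
from collections import defaultdict

def train_kn_bigram(corpus, discount=0.75):
    """Interpolated Kneser-Ney for bigrams:
    P(w | u) = max(C(u,w) - D, 0)/C(u) + (D * N1+(u,*)/C(u)) * P_cont(w),
    where P_cont(w) = N1+(*,w) / N1+(*,*) uses continuation counts, not raw counts."""
    big, uni = defaultdict(int), defaultdict(int)
    followers = defaultdict(set)     # distinct words seen after history u
    predecessors = defaultdict(set)  # distinct histories seen before word w
    for sent in corpus:
        words = ["<s>"] + sent + ["</s>"]
        for u, w in zip(words, words[1:]):
            big[(u, w)] += 1
            uni[u] += 1
            followers[u].add(w)
            predecessors[w].add(u)
    bigram_types = sum(len(s) for s in predecessors.values())

    def prob(u, w):
        cont = len(predecessors[w]) / bigram_types       # continuation probability P_cont(w)
        if uni[u] == 0:
            return cont                                  # unseen history: back off fully
        lam = discount * len(followers[u]) / uni[u]      # interpolation weight
        return max(big[(u, w)] - discount, 0.0) / uni[u] + lam * cont
    return prob

p = train_kn_bigram([["a", "b", "a"], ["a", "c", "a"]])
print(p("a", "b"), p("b", "c"))  # the unseen bigram (b, c) still gets mass via P_cont
```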
Decision Tree Language Models Goal: history clustering by a binary decision tree (DT) • Internal nodes: a set of histories, one or two questions • Leaf nodes: a set of histories • Node splitting algorithms • DT growing algorithms
Example DT
Training data: aba, aca, acb, bcb, bda (history = first two letters, predicted letter = last)
• Root node {ab, ac, bc, bd}, predicted-letter counts a:3 b:2
• "Is the first word 'a'?" → child {ab, ac}, counts a:2 b:1
• "Is the first word 'b'?" → child {bc, bd}, counts a:1 b:1
• New event 'cba' (history 'cb'): answers no to both questions, so it is stuck (see the toy sketch below)
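A toy sketch of the example above (the data structure and function names are made up for illustration): the root's two questions route seen histories to the leaves, while the history 'cb' of the new event 'cba' answers no to both and has nowhere to go.

```python
# Toy version of the example DT. Training events (history = first two letters,
# predicted letter = last): aba, aca, acb, bcb, bda.
root = {
    "histories": {"ab", "ac", "bc", "bd"},   # counts a:3  b:2
    "children": [
        # "Is the first word 'a'?"
        {"answer": "a", "histories": {"ab", "ac"}, "counts": {"a": 2, "b": 1}},
        # "Is the first word 'b'?"
        {"answer": "b", "histories": {"bc", "bd"}, "counts": {"a": 1, "b": 1}},
    ],
}

def classify(history):
    """Route a history to a leaf by answering the root's questions about its first word."""
    for child in root["children"]:
        if history[0] == child["answer"]:
            return child
    return None  # 'no' to every question: the history is stuck at the root

print(classify("ab")["counts"])   # {'a': 2, 'b': 1}
print(classify("cb"))             # None -- the new event 'cba' is stuck
```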
Previous Work • DT is an appealing idea: it deals with data sparseness • [Bahl et al., 1989] 20 words in the history, slightly better than a 3-gram • [Potamianos & Jelinek, 1998] fair comparison, negative results on letter n-grams • Both are top-down with a stopping criterion • Why doesn't it work in practice? • Training data fragmentation: data sparseness • No theoretically founded stopping criterion: early termination • Greedy algorithms: early termination
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
Random Forests • [Amit & Geman 1997] shape recognition with randomized trees • [Ho 1998] random subspace • [Breiman 2001] random forests Random Forest (RF): a classifier consisting of a collection of tree-structured classifiers.
Our Goal Main problems: • Data sparseness • Smoothing • Early termination • Greedy algorithms Expectations from Random Forests: • Less greedy algorithms: randomization and voting • Avoid early termination: randomization • Conquer data sparseness: voting
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram: general approach • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
General DT Growing Approach • Grow a DT until maximum depth using training data • Perform no smoothing during growing • Prune fully grown DT to maximize heldout data likelihood • Incorporate KN smoothing during pruning
Node Splitting Algorithm Questions: about identities of words in the history Definitions: • H(p): the set of histories in node p • position i: distance from a word in the history to the predicted word • β_i(v): the set of histories with word v in position i • split: two non-empty sets A_i and B_i, each a union of sets β_i(v) • L(A_i): training data log-likelihood of the node under the split A_i / B_i, using relative frequencies
Node Splitting Algorithm
Algorithm sketch (see the code sketch after this slide):
• For each position i:
a) Initialization: A_i, B_i
b) For each β_i(v) ∈ A_i: tentatively move β_i(v) to B_i; calculate the log-likelihood increase L(A_i − β_i(v)) − L(A_i); if the increase is positive, keep the move and modify the counts
c) Carry out the same for each β_i(v) ∈ B_i
d) Repeat b)-c) until no move is possible
• Split the node according to the best position: the one whose increase in log-likelihood is largest
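A simplified sketch of the exchange-style splitting just described, assuming each training event at the node is a (history, predicted word) pair; the random initialization and the recomputation of side counts on every move are illustrative simplifications (the slides' version modifies the counts incrementally).

```python
import math
import random
from collections import defaultdict

def set_loglik(counts):
    """Training log-likelihood of one side of a split under relative-frequency estimates."""
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values() if c > 0) if total else 0.0

def split_position(events, i):
    """Try to split a node on history position i; events are (history, predicted_word) pairs.
    beta[v] collects predicted-word counts for histories whose position-i word is v."""
    beta = defaultdict(lambda: defaultdict(int))
    for hist, w in events:
        beta[hist[i]][w] += 1

    vs = list(beta)
    random.shuffle(vs)                                   # randomized initialization of A_i, B_i
    half = max(1, len(vs) // 2)
    A, B = set(vs[:half]), set(vs[half:])

    def side_counts(side):
        out = defaultdict(int)
        for v in side:
            for w, c in beta[v].items():
                out[w] += c
        return out

    def total_loglik():
        return set_loglik(side_counts(A)) + set_loglik(side_counts(B))

    moved = True
    while moved:                                         # repeat until no move is possible
        moved = False
        for src, dst in ((A, B), (B, A)):
            for v in list(src):
                if len(src) == 1:                        # keep both sides non-empty
                    continue
                before = total_loglik()
                src.remove(v); dst.add(v)                # tentatively move beta_i(v)
                if total_loglik() > before:
                    moved = True                         # positive increase: keep the move
                else:
                    dst.remove(v); src.add(v)            # otherwise undo it
    gain = total_loglik() - set_loglik(side_counts(A | B))
    return gain, A, B

def best_split(events, history_length):
    """Split on the best position: the one with the largest log-likelihood increase."""
    return max((split_position(events, i) for i in range(history_length)),
               key=lambda t: t[0])
```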
Pruning a Decision Tree Smoothing: KN-style smoothing of the leaf probabilities, backing off to a lower-order KN model Define: • L(p): set of all leaves of the subtree rooted in p • LH(p): smoothed heldout data log-likelihood of p treated as a leaf • LH(L(p)): smoothed heldout data log-likelihood of the leaves L(p) • potential of p: LH(L(p)) − LH(p) Pruning: traverse all internal nodes, prune the subtree rooted in p if its potential is negative (similar to CART; see the sketch below)
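A minimal sketch of the CART-style pruning step, assuming a node object with a children list and a caller-supplied heldout_loglik(node) scorer that returns the smoothed heldout log-likelihood LH of treating that node as a leaf (both names are assumptions, not from the slides).

```python
def prune(p, heldout_loglik):
    """Bottom-up pruning: compute LH(L(p)) from the children, compare with LH(p),
    and collapse the subtree rooted in p whenever its potential LH(L(p)) - LH(p) < 0."""
    if not p.children:
        return heldout_loglik(p)                 # a leaf contributes its own LH
    leaves_ll = sum(prune(child, heldout_loglik) for child in p.children)
    potential = leaves_ll - heldout_loglik(p)    # potential of the internal node p
    if potential < 0:                            # the subtree hurts heldout likelihood
        p.children = []                          # prune: p becomes a leaf
        return heldout_loglik(p)
    return leaves_ll                             # keep the subtree
```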
Towards Random Forests Randomized question selection: • Randomized initialization of A_i, B_i • Randomized position selection Generating the random forest LM: • M decision trees are grown randomly • Each DT generates a probability sequence on the test data • Aggregation: P_RF(w | h) = (1/M) Σ_{j=1..M} P_DT_j(w | h) (sketched below)
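A one-function sketch of the aggregation step, assuming each DT-LM is callable as dt(w, history) and returns a smoothed probability (an illustrative interface, not the authors' code).

```python
def rf_probability(dt_models, w, history):
    """P_RF(w | h) = (1/M) * sum over the M decision trees of P_DTj(w | h)."""
    return sum(dt(w, history) for dt in dt_models) / len(dt_models)
```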
Remarks on RF-LM Random Forest Language Model (RF-LM): a collection of randomly constructed DT-LMs • A DT-LM is an RF-LM: a small forest (M = 1) • An n-gram LM is a DT-LM: no pruning • Therefore, an n-gram LM is an RF-LM! • Single compact model
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
SLM Probabilities • Joint probability of words and parse: P(W, T), the product of the probabilities assigned by the SLM components (word predictor, tagger, parser) as the sentence and its partial parses are built left to right • Word probabilities: P(w_i | W_{i−1}), obtained by summing over the partial parses kept in the SLM's beam, with the predictor conditioning each word on the two previous exposed heads
Using RFs for the SLM • Ideally: run the SLM one time with the aggregated forest • Parallel approximation: run the SLM M times, once per random DT • Aggregate the M probability sequences (see the sketch below)
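A small sketch of the parallel approximation, assuming each of the M SLM runs has already produced a per-word probability sequence over the same test data (the list-of-lists layout is an assumption for illustration).

```python
import math

def aggregate_parallel_runs(prob_sequences):
    """Average the M per-word probability sequences position by position,
    then report the perplexity of the aggregated sequence."""
    m, n = len(prob_sequences), len(prob_sequences[0])
    avg = [sum(run[i] for run in prob_sequences) / m for i in range(n)]
    return math.exp(-sum(math.log(p) for p in avg) / n)

# Example: three runs over a four-word test set (toy numbers).
print(aggregate_parallel_runs([[0.10, 0.20, 0.05, 0.30],
                               [0.12, 0.18, 0.06, 0.25],
                               [0.09, 0.22, 0.04, 0.28]]))
```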
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
Experiments Goal: Compare with Kneser-Ney (KN) • Perplexity (PPL): • UPenn Treebank: 1 million words training, 82k test • Normalized text • Word Error Rate (WER): • WSJ text: 20 or 40 million words training • WSJ DARPA’93 HUB1 test data: 213 utterances, 3446 words • N-best rescoring: standard trigram baseline on 40 million words
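A minimal sketch of the N-best rescoring setup, assuming each hypothesis carries its acoustic score and word sequence; the language-model weight and word insertion penalty are illustrative tuning knobs, not values from the slides.

```python
def rescore_nbest(hypotheses, lm_logprob, lm_weight=12.0, word_penalty=0.0):
    """Re-rank an N-best list: combine the acoustic score with the new LM's
    log-probability (RF or KN) and pick the highest-scoring hypothesis."""
    def score(hyp):
        words = hyp["words"]
        return hyp["acoustic"] + lm_weight * lm_logprob(words) + word_penalty * len(words)
    return max(hypotheses, key=score)

# Usage with a toy LM that just penalizes length (illustrative only).
best = rescore_nbest(
    [{"words": ["the", "cat"], "acoustic": -120.0},
     {"words": ["a", "cat"], "acoustic": -118.0}],
    lm_logprob=lambda ws: -2.0 * len(ws),
)
print(best["words"])
```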
Experiments: trigram perplexity • Baseline: KN-trigram • No randomization: DT-trigram • 100 random DTs: RF-trigram
Experiments: Aggregating • Improvements within 10 trees!
Experiments: Why does it work? An event (h, w) counts as seen if: • KN-trigram: the trigram (w_{i−2}, w_{i−1}, w_i) occurs in the training data • DT-trigram: (Φ(w_{i−2}, w_{i−1}), w_i) occurs in the training data, where Φ is the DT's history clustering • RF-trigram: (Φ_m(w_{i−2}, w_{i−1}), w_i) occurs in the training data for any m of the M clusterings • The forest therefore covers many more test events than any single DT or the raw trigram
Experiments: SLM perplexity • Baseline: KN-SLM • 100 random DTs for each of the components • Parallel approximation • Interpolate with KN-trigram
Experiments: speech recognition • Baseline: KN-trigram, KN-SLM • 100 random DTs for RF-trigram, RF-SLM-P (predictor) • Interpolate with KN-trigram (40M)
Conclusions • New RF language modeling approach • More general LM: RF ⊇ DT ⊇ n-gram • Randomized history clustering: non-reciprocal data sharing • Good performance in PPL and WER • Generalizes well to unseen data • Portable to other tasks
Future Work • Random samples of training data • More linguistically oriented questions • Direct implementation in the SLM • Lower order random forests • Larger test data for speech recognition • Language model adaptation