Random Forests for Language Modeling Peng Xu, Frederick Jelinek
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
Basic Language Modeling • Estimate the source probability P(W) = P(w_1 … w_N) from a training corpus: a large amount of text chosen for similarity to the expected sentences • Parametric conditional models: P(W) = Π_i P(w_i | h_i), where the history h_i = w_1 … w_{i−1}
Basic Language Modeling • Smooth models: P(w | h) > 0 for every word w and history h • Perplexity (PPL): PPL = exp( −(1/N) Σ_i log P(w_i | h_i) ), the average per-word branching on test data (lower is better) • n-gram models: approximate the history by its last n−1 words, P(w_i | h_i) ≈ P(w_i | w_{i−n+1} … w_{i−1})
Estimate n-gram Parameters • Maximum Likelihood (ML) estimate: P(w_i | w_{i−2} w_{i−1}) = C(w_{i−2} w_{i−1} w_i) / C(w_{i−2} w_{i−1}) • Best on training data: lowest PPL • Data sparseness problem: n = 3, |V| = 10k gives 10^12 possible trigrams, so about a trillion words of training data would be needed • Zero probability for almost all test data!
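A minimal Python sketch (with an illustrative toy corpus) of the ML trigram estimate and the perplexity computation above; the zero probabilities on unseen trigrams are exactly the sparseness problem the following slides address.

```python
from collections import defaultdict
import math

def ml_trigram(corpus):
    """Maximum-likelihood trigram estimate: P(w | u, v) = C(u, v, w) / C(u, v)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in corpus:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for u, v, w in zip(words, words[1:], words[2:]):
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1
    return lambda u, v, w: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

def perplexity(prob, corpus):
    """PPL = exp(-(1/N) * sum log P(w_i | history)); any zero probability makes it infinite."""
    logp, n = 0.0, 0
    for sent in corpus:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for u, v, w in zip(words, words[1:], words[2:]):
            p = prob(u, v, w)
            if p == 0.0:
                return float("inf")
            logp += math.log(p)
            n += 1
    return math.exp(-logp / n)

train = [["a", "b", "a"], ["a", "c", "a"]]
p = ml_trigram(train)
print(perplexity(p, train))               # lowest PPL: the ML estimate fits the training data
print(perplexity(p, [["a", "b", "c"]]))   # unseen trigram (a, b, c) -> infinite PPL
```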
Dealing with Sparsity • Smoothing: use lower-order statistics • Word clustering: reduce the size of V • History clustering: reduce the number of histories • Maximum entropy: use exponential models • Neural network: represent words in real space, use an exponential model
Smoothing Techniques Good smoothing techniques: Deleted Interpolation, Katz, Absolute Discounting, Kneser-Ney (KN) • Kneser-Ney: consistently the best [Chen & Goodman, 1998]
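The slides do not spell out the Kneser-Ney formula, so here is a minimal bigram sketch of interpolated Kneser-Ney smoothing in its usual textbook form; the fixed discount of 0.75 and all helper names are illustrative, not taken from the slides.

```python
from collections import defaultdict

def train_kn_bigram(corpus, discount=0.75):
    """Interpolated Kneser-Ney for bigrams:
    P(w | u) = max(C(u,w) - D, 0)/C(u) + (D * N1+(u,*)/C(u)) * P_cont(w),
    where P_cont(w) = N1+(*,w) / N1+(*,*) uses continuation counts, not raw counts."""
    big, uni = defaultdict(int), defaultdict(int)
    followers = defaultdict(set)     # distinct words seen after history u
    predecessors = defaultdict(set)  # distinct histories seen before word w
    for sent in corpus:
        words = ["<s>"] + sent + ["</s>"]
        for u, w in zip(words, words[1:]):
            big[(u, w)] += 1
            uni[u] += 1
            followers[u].add(w)
            predecessors[w].add(u)
    bigram_types = sum(len(s) for s in predecessors.values())

    def prob(u, w):
        cont = len(predecessors[w]) / bigram_types       # continuation probability P_cont(w)
        if uni[u] == 0:
            return cont                                  # unseen history: back off fully
        lam = discount * len(followers[u]) / uni[u]      # interpolation weight
        return max(big[(u, w)] - discount, 0.0) / uni[u] + lam * cont
    return prob

p = train_kn_bigram([["a", "b", "a"], ["a", "c", "a"]])
print(p("a", "b"), p("b", "c"))  # the unseen bigram (b, c) still gets mass via P_cont
```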
Decision Tree Language Models Goal: history clustering by a binary decision tree (DT) • Internal nodes: a set of histories, one or two questions • Leaf nodes: a set of histories • Node splitting algorithms • DT growing algorithms
Example DT
Training data: aba, aca, acb, bcb, bda (history = first two letters, predicted letter = last)
• Root node {ab, ac, bc, bd}, predicted-letter counts a:3 b:2
• "Is the first word 'a'?" → child {ab, ac}, counts a:2 b:1
• "Is the first word 'b'?" → child {bc, bd}, counts a:1 b:1
• New event 'cba' (history 'cb'): answers no to both questions, so it is stuck (see the toy sketch below)
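A toy sketch of the example above (the data structure and function names are made up for illustration): the root's two questions route seen histories to the leaves, while the history 'cb' of the new event 'cba' answers no to both and has nowhere to go.

```python
# Toy version of the example DT. Training events (history = first two letters,
# predicted letter = last): aba, aca, acb, bcb, bda.
root = {
    "histories": {"ab", "ac", "bc", "bd"},   # counts a:3  b:2
    "children": [
        # "Is the first word 'a'?"
        {"answer": "a", "histories": {"ab", "ac"}, "counts": {"a": 2, "b": 1}},
        # "Is the first word 'b'?"
        {"answer": "b", "histories": {"bc", "bd"}, "counts": {"a": 1, "b": 1}},
    ],
}

def classify(history):
    """Route a history to a leaf by answering the root's questions about its first word."""
    for child in root["children"]:
        if history[0] == child["answer"]:
            return child
    return None  # 'no' to every question: the history is stuck at the root

print(classify("ab")["counts"])   # {'a': 2, 'b': 1}
print(classify("cb"))             # None -- the new event 'cba' is stuck
```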
Previous Work • DT is an appealing idea: it deals with data sparseness • [Bahl et al., 1989] 20 words in the history, slightly better than a 3-gram • [Potamianos & Jelinek, 1998] fair comparison, negative results on letter n-grams • Both are top-down with a stopping criterion • Why doesn't it work in practice? • Training data fragmentation: data sparseness • No theoretically founded stopping criterion: early termination • Greedy algorithms: early termination
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
Random Forests • [Amit & Geman 1997] shape recognition with randomized trees • [Ho 1998] random subspace • [Breiman 2001] random forests Random Forest (RF): a classifier consisting of a collection of tree-structured classifiers.
Our Goal Main problems: • Data sparseness • Smoothing • Early termination • Greedy algorithms Expectations from Random Forests: • Less greedy algorithms: randomization and voting • Avoid early termination: randomization • Conquer data sparseness: voting
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram: general approach • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
General DT Growing Approach • Grow a DT until maximum depth using training data • Perform no smoothing during growing • Prune fully grown DT to maximize heldout data likelihood • Incorporate KN smoothing during pruning
Node Splitting Algorithm Questions: about identities of words in the history Definitions: • H(p): the set of histories in node p • position i: distance from a word in the history to the predicted word • β_i(v): the set of histories with word v in position i • split: two non-empty sets A_i and B_i, each a union of sets β_i(v) • L(A_i): training data log-likelihood of the node under the split A_i / B_i, using relative frequencies
Node Splitting Algorithm
Algorithm sketch (see the code sketch after this slide):
• For each position i:
a) Initialization: A_i, B_i
b) For each β_i(v) ∈ A_i: tentatively move β_i(v) to B_i; calculate the log-likelihood increase L(A_i − β_i(v)) − L(A_i); if the increase is positive, keep the move and modify the counts
c) Carry out the same for each β_i(v) ∈ B_i
d) Repeat b)-c) until no move is possible
• Split the node according to the best position: the one whose increase in log-likelihood is largest
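A simplified sketch of the exchange-style splitting just described, assuming each training event at the node is a (history, predicted word) pair; the random initialization and the recomputation of side counts on every move are illustrative simplifications (the slides' version modifies the counts incrementally).

```python
import math
import random
from collections import defaultdict

def set_loglik(counts):
    """Training log-likelihood of one side of a split under relative-frequency estimates."""
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values() if c > 0) if total else 0.0

def split_position(events, i):
    """Try to split a node on history position i; events are (history, predicted_word) pairs.
    beta[v] collects predicted-word counts for histories whose position-i word is v."""
    beta = defaultdict(lambda: defaultdict(int))
    for hist, w in events:
        beta[hist[i]][w] += 1

    vs = list(beta)
    random.shuffle(vs)                                   # randomized initialization of A_i, B_i
    half = max(1, len(vs) // 2)
    A, B = set(vs[:half]), set(vs[half:])

    def side_counts(side):
        out = defaultdict(int)
        for v in side:
            for w, c in beta[v].items():
                out[w] += c
        return out

    def total_loglik():
        return set_loglik(side_counts(A)) + set_loglik(side_counts(B))

    moved = True
    while moved:                                         # repeat until no move is possible
        moved = False
        for src, dst in ((A, B), (B, A)):
            for v in list(src):
                if len(src) == 1:                        # keep both sides non-empty
                    continue
                before = total_loglik()
                src.remove(v); dst.add(v)                # tentatively move beta_i(v)
                if total_loglik() > before:
                    moved = True                         # positive increase: keep the move
                else:
                    dst.remove(v); src.add(v)            # otherwise undo it
    gain = total_loglik() - set_loglik(side_counts(A | B))
    return gain, A, B

def best_split(events, history_length):
    """Split on the best position: the one with the largest log-likelihood increase."""
    return max((split_position(events, i) for i in range(history_length)),
               key=lambda t: t[0])
```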
Pruning a Decision Tree Smoothing: KN-style smoothing of the leaf probabilities, backing off to a lower-order KN model Define: • L(p): set of all leaves of the subtree rooted in p • LH(p): smoothed heldout data log-likelihood of p treated as a leaf • LH(L(p)): smoothed heldout data log-likelihood of the leaves L(p) • potential of p: LH(L(p)) − LH(p) Pruning: traverse all internal nodes, prune the subtree rooted in p if its potential is negative (similar to CART; see the sketch below)
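A minimal sketch of the CART-style pruning step, assuming a node object with a children list and a caller-supplied heldout_loglik(node) scorer that returns the smoothed heldout log-likelihood LH of treating that node as a leaf (both names are assumptions, not from the slides).

```python
def prune(p, heldout_loglik):
    """Bottom-up pruning: compute LH(L(p)) from the children, compare with LH(p),
    and collapse the subtree rooted in p whenever its potential LH(L(p)) - LH(p) < 0."""
    if not p.children:
        return heldout_loglik(p)                 # a leaf contributes its own LH
    leaves_ll = sum(prune(child, heldout_loglik) for child in p.children)
    potential = leaves_ll - heldout_loglik(p)    # potential of the internal node p
    if potential < 0:                            # the subtree hurts heldout likelihood
        p.children = []                          # prune: p becomes a leaf
        return heldout_loglik(p)
    return leaves_ll                             # keep the subtree
```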
Towards Random Forests Randomized question selection: • Randomized initialization of A_i, B_i • Randomized position selection Generating the random forest LM: • M decision trees are grown randomly • Each DT generates a probability sequence on the test data • Aggregation: P_RF(w | h) = (1/M) Σ_{j=1..M} P_DT_j(w | h) (sketched below)
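A one-function sketch of the aggregation step, assuming each DT-LM is callable as dt(w, history) and returns a smoothed probability (an illustrative interface, not the authors' code).

```python
def rf_probability(dt_models, w, history):
    """P_RF(w | h) = (1/M) * sum over the M decision trees of P_DTj(w | h)."""
    return sum(dt(w, history) for dt in dt_models) / len(dt_models)
```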
Remarks on RF-LM Random Forest Language Model (RF-LM): a collection of randomly constructed DT-LMs • A DT-LM is an RF-LM: a small forest (M = 1) • An n-gram LM is a DT-LM: no pruning • Therefore, an n-gram LM is an RF-LM! • Single compact model
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
SLM Probabilities • Joint probability of words and parse: P(W, T), the product of the probabilities assigned by the SLM components (word predictor, tagger, parser) as the sentence and its partial parses are built left to right • Word probabilities: P(w_i | W_{i−1}), obtained by summing over the partial parses kept in the SLM's beam, with the predictor conditioning each word on the two previous exposed heads
Using RFs for the SLM • Ideally: run the SLM one time with the aggregated forest • Parallel approximation: run the SLM M times, once per random DT • Aggregate the M probability sequences (see the sketch below)
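A small sketch of the parallel approximation, assuming each of the M SLM runs has already produced a per-word probability sequence over the same test data (the list-of-lists layout is an assumption for illustration).

```python
import math

def aggregate_parallel_runs(prob_sequences):
    """Average the M per-word probability sequences position by position,
    then report the perplexity of the aggregated sequence."""
    m, n = len(prob_sequences), len(prob_sequences[0])
    avg = [sum(run[i] for run in prob_sequences) / m for i in range(n)]
    return math.exp(-sum(math.log(p) for p in avg) / n)

# Example: three runs over a four-word test set (toy numbers).
print(aggregate_parallel_runs([[0.10, 0.20, 0.05, 0.30],
                               [0.12, 0.18, 0.06, 0.25],
                               [0.09, 0.22, 0.04, 0.28]]))
```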
Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work
Experiments Goal: Compare with Kneser-Ney (KN) • Perplexity (PPL): • UPenn Treebank: 1 million words training, 82k test • Normalized text • Word Error Rate (WER): • WSJ text: 20 or 40 million words training • WSJ DARPA’93 HUB1 test data: 213 utterances, 3446 words • N-best rescoring: standard trigram baseline on 40 million words
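A minimal sketch of the N-best rescoring setup, assuming each hypothesis carries its acoustic score and word sequence; the language-model weight and word insertion penalty are illustrative tuning knobs, not values from the slides.

```python
def rescore_nbest(hypotheses, lm_logprob, lm_weight=12.0, word_penalty=0.0):
    """Re-rank an N-best list: combine the acoustic score with the new LM's
    log-probability (RF or KN) and pick the highest-scoring hypothesis."""
    def score(hyp):
        words = hyp["words"]
        return hyp["acoustic"] + lm_weight * lm_logprob(words) + word_penalty * len(words)
    return max(hypotheses, key=score)

# Usage with a toy LM that just penalizes length (illustrative only).
best = rescore_nbest(
    [{"words": ["the", "cat"], "acoustic": -120.0},
     {"words": ["a", "cat"], "acoustic": -118.0}],
    lm_logprob=lambda ws: -2.0 * len(ws),
)
print(best["words"])
```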
Experiments: trigram perplexity • Baseline: KN-trigram • No randomization: DT-trigram • 100 random DTs: RF-trigram
Experiments: Aggregating • Improvements within 10 trees!
Experiments: Why does it work? An event (h, w) counts as seen if: • KN-trigram: the trigram (w_{i−2}, w_{i−1}, w_i) occurs in the training data • DT-trigram: (Φ(w_{i−2}, w_{i−1}), w_i) occurs in the training data, where Φ is the DT's history clustering • RF-trigram: (Φ_m(w_{i−2}, w_{i−1}), w_i) occurs in the training data for any m of the M clusterings • The forest therefore covers many more test events than any single DT or the raw trigram
Experiments: SLM perplexity • Baseline: KN-SLM • 100 random DTs for each of the components • Parallel approximation • Interpolate with KN-trigram
Experiments: speech recognition • Baseline: KN-trigram, KN-SLM • 100 random DTs for RF-trigram, RF-SLM-P (predictor) • Interpolate with KN-trigram (40M)
Conclusions • New RF language modeling approach • More general LM: RF ⊇ DT ⊇ n-gram • Randomized history clustering: non-reciprocal data sharing • Good performance in PPL and WER • Generalizes well to unseen data • Portable to other tasks
Future Work • Random samples of training data • More linguistically oriented questions • Direct implementation in the SLM • Lower order random forests • Larger test data for speech recognition • Language model adaptation