Random Forests and the Data Sparseness Problem in Language Modeling
Shih-Hsiang Lin (林士翔)
Department of Computer Science & Information Engineering, National Taiwan Normal University

Reference:
1. P. Xu and F. Jelinek, “Random Forests and the Data Sparseness Problem in Language Modeling,” Computer Speech and Language, 21, 2007
2. L. Bahl et al., “A Tree-based Statistical Language Model for Natural Language Speech Recognition,” IEEE Trans. on Acoustics, Speech and Signal Processing, 37, 1989
3. P. Chou, “Optimal Partitioning for Classification and Regression Trees,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 13, 1991
4. G. Potamianos and F. Jelinek, “A Study of n-gram and Decision Tree Letter Language Modeling Methods,” Speech Communication, 24, 1998
Outline • Introduction • N-gram language models, Data sparseness problem, Smoothing, Alternative ways, Beyond n-gram models, etc. • Decision tree language models
Introduction: Φ is an equivalence classifier • The purpose of a language model is to estimate the probability P(W) of a word string W = w_1 w_2 … w_N • We need a training corpus to estimate the probabilities P(w_i | w_1^{i-1}) • It is clear that the number of probabilities to be estimated and stored would be prohibitively large • Therefore, histories h_i = w_1^{i-1} for word w_i are usually grouped into equivalence classes Φ(h_i) • Once Φ is defined, the task of language modeling is to estimate the parameters P(w_i | Φ(h_i)) reliably from the training corpus • Language models should assign nonzero probabilities to any words following any histories
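To make the role of the equivalence classification concrete, the standard chain-rule decomposition these bullets refer to can be written out as follows (Φ denotes the history classification):

```latex
P(W) = \prod_{i=1}^{N} P\bigl(w_i \mid w_1^{i-1}\bigr)
     \approx \prod_{i=1}^{N} P\bigl(w_i \mid \Phi(w_1^{i-1})\bigr)
```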
Introduction: n-gram language models • The most widely used language models, n-gram language models, use the identities of the last n-1 words as equivalence classes: Φ(h_i) = w_{i-n+1}^{i-1} • w_{i-n+1}^{i-1} denotes the word sequence w_{i-n+1}, …, w_{i-1}, which we normally call the n-gram history • The value n of an n-gram model is also the order of the model • Models with order less than n are referred to as lower order models • e.g. trigram (n=3), bigram (n=2), unigram (n=1) • The maximum likelihood (ML) estimate of P(w_i | w_{i-n+1}^{i-1}) is the relative frequency C(w_{i-n+1}^{i}) / C(w_{i-n+1}^{i-1}) • Example (i=3, n=3): for the sentence 國立 台灣 師範 大學, the trigram at position i=3 is 國立 台灣 師範 and its history is 國立 台灣
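A minimal sketch of the ML estimate above, assuming whitespace-tokenized sentences with added boundary tokens; the function and variable names are illustrative, not from the paper.

```python
from collections import defaultdict

def ml_ngram(corpus_sentences, n=3):
    """Relative-frequency (ML) n-gram estimates: C(w_{i-n+1}^i) / C(w_{i-n+1}^{i-1})."""
    ngram_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for sentence in corpus_sentences:
        words = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(words)):
            history = tuple(words[i - n + 1:i])
            ngram_counts[history + (words[i],)] += 1
            history_counts[history] += 1

    def prob(word, history):
        history = tuple(history)
        if history_counts[history] == 0:
            return 0.0  # unseen history: the ML estimate gives no mass without smoothing
        return ngram_counts[history + (word,)] / history_counts[history]

    return prob

# Usage sketch: p = ml_ngram(["國立 台灣 師範 大學"]); p("師範", ["國立", "台灣"]) -> 1.0
```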
Introduction: Data sparseness problem (data set: UPenn Treebank; training data: 1 million words; test data: 82 thousand words) • The number of parameters in an n-gram model grows exponentially as the order n increases • e.g., for a vocabulary of size 10,000 in a trigram model, the total number of probabilities to be estimated is 10,000^3 = 10^12 • For any training data of manageable size, many of the probabilities will be zero if the ML estimate is used • The situation would have been much worse if only out-of-domain training data were available • Without some proper treatment, the ML estimated n-gram models cannot be used in any practical applications • Therefore, we have to deal with the trade-off between power and generalization, or the bias and variance trade-off
Introduction: Smoothing • If a language model can assign nonzero probabilities to any word string, we call it a smoothed language model • Any technique that results in a smoothed language model is called a smoothing technique • add-one smoothing, add-delta smoothing, n-gram back-off, interpolated models • Typically, smoothing is achieved by taking some probability mass from seen n-grams and distributing this mass to unseen n-grams • Most of the smoothing techniques use lower order probabilities to smooth an n-gram language model • Smoothing is often recursively defined • Among the smoothing techniques reviewed, Kneser–Ney smoothing is consistently the best across different data sizes and domains • It subtracts a fixed discount value from nonzero n-gram counts
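For reference, one common way to write interpolated Kneser–Ney with a single fixed discount D is sketched below; N_{1+}(h •) is the number of distinct words observed after history h, and in the full method the lower-order distribution is itself estimated from continuation counts (details vary across implementations).

```latex
P_{KN}(w_i \mid w_{i-n+1}^{i-1})
  = \frac{\max\!\bigl(C(w_{i-n+1}^{i}) - D,\ 0\bigr)}{C(w_{i-n+1}^{i-1})}
  + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1}),
\qquad
\lambda(w_{i-n+1}^{i-1}) = \frac{D\, N_{1+}(w_{i-n+1}^{i-1}\,\bullet)}{C(w_{i-n+1}^{i-1})}
```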
Introduction: Alternative ways • Use clusters of words to reduce the number of parameters in n-gram models, e.g. P(w_i | h_i) ≈ P(w_i | c_i) P(c_i | c_{i-n+1}^{i-1}), where c_i denotes the cluster of w_i • However, the number of clusters has to be searched for experimentally (brute force) • Decision trees have also been used for language models • The decision tree is used to perform history equivalence classification • Language models in an exponential form assign nonzero probability to any n-gram • e.g., maximum entropy, neural network language models, discriminative language models
Introduction: Beyond n-gram models • A fundamental flaw in n-gram models: long distance dependencies cannot be captured • Increasing the order n does not lead to better models • long distance dependencies are often many words apart and the data sparseness problem undermines the possible benefit as n grows • The structured language model (SLM) uses syntactic information beyond the regular n-gram models to capture some long distance dependencies • The syntactic analysis of a partial sentence can be seen as an equivalence classification of the history • based on statistical parsing techniques which allow syntactic analysis of sentences
Introduction: Getting more data from the web • More and more data are becoming available in the form of the World Wide Web • Search engines have more than one billion web pages in their databases • With such a large amount of data, it seemed hard to argue that the data sparseness problem still exists • Significant word error rate reductions were reported due to the additional training data • However, it remains an open question whether improvements in smoothing are orthogonal to getting more data
Decision tree language models • In the typical classification paradigm, we want to assign each data point x a class label y ∈ Y, where Y is a finite set of labels • We are normally given data and label pairs (x_i, y_i) as the training data, from which we need to learn a classifier • The problem of language modeling can also be seen as a classification problem in which • x corresponds to a history • y corresponds to a word • The training data of length N can be considered as pairs (h_i, w_i) • Instead of assigning to every datum a ‘‘label’’, we want to estimate the probability distribution of all words given h, the n-gram history • High dimensionality makes language modeling a very special and difficult classification problem
Decision tree classifiers • Why decision trees? • robustness to noisy data • the ability to learn disjunctive expressions • A decision tree is a graph consisting of a finite set of internal nodes and a finite set of terminal or leaf nodes • Each internal node has k mutually exclusive questions and k+1 child nodes • The questions concern properties of data points • A YES answer to one of the questions leads to the corresponding child node among the first k • If a data point answers NO to all questions, it proceeds to the last child node
Construction of decision trees • Decision trees are normally constructed based on some training data using greedy top-down algorithms • (1) Start with a single node and all training data. • (2) If all training data in the current node belong to the same class, then mark the current node with that class label and make the current node a leaf. • (3) Otherwise, score all possible questions using some goodness measure. • (4) Choose the question that achieves the best score, split the current node according to the question into child nodes. The new nodes are new candidates for splitting. • (5) Partition the training data according to answers to the question and pass the data to the corresponding child nodes. • (6) If there are no more candidate nodes, stop. Otherwise go to 2.
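A compact sketch of the greedy top-down procedure above, using entropy as the impurity measure; the data representation, question predicates, and stopping threshold are illustrative choices, not the ones used in the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def grow_tree(data, candidate_questions, min_gain=1e-9):
    """data: list of (point, label); candidate_questions: boolean predicates on points."""
    labels = [y for _, y in data]
    if len(set(labels)) <= 1:                       # step (2): pure node becomes a leaf
        return {"leaf": True, "label": labels[0] if labels else None}
    best = None
    for q in candidate_questions:                   # step (3): score all possible questions
        yes = [(x, y) for x, y in data if q(x)]
        no = [(x, y) for x, y in data if not q(x)]
        if not yes or not no:
            continue
        # weighted entropy after the split
        h = (len(yes) * entropy([y for _, y in yes]) +
             len(no) * entropy([y for _, y in no])) / len(data)
        gain = entropy(labels) - h
        if best is None or gain > best[0]:
            best = (gain, q, yes, no)
    if best is None or best[0] <= min_gain:
        return {"leaf": True, "label": Counter(labels).most_common(1)[0][0]}
    gain, q, yes, no = best                         # steps (4)-(5): split and recurse
    return {"leaf": False, "question": q,
            "yes": grow_tree(yes, candidate_questions, min_gain),
            "no": grow_tree(no, candidate_questions, min_gain)}
```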
Construction of decision trees (cont.) • From the procedure above, we can see that there are two basic issues • the goodness measure • the possible questions for each candidate node • Furthermore, the greedy procedure is biased toward the training data • This can lead to a serious overfitting problem, especially when the size of the training data is small • To avoid such overfitting, decision tree construction can be • stopped before it perfectly fits the training data • or post-pruned based on some validation (heldout) data • Along with the three main issues of general decision trees, decision tree language models have another key difficulty • large dimensionality in both the data and label spaces (data sparseness)
Decision tree language models – Goodness measure: entropy • In the decision tree construction for language models, we always use entropy as the goodness measure for selecting questions, and as a stopping criterion • If we use relative frequencies f(·|·) as probabilities • the empirical conditional entropy of development data D with size |D| is defined as H_D = -(1/|D|) Σ_{i=1}^{|D|} log f_D(w_i | Φ(h_i)) • and the cross-entropy of heldout data H with size |H| is H_H = -(1/|H|) Σ_{i=1}^{|H|} log f_D(w_i | Φ(h_i)) • Since we have the data sparseness problem, for some word w that occurs in the heldout data it could happen that f_D(w | Φ(h)) = 0, and we would have H_H = ∞ • smoothed probabilities should be used instead of the relative frequencies
Decision tree language models – Goodness measure: entropy (cont.) • It can be shown that the entropy can always be reduced by splitting a leaf node into two new leaf nodes • Suppose a leaf node n is split into n_1 and n_2, and the resulting entropy is the count-weighted sum H_split = p(n_1) H(n_1) + p(n_2) H(n_2), where p(n_1) and p(n_2) are the fractions of the node's data falling into n_1 and n_2 • The difference in entropy satisfies H(n) - H_split ≥ 0
Decision tree language models – Goodness measure: entropy (cont.) • Equality holds if and only if the distributions induced in both nodes are exactly the same as the distribution induced in the original node • Therefore, no matter how we split a node n into two leaf nodes n_1 and n_2, the entropy will be decreased or kept the same • As a result, if we just use the entropy of the development data as our goodness measure, the decision tree construction will not stop • until either each leaf has only one history or all histories in every leaf node have the same conditional distribution • This will lead to a decision tree that overfits the development data
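Why a split can never increase the development-data entropy follows from the log-sum inequality; in terms of counts, with C(w) = C_1(w) + C_2(w) the count of word w in the parent node and C, C_1, C_2 the corresponding totals:

```latex
\sum_{w} C(w)\,\log\frac{C(w)}{C}
  \;\le\;
\sum_{w} C_1(w)\,\log\frac{C_1(w)}{C_1}
  \;+\;
\sum_{w} C_2(w)\,\log\frac{C_2(w)}{C_2}
```

Equality holds exactly when C_1(w)/C_1 = C_2(w)/C_2 for every w, i.e., when both children induce the same distribution as the parent; dividing by the data size and negating turns this into the entropy statement H(n) ≥ H_split used above.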
Decision tree language models – Question selection (cont.) • Potential questions in the internal nodes can ask about any property of the history words • In a data-driven approach, we assume no external knowledge and will use only n-gram statistics from the training data • Since each potential question splits the histories in a node into two subsets, the total number of possible questions is on the order of 2^{|V|^{n-1}} • This number is prohibitively large, at least in nodes close to the root, as the number of histories in the n-gram case (|V|^{n-1}) is also large • To limit the number of questions, we can restrict them to be about a particular position in the history • “Does the word at position j in the history belong to a set S?” • (the complementary question about the set V \ S is equivalent) • Restricting the questions this way reduces the number of possible questions to at most (n-1) · 2^{|V|}
Decision tree language models – Stopping rule • When we try to find the best split for each node, we measure the split with both the development data entropy and the heldout data entropy • If the best split of a node cannot reduce the heldout data entropy by more than a threshold, the node will not be split and will be marked as a final leaf node • The threshold can control the trade-off between overfitting and generalization power
Decision tree language models – Data sparseness problem • Given the histories in some node, all words in position j form in general only a subset of the entire vocabulary • During test time, if we encounter an unseen history, the word in position j could belong to neither of the two sets • the answer to either question would be NO • the decision to follow either path for this unseen history would not be based on any evidence from the training data • One way to distinguish such unseen histories is to store two questions in a node instead of one • to take special care of unseen histories • Even if a history is seen in the training data, we could still have the unseen event problem • in the training data, a word may never have followed a history or any other histories in its equivalence class
Decision tree language models – Their approach • They use development data likelihood (equivalent to entropy) as their goodness measure both for choosing questions and for the general tree construction • They also use heldout data likelihood (equivalent to cross-entropy) as part of the goodness measure, although not as a stopping rule • One potential defect of Chou’s method is that it requires smoothed probabilities • However, the smoothing leads to much higher computational cost • They use another greedy algorithm (the exchange algorithm) for finding the best split of a node
Decision tree language models – Their approach (cont.) • Assume that we have a node p under consideration for splitting • Initially, we split the histories in p into two nonempty disjoint subsets L and R • The log-likelihood of the development data associated with p under the split {L, R} is LL(p) = Σ_w [ C_L(w) log (C_L(w)/C_L) + C_R(w) log (C_R(w)/C_R) ], where C_L(w) is the count of word w following the histories in L and C_L = Σ_w C_L(w) (similarly for R) • Since only counts are involved in this equation, an efficient data structure can be used to store them for the computation • Then, we try to find the best subsets L and R by tentatively moving histories, in groups of the basic elements, from L to R and vice versa
Decision tree language models – Their approach (cont.) • Suppose β is the basic element we want to move. The log-likelihood after we move β from L to R can be calculated using the above equation with the counts updated as C_L(w) ← C_L(w) - C_β(w) and C_R(w) ← C_R(w) + C_β(w) • If a tentative move results in an increase in log-likelihood, we accept the move and modify the counts; otherwise, the element stays where it was • The algorithm runs until no single move increases the log-likelihood • After all positions in the history are examined, we choose the one with the largest increase in log-likelihood for splitting the node
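A minimal sketch of the exchange procedure described in the last two slides, operating on per-element counts of following words; the names are illustrative, and the totals are recomputed from scratch for clarity (a real implementation updates the counts incrementally, as described above).

```python
import math
import random
from collections import Counter

def side_loglik(counts):
    """Sum_w C(w) * log(C(w)/C) for one child node (0 if the node is empty)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c * math.log(c / total) for c in counts.values() if c > 0)

def exchange_split(basic_elements):
    """basic_elements: dict mapping element id -> Counter of following-word counts.
    Returns a (left_ids, right_ids) split found by greedy exchange from a random start."""
    ids = list(basic_elements)
    # random initialization: flip a fair coin per element, retry if one side is empty
    while True:
        left = {e for e in ids if random.random() < 0.5}
        if 0 < len(left) < len(ids):
            break
    right = set(ids) - left

    def total(side):
        c = Counter()
        for e in side:
            c.update(basic_elements[e])
        return c

    improved = True
    while improved:                      # stop when no single move helps
        improved = False
        for e in ids:
            src, dst = (left, right) if e in left else (right, left)
            if len(src) == 1:
                continue                 # keep both subsets nonempty
            before = side_loglik(total(src)) + side_loglik(total(dst))
            src.remove(e); dst.add(e)    # tentative move
            after = side_loglik(total(src)) + side_loglik(total(dst))
            if after > before:
                improved = True          # accept the move
            else:
                dst.remove(e); src.add(e)  # undo: the element stays where it was
    return left, right
```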
Decision tree language models – Their approach (cont.) • In the literature, the smoothing technique used for DT language models has been focusing on the use of statistics in all nodes along the path a history follows from the root • For example, if n_0, n_1, …, n_k = l is the sequence of all nodes from the root to a leaf node l, then the final smoothed probability distribution of the leaf is P(w | l) = Σ_{i=0}^{k} λ_i P(w | n_i) • where the λ_i are chosen to maximize the probability of some heldout data and Σ_i λ_i = 1 • There are some problems with such smoothing • The λ_i for the same node may not be the same because a node can appear in many paths • If an n-gram event has a zero probability in some leaf node, it will have zero probabilities in many other nodes along the path
Decision tree language models – Their approach (cont.) • Assume we have a map Φ which maps a history to an equivalence class (a node in a decision tree, or a set of histories) • For any one of the nodes, we can use a form similar to interpolated Kneser–Ney smoothing: P(w_i | Φ(h_i)) = max(C_{Φ(h_i)}(w_i) - D, 0) / C(Φ(h_i)) + λ(Φ(h_i)) P_KN(w_i | w_{i-n+2}^{i-1}) • The smoothing in the decision tree is exactly the same as interpolated Kneser–Ney smoothing except for the count of the n-gram event and the interpolation weight • Although some histories share the same equivalence class in a decision tree, they can use different lower order probabilities for smoothing if their lower order histories are different
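A sketch of the node-level smoothing just described, assuming the node stores the counts of words that followed the histories mapped to it and that a smoothed lower-order model is supplied; the paper's estimation of the discount and interpolation weight is more involved than this fixed-discount version.

```python
def dt_kn_prob(word, history, node_counts, lower_order_prob, discount=0.75):
    """Interpolated KN-style probability at a decision-tree node.
    node_counts: dict word -> count of that word following histories mapped to this node.
    lower_order_prob(word, shorter_history): smoothed lower-order probability."""
    total = sum(node_counts.values())
    if total == 0:
        return lower_order_prob(word, history[1:])   # back off entirely for empty nodes
    distinct = sum(1 for c in node_counts.values() if c > 0)
    lam = discount * distinct / total                # interpolation weight lambda(node)
    discounted = max(node_counts.get(word, 0) - discount, 0.0) / total
    return discounted + lam * lower_order_prob(word, history[1:])
```

With counts of at least one and a discount no larger than one, the discounted mass plus lam times a normalized lower-order distribution sums to one over the vocabulary, which is the design point of this interpolation.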
Decision tree language models – Their approach (cont.) • We can put any n-gram event into one of three categories according to this decision tree smoothing
Decision tree language models – Their approach (cont.) • Instead of using heldout data cross-entropy as a stopping rule, they use it as a post-pruning criterion • For each internal node n of the decision tree, we compute its potential as the possible gain in heldout data likelihood from growing n into the subtree rooted in it • During post-pruning, we traverse the complete decision tree using a preorder visit • For each internal node n, we prune the subtree rooted in n if its potential is less than a certain threshold • Once the subtree rooted in a node is pruned, the node becomes a new leaf node • Since turning an internal node into a leaf node changes the potentials of all its ancestor nodes, we have to visit its ancestors again to compute the new potentials • Therefore, the post-pruning is carried out iteratively until no more subtrees are pruned • Every pass in the post-pruning is a preorder visit of the decision tree
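A sketch of the iterative preorder post-pruning described above; the dictionary-based tree representation and the recompute_potentials callback (assumed to refresh each internal node's gain in heldout likelihood) are illustrative, not the paper's data structures.

```python
def preorder_prune(node, threshold):
    """One preorder pass: prune the subtree rooted at any internal node whose
    potential (gain in heldout likelihood from growing it) is below threshold.
    Returns True if at least one subtree was pruned in this pass."""
    if node["leaf"]:
        return False
    if node["potential"] < threshold:
        node["leaf"] = True            # the pruned internal node becomes a new leaf
        node["children"] = []
        return True
    # build the list first so every child is visited (no short-circuiting)
    return any([preorder_prune(child, threshold) for child in node["children"]])

def post_prune(root, threshold, recompute_potentials):
    """Repeat preorder passes until nothing is pruned; pruning a node changes the
    potentials of all its ancestors, so they are recomputed between passes."""
    recompute_potentials(root)
    while preorder_prune(root, threshold):
        recompute_potentials(root)
```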
Decision tree language models – Open problems • Data fragmentation and overfitting • As decision trees are being constructed, the training data is split further and further • Therefore, the algorithms used for splitting a node are based on less and less data and the decisions are no longer as reliable • Since we always use heldout data cross-entropy as a stopping rule or post-pruning criterion, the resulting decision trees are inevitably shaped for the benefit of the heldout data • As our real unseen test data will not be the same as the heldout data, the decision trees constructed this way are not guaranteed to perform well on test data • There are several proposed methods to deal with the data fragmentation problem • cross-validation, statistical significance tests, and the minimum description length principle
Random forest • Ensembles of classifiers usually have better accuracy than individual classifiers • In bagging, each classifier is trained on a random bootstrap sample (drawn with replacement) from the examples in the training data • AdaBoost weights the training examples based on the output of a base classifier • by putting more weight on examples for which the base classifier makes a mistake, the new classifier is expected to adapt to those examples and classify them correctly • Apart from bagging and AdaBoost, ensembles of decision trees can be generated by randomizing the decision tree construction • Randomized decision trees are constructed by randomly selecting a subset of the features at each node to search for the best split • and by randomly sampling the training data on which to base the search
Random forest (cont.) • A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θ_m), m = 1, 2, …}, where the Θ_m are independent identically distributed (i.i.d.) random vectors and each tree casts a unit vote for the most popular class at input x • The randomness injected is expected to minimize correlation among the decision trees while maintaining strength • As a result, random forests achieve better classification accuracy • It is also shown that the accuracy of random forests always converges, by use of the Strong Law of Large Numbers • Therefore overfitting is not a problem when there is a large number of decision trees
Randomizing decision tree language models – Sampling training data • Like bagging, the training data used to construct the decision tree language models can be sampled randomly • Suppose we have training data that contains M sentences, from which we want to sample M′ sentences • M′ can be greater than M or smaller, depending on the situation • If we have a very large amount of training data which cannot all be used at once because of practical considerations, we would choose M′ < M (sampling without replacement) • When the training data is relatively small, we might want M′ ≥ M for better coverage (sampling with replacement)
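The two sampling regimes can be sketched as follows, with sentences as the sampling unit; m_prime corresponds to M′ above.

```python
import random

def sample_sentences(sentences, m_prime, with_replacement):
    """Sample m_prime sentences from the training data.
    with_replacement=True  -> bootstrap-style sampling (m_prime may exceed len(sentences)).
    with_replacement=False -> a random subset (requires m_prime <= len(sentences))."""
    if with_replacement:
        return [random.choice(sentences) for _ in range(m_prime)]
    return random.sample(sentences, m_prime)
```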
Randomizing decision tree language models – Random question selection • Random position selection • For each of the n-1 positions of the history in an n-gram model, we have a Bernoulli trial with a probability r of success • The trials are assumed to be independent of each other • The positions corresponding to successful trials are then passed to the exchange algorithm, which will choose the best among them for splitting a node • Random initialization • The initialization of the exchange algorithm splits the basic elements into two sets L and R • In order to achieve a random initialization, we flip a fair coin for each basic element • In case one of the two sets is empty, we start over and perform the same procedure again • Since the local optima of the exchange algorithm depend on the initialization, the random initialization is in fact performing a random selection among the local optima
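A small sketch of the random position selection described above; the fallback when every Bernoulli trial fails (picking one position uniformly) is an assumption and may differ from what the paper does.

```python
import random

def select_positions(num_positions, r):
    """One Bernoulli(r) trial per history position; returns the positions that are
    passed on to the exchange algorithm for this split."""
    selected = [j for j in range(num_positions) if random.random() < r]
    if not selected:
        # assumed fallback: if all trials fail, pick a single position uniformly at random
        selected = [random.randrange(num_positions)]
    return selected
```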
Random forest language models • The randomized version of the decision tree growing algorithm is run many times, and finally we get a collection of randomly grown decision trees • For each n-gram, every decision tree language model may have a different probability because the equivalence classifications are different due to the randomization • We aggregate the probabilities of an n-gram event from all decision trees to get the random forest language model • Since each decision tree is a smooth language model, the random forest language model is also a smooth model • Suppose we have M randomly grown decision trees DT_1, …, DT_M. In the word n-gram case, the random forest language model probabilities can be computed as P_RF(w_i | w_{i-n+1}^{i-1}) = (1/M) Σ_{m=1}^{M} P_{DT_m}(w_i | Φ_{DT_m}(w_{i-n+1}^{i-1}))
Random forest language models (cont.) • A random forest language model is a language model consisting of a collection of tree-structured language models {P_{DT_m}, m = 1, …, M} • where the random vectors governing the tree construction are independent and identically distributed • each DT_m represents a decision tree language model and assigns a probability to an n-gram event • The probability of the event assigned by the random forest language model is the average of the probabilities assigned by the trees • It is worth mentioning that the random forest language model in the previous equation can be seen as a general form of language models • A regular n-gram language model can be seen as a special decision tree language model in which each leaf node has only one history • A decision tree language model can also be seen as a special random forest language model in which there is only one tree
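The aggregation in the previous two slides amounts to a simple average; a sketch, assuming each tree is available as a probability function taking a word and its history:

```python
def rf_prob(word, history, dt_models):
    """Random forest LM probability: the average of the member decision-tree probabilities."""
    return sum(dt(word, history) for dt in dt_models) / len(dt_models)
```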
Embedded random forest language models • The use of Kneser–Ney smoothing in lower order probabilities may not be adequate, especially when we apply random forests to higher order n-gram models • Therefore, we should not limit the random forest language models by the use of Kneser–Ney lower order probabilities • One possible and natural solution is to use random forests for lower order probabilities as well • The random forest language models are then recursively defined • The embedded random forest language models should be constructed from bottom up
Experiments with random forest language models • In order to test the effectiveness of the models, they present perplexity (PPL) results in this chapter for various random forest models • Conducted on UPenn Treebank portion of the WSJ corpus • The UPenn Treebank contains 24 sections of hand-parsed sentences, for a total of about one million words • Sections 00–20 (929,564 words) as training data • Sections 21–22 (73,760 words) as heldout data • Sections 23–24 (82,430 words) as test data • The word vocabulary contains 10 thousand words including a special token for unknown words • All of the experimental results in this chapter are based on this corpus and setup
Experiments with random forest language models – Basic results • RF-trigram (100 DTs) settings: the training data was not sampled randomly; the global Bernoulli trial probability for position selection was set to 0.5; the pruning threshold in DT-Grow-fast was chosen to be 0 • DT-trigram obtained a slightly lower PPL than KN-trigram on heldout data, but was much worse on the test data • This indicates that the DT-trigram overfits the heldout data because of greedy pruning • RF-trigram performed much better on both heldout and test data
Experiments with random forest language models – Analysis of effectiveness • In order to analyze why this random forest approach can improve the PPL on test data, they split the events in the test data into two categories: seen events and unseen events • Seen events are defined as follows • For KN-trigram, seen events are those that appear in the training/heldout data • For DT-trigram, a seen event is one whose predicted word is seen in the training data following the equivalence class of the history • For RF-trigram, they define seen events as those that are seen events in at least one decision tree among the random collection of decision trees • The RF-trigram reduced the number of unseen events greatly: from 54.4% of the total events to only 8.3%. Although the PPL of the remaining unseen events is much higher, the overall PPL is still improved
Experiments with random forest language models – Convergence of the random forest trigram model • There is no theoretical basis for choosing the number of decision trees in a random forest model • The PPL drops sharply at the beginning and tapers off quite quickly • The PPL of the RF-trigram with fewer than 10 decision trees is already better than that of the KN-trigram
Experiments with random forest language models – Random question selection • In general, the smaller the global Bernoulli probability r is, the more random the resulting decision trees are • When r = 1, all positions are considered as candidates to ask questions about and the best position will always be chosen • When r = 0, each position is chosen purely at random, i.e., with probability 1/2 for the random forest trigram model • As expected, when r = 1 there is almost no randomness in the decision tree construction • Therefore, the corresponding ‘‘random forest’’ trigram is not much better than a single decision tree
Experiments with random forest language models Higher order n-gram random forest models • In higher order n-gram models, there are more positions in the histories than in trigrams and the data sparseness problem is more severe • Therefore, the random forest approach may be more effective • Contrary to their expectation, using random forests for higher order n-gram models did not result in more relative improvements over the trigram model • Since more freedom in random question selection for the higher order n-gram models could not benefit the random forest approach, they believe there must be some fundamental problem that prevents further improvements
Experiments with random forest language models – Embedded random forest {3, 4, 5, 6}-gram models • Even though there is fluctuation in the results due to the randomness, we see that the embedded random forest 4-gram model yields more relative improvement over the baseline than does the trigram • However, the table also shows degradation in PPL for the 5-gram and 6-gram models
Experiments with random forest language models – Using random forest language models in ASR • Two automatic speech recognition systems are studied in this paper • Wall Street Journal (WSJ) • DARPA ’93 HUB1 • consists of 213 utterances read from the Wall Street Journal, a total of 3446 words • The WER of the best scoring hypotheses (trigram LM) from the 50-best lists is 13.7% • Conversational telephony system for rich transcription (CTS-RT) • DARPA EARS 2004 Rich Transcription evaluation (RT-04)
Experiments with random forest language models – Results on the WSJ corpus • The WER can be a little better than both models when • We can see that under both conditions the random forest approach improved upon the regular KN approach • The improvement in WER using the trigram trained on 40 million words is not as much as with the trigram trained on 20 million words • A possible reason is that the data sparseness problem is not as severe and the performance advantage of the random forest approach is limited
Experiments with random forest language models – Results on the WSJ corpus (cont.) • Since the number of decision trees in the random forest trigram models was chosen arbitrarily • it would be interesting to see the performance of the random forest trigram models when different numbers of trees are used • The best WER result was achieved by using 10 decision trees
Structured language model • Although n-gram models, especially trigram models, are used in most of the state-of-the-art systems, there is a fundamental flaw: long distance dependencies cannot be captured • Increasing the order n does not lead to better models because long distance dependencies are often many words apart and the data sparseness problem undermines the possible benefit as n grows • The structured language model (SLM) uses syntactic information beyond the regular n-gram models to capture long distance dependencies • The SLM is based on statistical parsing techniques which allow syntactic analysis of sentences