Generalized Hebbian Algorithm for Dimensionality Reduction in Natural Language Processing
Genevieve Gorrell, 5th June 2007
Introduction • Think of datapoints plotted in hyperspace • Imagine a space in which each word has its own dimension: with one axis for "big" and one for "bad", the passage "big big bad" becomes [ 2 1 ], "big bad" becomes [ 1 1 ] and "bad" becomes [ 0 1 ] • We can compare these passages using vector representations in this space (figure: the three passages plotted against an axis of bigness and an axis of badness)
Dimensionality Reduction • Do we really need two dimensions to describe the relationship between these datapoints? (figure: "big big bad", "big bad" and "bad" plotted against the axis of bigness and the axis of badness)
Rotation • Imagine the data look like this ...
More Rotation • Or even like this ... • Because if these were the dimensions we would know which were the most important • We could describe as much of the data as possible using a smaller number of dimensions • approximation • compression • generalisation
Eigen Decomposition • The key lies in rotating the data into the most efficient orientation • Eigen decomposition will give us a set of axes (eigenvectors) of a new space in which our data might more efficiently be represented
Eigen Decomposition • Eigen decomposition is a vector space technique that provides a useful way to automatically reduce data dimensionality • This technique is of interest in natural language processing, for example in Latent Semantic Indexing • Given a dataset in a given space, eigen decomposition can be used to create a nearest approximation in a space with fewer dimensions • For example, document vectors represented as bags of words, in a space with one dimension per word, can be mapped to a space with fewer dimensions than one per word • Formally, the decomposition finds the eigenvectors v and eigenvalues λ of the matrix M being decomposed, i.e. the solutions of Mv = λv
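As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of eigen decomposition and of keeping only the strongest axis; the matrix values are made up.

    import numpy as np

    # A small symmetric matrix to decompose (illustrative values only).
    M = np.array([[5.0, 3.0],
                  [3.0, 3.0]])

    # Eigen decomposition: all v and lambda satisfying M v = lambda v.
    eigenvalues, eigenvectors = np.linalg.eigh(M)   # eigh handles symmetric matrices

    # Keep only the axis with the largest eigenvalue: a one-dimensional
    # approximation of the original two-dimensional data.
    top = np.argmax(eigenvalues)
    v = eigenvectors[:, top]
    M_approx = eigenvalues[top] * np.outer(v, v)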
A real world example—eigenfaces • Each new dimension captures something important about the data • The original observation can be recreated from a combination of these components
Eigen Faces 2 • Each eigen face captures as much of the remaining information in the dataset as possible (the eigenvectors are orthogonal to each other) • This is much more efficient than the original representation
More Eigen Face Convergence • Eigen faces with high eigenvalues capture important generalisations in the corpus • These generalisations might well apply to unseen data ...
We have been using this in natural language processing ... • Corpus-driven language modelling suffers from problems with data sparsity • We can use eigen decomposition to make generalisations that might apply to unseen data • But language corpora are very large ...
Problems with eigen decomposition • Existing algorithms often: • require all the data to be available at once (batch processing) • produce all the component vectors simultaneously, even though they may not all be necessary, and computing all of them takes longer • are very computationally expensive, and may therefore exceed the capabilities of the computer for larger corpora • large RAM requirement • time and RAM requirements that grow much faster than linearly with dataset size
Generalized Hebbian Algorithm (Sanger 1989) • Based on Hebbian learning • Simple localised technique for deriving eigen decomposition • Requires very little memory • Learns based on single observations (for example, document vectors) presented serially, therefore no problem to add more data • In fact, the entire matrix need never be simultaneously available • The eigenvectors with the greatest eigenvalues are produced first
GHA Algorithm

    c += (c . x) x

• c is the eigenvector, x is the training datum

    Initialise the eigenvector randomly
    While the eigenvector is not converged {
        Dot-product each training vector with the eigenvector
        Multiply the result by the training vector
        Add the resulting vector to the eigenvector
    }

• The dot product is a measure of the similarity of direction of one vector with another, and produces a scalar • There are various ways in which one might assess convergence
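A minimal Python/numpy sketch of this loop, assuming the training data arrive as rows of an array; the learning rate, the per-pass renormalisation, and the fixed pass count standing in for a convergence test are my own simplifications rather than anything specified in the slides.

    import numpy as np

    def gha_first_eigenvector(data, n_passes=50, rate=0.01):
        """Estimate the strongest eigenvector using the update c += (c . x) x."""
        c = np.random.rand(data.shape[1])          # initialise eigenvector randomly
        for _ in range(n_passes):                  # stand-in for a convergence test
            for x in data:                         # single observations, serially
                c += rate * np.dot(c, x) * x       # dot product, scale datum, add
            c /= np.linalg.norm(c)                 # keep the vector at unit length
        return c

    # e.g. v = gha_first_eigenvector(np.random.randn(500, 10))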
GHA Algorithm Continued • Or in other words, train by adding each datum to the eigenvector in proportion to how closely the datum already resembles it • Train subsequent eigenvectors by removing the stronger eigenvectors from the data before training, so that the same directions are not found again
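A sketch of that removal (deflation) step, assuming the stronger eigenvectors have already been learned and normalised to unit length; again this is illustrative rather than the slides' exact procedure.

    import numpy as np

    def deflate(x, learned):
        """Subtract from x its projections onto already-learned unit eigenvectors,
        so that training on the result finds a new direction instead."""
        for c in learned:
            x = x - np.dot(c, x) * c
        return x

    # The k-th eigenvector is then trained on deflate(x, eigenvectors[:k])
    # in place of the raw datum x in the update from the previous slide.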
GHA as a neural net • A single linear output unit is connected to inputs Input_1 ... Input_n by weights Weight_1 ... Weight_n • The output is dp = Σ_{x=1..n} Input_x · Weight_x • Each weight is updated as Weight_x += dp · Input_x • Can be extended to learn many eigenvectors
Singular Value Decomposition • Extends eigen decomposition to paired data • For example, from the passages "bad", "big bad" and "big big bad" we can build either matrix below and decompose it:

    Word co-occurrence          Word bigrams
         big  bad                    big  bad
    big   5    3               big    1    2
    bad   3    3               bad    0    0
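For reference, a hedged numpy sketch of a batch SVD of a bigram count matrix like the one above; the incremental approach on the next slides avoids ever forming such a matrix explicitly.

    import numpy as np

    # Rows: first word of the bigram, columns: second word (illustrative counts).
    bigrams = np.array([[1.0, 2.0],    # big -> big, big -> bad
                        [0.0, 0.0]])   # bad -> big, bad -> bad

    U, s, Vt = np.linalg.svd(bigrams)              # singular vectors and values
    rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])     # best rank-1 approximation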
Asymmetrical GHA (Gorrell 2006) • Extends GHA to asymmetrical datasets • allows us to work with n-grams for example • Retains the features of GHA
Asymmetrical GHA Algorithm

    c_a += (c_b . x_b) x_a
    c_b += (c_a . x_a) x_b

• Train the singular vectors on data presented as a series of vector pairs: dot the left training datum with the left singular vector, scale the right training datum by the resulting scalar and add it to the right singular vector, and vice versa • For example, the first word in a bigram might be the vector x_a and the second x_b
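A minimal sketch of these paired updates, assuming each training pair is supplied as two numpy vectors (for bigrams, one-hot vectors for the first and second word); the learning rate and per-pass normalisation are my own additions, not part of the slides.

    import numpy as np

    def agha_first_singular_pair(pairs, dim_a, dim_b, n_passes=50, rate=0.01):
        """Estimate the strongest left/right singular vector pair from a
        list of (x_a, x_b) training vector pairs."""
        c_a, c_b = np.random.rand(dim_a), np.random.rand(dim_b)
        for _ in range(n_passes):                      # stand-in for convergence test
            for x_a, x_b in pairs:
                c_a += rate * np.dot(c_b, x_b) * x_a   # c_a += (c_b . x_b) x_a
                c_b += rate * np.dot(c_a, x_a) * x_b   # c_b += (c_a . x_a) x_b
            c_a /= np.linalg.norm(c_a)
            c_b /= np.linalg.norm(c_b)
        return c_a, c_b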
Asymmetrical GHA Performance (20,000 NL bigrams) • RAM requirement linear with dimensionality and number of singular vectors required • Time per training step linear with dimensionality • This is a big improvement on conventional approaches for larger corpora/dimensionalities ... • But don't forget, the algorithm needs to be allowed to converge
N-Gram Language Model Smoothing • Modelling language as a string of n-grams • highly successful approach • but we will always have problems with data sparsity • zero probabilities are bad news (figure: a Zipf curve)
N-gram Language Modelling—An Example Corpus A man hits the ball at the dog. The man hits the ball at the house. The man takes the dog to the ball. A man takes the ball to the house. The dog takes the ball to the house. The dog takes the ball to the man. The man hits the ball to the dog. The man walks the dog to the house. The man walks the dog. The dog walks to the man. A dog hits a ball. The man walks in the house. The man hits the dog. A ball hits the dog. The man walks. A ball hits. Every ball hits. Every dog walks. Every man walks. A man walks. A small man walks. Every nice dog barks.
An Example Corpus as a Normalised Bigram Matrix

        man   hits  the   ball  at    dog   house takes to    walks a     in    small nice  barks
a       0.03  0.00  0.00  0.03  0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.01  0.00  0.00
man     0.00  0.04  0.00  0.00  0.00  0.00  0.00  0.02  0.00  0.07  0.00  0.00  0.00  0.00  0.00
hits    0.00  0.00  0.05  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.01  0.00  0.00  0.00  0.00
the     0.10  0.00  0.00  0.07  0.00  0.10  0.05  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ball    0.00  0.03  0.00  0.00  0.02  0.00  0.00  0.00  0.04  0.00  0.00  0.00  0.00  0.00  0.00
at      0.00  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
takes   0.00  0.00  0.04  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
dog     0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.02  0.02  0.02  0.00  0.00  0.00  0.00  0.01
to      0.00  0.00  0.07  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
walks   0.00  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.01  0.00  0.00  0.01  0.00  0.00  0.00
in      0.00  0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
every   0.01  0.00  0.00  0.01  0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.01  0.00
small   0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
nice    0.00  0.00  0.00  0.00  0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
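A sketch of how a matrix like this can be built, assuming lower-casing, sentence-internal bigrams only, and normalisation over all bigram counts; the slides do not spell out the exact preprocessing behind their numbers, so treat this as illustrative.

    import numpy as np
    from collections import Counter

    corpus = "A man hits the ball at the dog. The man hits the ball at the house."  # etc.

    sentences = [s.split() for s in corpus.lower().split(".") if s.strip()]
    counts = Counter()
    for sent in sentences:
        counts.update(zip(sent, sent[1:]))          # bigrams within each sentence

    vocab = sorted({w for sent in sentences for w in sent})
    index = {w: i for i, w in enumerate(vocab)}

    M = np.zeros((len(vocab), len(vocab)))
    for (w1, w2), c in counts.items():
        M[index[w1], index[w2]] = c
    M /= M.sum()                                    # normalise to sum to one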
First Singular Vector Pair

        man   hits  the   ball  at    dog   house takes to    walks a     in    small nice  barks
a       0.02  0.00  0.00  0.02  0.00  0.02  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
man     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
hits    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
the     0.10  0.00  0.00  0.07  0.00  0.10  0.05  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ball    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
at      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
takes   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
dog     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
to      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
walks   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
in      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
every   0.01  0.00  0.00  0.01  0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
small   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
nice    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Second Singular Vector Pair

        man   hits  the   ball  at    dog   house takes to    walks a     in    small nice  barks
a       0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
man     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
hits    0.00  0.00  0.05  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
the     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ball    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
at      0.00  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
takes   0.00  0.00  0.04  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
dog     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
to      0.00  0.00  0.07  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
walks   0.00  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
in      0.00  0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
every   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
small   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
nice    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Third Singular Vector Pair

        man   hits  the   ball  at    dog   house takes to    walks a     in    small nice  barks
a       0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
man     0.00  0.04  0.00  0.00  0.01  0.00  0.00  0.02  0.02  0.06  0.00  0.00  0.00  0.00  0.00
hits    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
the     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ball    0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.01  0.01  0.02  0.00  0.00  0.00  0.00  0.00
at      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
takes   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
dog     0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.01  0.01  0.02  0.00  0.00  0.00  0.00  0.00
to      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
walks   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
in      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
every   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
small   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
nice    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Language Models from Eigen N-Grams

        man   hits  the   ball  at    dog   house takes to    walks a     in    small nice  barks
a       0.02  0.00  0.00  0.02  0.00  0.02  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
man     0.00  0.04  0.00  0.00  0.01  0.00  0.00  0.02  0.02  0.06  0.00  0.00  0.00  0.00  0.00
hits    0.00  0.00  0.05  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
the     0.10  0.00  0.00  0.07  0.00  0.10  0.05  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ball    0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.01  0.01  0.02  0.00  0.00  0.00  0.00  0.00
at      0.00  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
takes   0.00  0.00  0.04  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
dog     0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.01  0.01  0.02  0.00  0.00  0.00  0.00  0.00
to      0.00  0.00  0.07  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
walks   0.00  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
in      0.00  0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
every   0.01  0.00  0.00  0.01  0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
small   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
nice    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

• Add k singular vector pairs ("eigen n-grams") together • Remove all the negative cell values • Normalise row-wise to get probabilities • Include a smoothing approach to remove zeros
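A hedged sketch of those four steps, assuming a batch SVD stands in for the incrementally computed singular vectors and a single additive constant stands in for the smoothing approach (the slides do not specify either choice).

    import numpy as np

    def svd_bigram_lm(M, k, smoothing=1e-4):
        """Rebuild conditional probabilities P(second word | first word) from
        the top k singular vector pairs ('eigen n-grams') of bigram matrix M."""
        U, s, Vt = np.linalg.svd(M)
        approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # add k singular triples
        approx = np.clip(approx, 0.0, None)                # remove negative cells
        approx += smoothing                                # crude smoothing: no zeros
        return approx / approx.sum(axis=1, keepdims=True)  # row-wise normalisation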
What do we hope to see? • The theory is that the reduced-dimensionality representation describes the unseen test corpus better than the original representation does • As k increases, perplexity should decrease until the optimum is reached • Perplexity should then begin to increase again once the optimum is passed and too much of the training data's detail is included • We hope for a U-shaped curve
Some results ... • Perplexity is a measure of the quality of the language model (lower is better) • k is the number of dimensions (eigen n-grams) retained • Times are how long it took to calculate the dimensions
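For reference, perplexity over a test text can be computed roughly as below (the standard definition; the slides do not give the exact formula used), where prob is a conditional bigram matrix such as the one returned above and index maps words to rows/columns.

    import numpy as np

    def bigram_perplexity(prob, index, tokens):
        """exp of the average negative log probability the model assigns
        to each observed bigram in the test text (lower is better)."""
        logps = [np.log(prob[index[w1], index[w2]])
                 for w1, w2 in zip(tokens, tokens[1:])]
        return float(np.exp(-np.mean(logps)))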
Some specifics about this experiment • The corpus comprises five newsgroups from CMU's newsgroup corpus • Training corpus contains over a million items • Unseen test corpus comprises over 100,000 items • I used AGHA to calculate the decomposition • I used simple heuristically-chosen smoothing constants and single-order language models
Maybe k is too low? • 200,000 trigrams • LAS2 algorithm
Full rank decomposition • 20,000 bigrams • Furthermore, perplexity in each case never comes down to the baseline perplexity of the original n-gram model
Linear interpolation may generate an interesting result • Best result is 370 • An overall improvement of 20% is demonstrated • (However, this involved tuning on the test corpus)
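The interpolation itself is just a weighted mixture; a one-function sketch, assuming p_svdlm and p_ngram are the two conditional probability matrices and lam is the interpolation weight (which, as noted above, was tuned on the test corpus here).

    def interpolate(p_svdlm, p_ngram, lam):
        """Linear interpolation: a weighted mixture of the SVDLM and the
        original n-gram model's conditional probabilities."""
        return lam * p_svdlm + (1.0 - lam) * p_ngram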
200,000 Trigram Corpus • Improvement on the baseline n-gram is even greater on the medium-sized corpus (30%)
1 Million Trigram Corpus • This is a big dataset for SVD! • Needed to increase the weighting on the SVDLM a lot to get a good result
Fine-Tuning k • Tuning k results in a best perplexity improvement of over 40% • A low optimal k is a good thing, because many algorithms for calculating SVD produce singular vectors one at a time, starting with the largest
Tractability • The biggest challenge with the SVDLM is tractability • Calculating the SVD is computationally demanding • But the optimal k is low • I have also developed an algorithm that helps with tractability • Usability of the resulting SVDLM is also an issue • The SVDLM is much larger than a regular n-gram model • But its size can be minimised by discarding low values, with minimal impact on performance
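A sketch of that pruning idea, assuming the SVDLM is stored as a dense probability matrix and entries below a small threshold are dropped into a sparse dictionary representation (the threshold value is illustrative, not taken from the slides).

    import numpy as np

    def prune(prob, threshold=1e-3):
        """Keep only cells above the threshold in a sparse dict-of-keys form;
        discarded cells would fall back to a smoothed floor value at query time."""
        return {(i, j): p for (i, j), p in np.ndenumerate(prob) if p > threshold}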
Backoff SVDLM • Improving on n-gram language modelling is interesting work • However no improvement on the state of the art has been demonstrated yet! • Next steps involve creation of a backoff SVDLM • Interpolating with lower-order n-grams is standard • Backoff models have much superior performance
Similar Work • Jerome Bellegarda developed the LSA language model • Uses longer-span eigen decomposition information to access semantic information • Others have since developed the work • Saul and Pereira demonstrated an approach based on aggregate Markov models • Again this demonstrates that some form of dimensionality reduction is beneficial
Summary • GHA-based algorithm allows large datasets to be decomposed • Asymmetrical formulation allows data such as n-grams to be decomposed • Promising initial results in n-gram language model smoothing have been presented
Thanks! • Gorrell, 2006. "Generalized Hebbian Algorithm for Incremental Singular Value Decomposition." Proceedings of EACL 2006 • Gorrell and Webb, 2005. "Generalized Hebbian Algorithm for Incremental Latent Semantic Analysis." Proceedings of Interspeech 2005 • Sanger, T. 1989. "Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Network." Neural Networks, 2, 459-473