
Analysis of a Neural Language Model


Presentation Transcript


  1. Analysis of a Neural Language Model Eric Doi CS 152: Neural Networks Harvey Mudd College

  2. Project Goals • Implement a neural network language model • Perform classification between English and Spanish (scrapped) • Produce results supporting work by Bengio et al. • Interpret learned parameters

  3. Review Problem: Modeling the joint probability function of sequences of words in a language to make predictions: which word w_t maximizes P(w_t | w_1, …, w_{t-1})?

  4. Review Problem: Modeling the joint probability function of sequences of words in a language to make predictions: which word w_t maximizes P(w_t | w_1, …, w_{t-1})? Exercise 1: US president has "no hard feelings" about the Iraqi journalist who flung _______

  5. Review Problem: Modeling the joint probability function of sequences of words in a language to make predictions: which word w_t maximizes P(w_t | w_1, …, w_{t-1})? Exercise 1: US president has "no hard feelings" about the Iraqi journalist who flung shoes

  6. Review Problem: Modeling the joint probability function of sequences of words in a language to make predictions: which word w_t maximizes P(w_t | w_1, …, w_{t-1})? Exercise 2: in an algorithm that seems to 'backpropagate errors', ______

  7. Review Problem: Modeling the joint probability function of sequences of words in a language to make predictions: which word w_t maximizes P(w_t | w_1, …, w_{t-1})? Exercise 2: in an algorithm that seems to 'backpropagate errors', hence

  8. Review • Conditional probability (chain rule): P(w_1, …, w_T) = ∏_t P(w_t | w_1, …, w_{t-1}) • N-gram assumption: P(w_t | w_1, …, w_{t-1}) ≈ P(w_t | w_{t-n+1}, …, w_{t-1})

  9. Review • N-grams do handle sparse data reasonably well • However, there are problems: • Narrow consideration of context (~1–2 words) • No notion of semantic/grammatical similarity: seeing “A cat is walking in the bedroom” tells the model nothing about “A dog was running in a room”
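
The count-based baseline the review slides describe can be sketched roughly as follows (a minimal illustration, not the model built in this project; the function names and the add-alpha smoothing are illustrative choices):

    from collections import Counter, defaultdict

    def train_trigram(tokens):
        """Count-based trigram model: estimate P(w_t | w_{t-2}, w_{t-1}) from relative frequencies."""
        counts = defaultdict(Counter)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            counts[(a, b)][c] += 1
        return counts

    def trigram_prob(counts, context, word, vocab_size, alpha=1.0):
        """Add-alpha smoothing gives unseen (context, word) pairs a nonzero probability."""
        ctx = counts.get(context, Counter())
        return (ctx[word] + alpha) / (sum(ctx.values()) + alpha * vocab_size)

Even with smoothing, the estimate for a word depends only on the last one or two context words, which is exactly the limitation listed above.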

  10. Neural Network Approach • The general idea: 1. Associate with each word in the vocabulary (e.g. size 17,000) a feature vector (30–100 features) 2. Express the joint probability function of word sequences in terms of feature vectors 3. Learn simultaneously the word feature vectors and the parameters of the probability function

  11. Data Preparation • Input text needs, or at least benefits from, preprocessing • Treat punctuation as words • Ignore case • Strip any irrelevant data • Assemble the vocabulary • Combine infrequent words (e.g. frequency ≤ 3) • Encode numerically
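
A minimal sketch of these preprocessing steps (the regular expression is an illustrative choice; the *rare_word* token name follows slide 24 and the threshold of 3 follows this slide, but the project's actual pipeline may differ):

    import re
    from collections import Counter

    def preprocess(text, min_count=3):
        # Treat punctuation marks as words of their own and ignore case.
        tokens = re.findall(r"[a-z]+|[^\sa-z]", text.lower())
        # Combine infrequent words (frequency <= min_count) into a single token.
        freq = Counter(tokens)
        tokens = [t if freq[t] > min_count else "*rare_word*" for t in tokens]
        # Encode numerically: one integer id per vocabulary item.
        vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
        return [vocab[t] for t in tokens], vocab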

  12. Data Preparation • Parliament proceedings

  13. Neural Architecture • C: a lookup table mapping each word to its feature vector • A neural network learning the probability function over the next word

  14. Feature Vector Lookup Table • Acts like a linear layer applied to one-hot word encodings, with the same weights shared across all context positions
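
A tiny numpy illustration of that equivalence: looking up row i of the table C gives the same vector as multiplying a one-hot encoding of word i by C, and the same C is shared by every context position (sizes here are illustrative):

    import numpy as np

    V, m = 10, 2                   # tiny illustrative vocabulary size and feature dimension
    C = np.random.randn(V, m)      # lookup table: one m-dimensional feature vector per word

    def one_hot(i, size):
        v = np.zeros(size)
        v[i] = 1.0
        return v

    i = 7
    # Row lookup == one-hot vector times C.
    assert np.allclose(C[i], one_hot(i, V) @ C)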

  15. Neural network • Optional direct connections from the feature vectors to the output layer • Note: the feature vectors are the network's only connection to the words • The hidden layer models interactions among the context words

  16. Final Layer • Most of the computation happens here: one output unit per vocabulary word • The outputs pass through a softmax normalization to give a probability distribution over the next word
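
For reference, the output computation in Bengio et al. (2003), which these architecture slides follow (x is the concatenation of the context words' feature vectors, W holds the optional direct connections, and the softmax produces the distribution over the next word):

    y = b + W x + U \tanh(d + H x)

    P(w_t = i \mid w_{t-n+1}, \ldots, w_{t-1}) = \frac{e^{y_i}}{\sum_{j=1}^{|V|} e^{y_j}}

The softmax sum runs over every word in the vocabulary, which is why this layer dominates the computation.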

  17. Parameters
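
In Bengio et al. (2003), which this project follows, the free parameters are θ = (b, d, W, U, H, C). A rough numpy sketch of their shapes and of the forward pass (the sizes are illustrative, not the ones used in the project):

    import numpy as np

    V, m, h, n = 413, 2, 50, 3                # vocab size, feature dim, hidden units, n-gram order
    rng = np.random.default_rng(0)

    C = rng.normal(0, 0.01, (V, m))           # word feature vectors
    H = rng.normal(0, 0.01, (h, (n - 1) * m)) # input-to-hidden weights
    d = np.zeros(h)                           # hidden biases
    U = rng.normal(0, 0.01, (V, h))           # hidden-to-output weights
    b = np.zeros(V)                           # output biases
    W = np.zeros((V, (n - 1) * m))            # optional direct connections

    def forward(context_ids):
        """Return P(w_t | context) over the whole vocabulary."""
        x = np.concatenate([C[i] for i in context_ids])   # concatenated context feature vectors
        y = b + W @ x + U @ np.tanh(d + H @ x)             # y = b + Wx + U tanh(d + Hx)
        e = np.exp(y - y.max())                            # softmax with overflow guard
        return e / e.sum()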

  18. Training • We want to find the parameters that maximize the training-corpus log-likelihood: • Plus a regularization term • Run through the full training sequence, sliding the context window one word at a time
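
For reference, the criterion as written in Bengio et al. (2003), which the slide's equation presumably matches: the average log-likelihood of the training sequence plus a regularization term R(θ):

    L = \frac{1}{T} \sum_{t} \log P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}; \theta) + R(\theta)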

  19. Training • Perform stochastic (on-line) gradient ascent using backpropagation • The learning rate decreases as training progresses
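
A minimal sketch of one on-line update; the 1/(1 + r·t) decay is the schedule used in Bengio et al. (2003) and is assumed here to be the one the slide refers to (the constants are illustrative):

    def sgd_step(params, grads, t, eps0=1e-3, r=1e-8):
        """One stochastic gradient ascent step on the log-likelihood.
        The learning rate decays as eps_t = eps0 / (1 + r * t)."""
        eps = eps0 / (1.0 + r * t)
        for name in params:
            params[name] += eps * grads[name]   # ascent: move in the gradient direction
        return params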

  20. Results • Perplexity as a measure of success: the geometric average of 1 / P(w_t | w_1, …, w_{t-1}) over the test text • Measures surprise; a perplexity of 10 means the model is on average as surprised as when presented with 1 of 10 equally probable outcomes • Perplexity = 1 => perfect prediction • Perplexity ≥ V => failure (no better than guessing uniformly over the vocabulary)
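
A small helper making the definition concrete (word_probs stands for the probability the trained model assigned to each actual next word in the test text; the name is illustrative):

    import math

    def perplexity(word_probs):
        """Geometric mean of 1 / P(w_t | context), i.e. exp of the negative average log-probability."""
        avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
        return math.exp(avg_neg_log)

    # Guessing uniformly among 10 outcomes gives perplexity 10, matching the "1 of 10" intuition above.
    assert abs(perplexity([0.1] * 5) - 10.0) < 1e-6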

  21. Set 1: Train 1000, Test 1000, V = 82

  22. Set 2: Train 10000, Test 1000, V = 413

  23. Unigram Modeling • The bias values of the output layer reflect the overall frequencies of the words: with no contribution from the hidden layer or direct connections, the model's distribution is just the softmax of the biases • Looking at the output words with the highest bias values:
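
A sketch of the check behind this slide: sort the output-layer biases and compare the top words with the most frequent words in the corpus (b and id_to_word stand for the trained bias vector and the id-to-word mapping; both names are illustrative):

    import numpy as np

    def top_words_by_bias(b, id_to_word, k=10):
        """Words with the largest output-layer bias values; if the biases track
        unigram frequency, these should be the corpus's most frequent words."""
        top_ids = np.argsort(b)[::-1][:k]
        return [id_to_word[i] for i in top_ids]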

  24. Analyzing features: m = 2 • Looked at the 10 highest- and 10 lowest-valued words for each of the two features • Considered the role of overall frequency • The merged *rare_word* token is 5 times as frequent as ‘the,’ but frequency is not correlated with high feature values

  25. Analyzing features: m = 2
      F1 High: session like we all one at of during you thursday
      F1 Low: part madam i . this once ' agenda sri a
      F2 High: mr the s been of have can a like once
      F2 Low: part with , that know not which during as one

  26. Analyzing features: m = 2
      F1 High: would a be on all should which madam to the
      F1 Low: president . that i year you session it who one
      F2 High: you president and parliament like , that case year if
      F2 Low: a the be - mr mrs i have there for

  27. Analyzing features: m = 2
      F1 High: to have on the and madam not been that in
      F1 Low: we however i before members president do which principle would
      F2 High: the a this you now s - president be i
      F2 Low: , which in order should been parliament shall request because

  28. Difficulties • Computation-intensive; hard to run thorough tests

  29. Future Work • Simpler sentences • Clustering to find meaningful groups of words in higher feature dimensions • Search across multiple neural networks

  30. References • Bengio et al., "A Neural Probabilistic Language Model." 2003. • Bengio and Bengio, "Taking on the Curse of Dimensionality in Joint Distributions Using Neural Networks." 2000.

  31. Questions?
