Deep Learning for Natural Language Processing
Topics • Word embeddings • Recurrent neural networks • Long short-term memory networks • Neural machine translation • Automatically generating image captions
Word meaning in NLP • How do we capture the meaning and context of words?
• Synonyms: “I loved the movie.” / “I adored the movie.”
• Synecdoche: “Today, Washington affirmed its opposition to the trade pact.”
• Homonyms: “I deposited the money in the bank.” / “I buried the money in the bank.”
• Polysemy: “I read a book today.” / “I wasn’t able to book the hotel room.”
Word Embeddings • “One of the most successful ideas of modern NLP.” • One example: Google’s Word2Vec algorithm
Word2Vec algorithm
• Input: one-hot representation of the input word over the vocabulary (10,000 units)
• Hidden layer: 300 units with a linear activation function (10,000 × 300 input-to-hidden weights)
• Output: for each word wi in the vocabulary, the probability that wi is nearby the input word in a sentence (10,000 units; 300 × 10,000 hidden-to-output weights)
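A minimal sketch of this architecture, assuming PyTorch and the skip-gram variant of Word2Vec; the class name and the full-softmax output are illustrative (real implementations speed training up with tricks such as negative sampling):

```python
# A minimal sketch, assuming PyTorch and the skip-gram variant of Word2Vec.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 10_000, 300

class SkipGram(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM):
        super().__init__()
        # Input -> hidden weights (10,000 x 300): one 300-d vector per word.
        self.in_embed = nn.Embedding(vocab_size, embed_dim)
        # Hidden -> output weights (300 x 10,000); the hidden layer itself is linear.
        self.out_proj = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, center_word_ids):
        hidden = self.in_embed(center_word_ids)   # (batch, 300)
        logits = self.out_proj(hidden)            # (batch, 10,000)
        return logits                             # softmax over these gives P(word is nearby)
```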
Word2Vec training
• Training corpus of documents
• Collect pairs of nearby words
• Example “document”: Every morning she drinks Starbucks coffee.
• Training pairs (window size = 3): (every, morning), (every, she), (morning, she), (morning, drinks), (she, drinks), (she, Starbucks), (drinks, Starbucks), (drinks, coffee), (Starbucks, coffee)
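As a concrete illustration of how such pairs can be collected, here is a small sketch in plain Python; the function name is made up, and it treats “window size = 3” as a 3-word sliding window, which reproduces the nine pairs listed above.

```python
# Illustrative sketch: collect all word pairs that co-occur inside a sliding
# 3-word window, reproducing the nine training pairs listed on the slide.
from itertools import combinations

def collect_pairs(tokens, window_size=3):
    pairs = []
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        for pair in combinations(window, 2):
            if pair not in pairs:          # keep each co-occurring pair once
                pairs.append(pair)
    return pairs

sentence = "every morning she drinks Starbucks coffee".split()
print(collect_pairs(sentence))
# [('every', 'morning'), ('every', 'she'), ('morning', 'she'), ('morning', 'drinks'),
#  ('she', 'drinks'), ('she', 'Starbucks'), ('drinks', 'Starbucks'),
#  ('drinks', 'coffee'), ('Starbucks', 'coffee')]
```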
Word2Vec training via backpropagation
• Input: one-hot vector for a word from a training pair (e.g., “drinks”)
• Target: the probability that the paired word (e.g., “Starbucks” or “coffee”) is nearby “drinks”
• The hidden layer has a linear activation function; the 10,000 × 300 and 300 × 10,000 weight matrices are adjusted by backpropagation.
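For concreteness, here is a hedged sketch (assuming PyTorch) of a single backpropagation step for one training pair. The toy six-word vocabulary and the learning rate are made up; the model is just an embedding layer followed by a bias-free linear layer, matching the two weight matrices above.

```python
# A hedged sketch of one training step for the pair (drinks, coffee), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

word_to_id = {"every": 0, "morning": 1, "she": 2, "drinks": 3, "Starbucks": 4, "coffee": 5}
V, D = len(word_to_id), 300

# Embedding = input->hidden weights; Linear = hidden->output weights (linear hidden layer).
model = nn.Sequential(nn.Embedding(V, D), nn.Linear(D, V, bias=False))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

center = torch.tensor([word_to_id["drinks"]])
target = torch.tensor([word_to_id["coffee"]])    # the "nearby" word is the target

logits = model(center)                           # scores over the vocabulary
loss = F.cross_entropy(logits, target)           # softmax + negative log-likelihood
loss.backward()                                  # backpropagate through both weight matrices
optimizer.step()
```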
Learned word vectors • After training, each row of the 10,000 × 300 input-to-hidden weight matrix is the learned 300-dimensional vector for one word (e.g., the row for “drinks”).
Some surprising results of word2vec http://www.aclweb.org/anthology/N13-1#page=784
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
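One of those surprising results is that simple vector arithmetic captures analogies, e.g., vector(“king”) − vector(“man”) + vector(“woman”) is closest to vector(“queen”). A hedged sketch of that computation, assuming the gensim library and Google’s pretrained News vectors (which must be downloaded separately):

```python
# Hedged sketch of word-vector analogy arithmetic, assuming the gensim library
# and Google's pretrained 300-d News vectors (downloaded separately).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~= queen (nearest neighbor by cosine similarity)
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```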
Word embeddings demo http://bionlp-www.utu.fi/wv_demo/
Recurrent Neural Network (RNN) From http://axon.cs.byu.edu/~martinez/classes/678/Slides/Recurrent.pptx
Recurrent Neural Network “unfolded” in time From http://eric-yuan.me/rnn2-lstm/ Training algorithm: backpropagation through time
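To make the unfolding concrete, here is a minimal sketch (assuming NumPy; dimensions and initialization are illustrative) of the recurrence that gets unrolled: the same weights are applied at every time step, and backpropagation through time is ordinary backpropagation applied to this unrolled loop.

```python
# Minimal sketch of the recurrence that gets "unfolded" in time (assumed NumPy).
import numpy as np

input_dim, hidden_dim = 50, 100
W_xh = 0.01 * np.random.randn(hidden_dim, input_dim)    # input -> hidden weights
W_hh = 0.01 * np.random.randn(hidden_dim, hidden_dim)   # hidden -> hidden (the recurrent loop)

def rnn_forward(inputs):
    """inputs: list of input vectors x_1 ... x_T; returns hidden states h_1 ... h_T."""
    h = np.zeros(hidden_dim)
    states = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)   # same weights reused at every time step
        states.append(h)
    return states

hidden_states = rnn_forward([np.random.randn(input_dim) for _ in range(6)])
```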
Encoder-decoder (or “sequence-to-sequence”) networks for translation http://book.paddlepaddle.org/08.machine_translation/image/encoder_decoder_en.png
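A minimal sketch of the encoder-decoder idea, assuming PyTorch with GRU layers (the figure’s network could equally use LSTMs); the vocabulary sizes, dimensions, and class name are illustrative, and real translation systems add attention, beam search, and more.

```python
# Minimal encoder-decoder ("sequence-to-sequence") sketch, assuming PyTorch.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 8000, 8000, 256, 512

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_embed = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, src_ids, tgt_ids):
        # Encoder reads the whole source sentence into a fixed-size state.
        _, state = self.encoder(self.src_embed(src_ids))
        # Decoder starts from that state and predicts the target sentence token by token.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        return self.out(dec_out)   # scores over the target vocabulary at each position
```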
Problem for RNNs: learning long-term dependencies. “The cat that my mother’s sister took to Hawaii the year before last when you were in high school is now living with my cousin.” Backpropagation through time: problem of vanishing gradients
Long Short-Term Memory (LSTM) • A “neuron” with a complicated memory gating structure. • Replaces ordinary hidden neurons in RNNs. • Designed to avoid the long-term dependency problem.
Long Short-Term Memory (LSTM) unit • Figure: a simple RNN (hidden) unit compared with an LSTM (hidden) unit. From https://deeplearning4j.org/lstm.html
Comments on LSTMs • LSTM unit replaces simple RNN unit • LSTM internal weights still trained with backpropagation • Cell value has feedback loop: can remember value indefinitely • Function of gates (“input”, “forget”, “output”) is learned via minimizing loss
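The gating can be written out directly. The sketch below (assuming NumPy, with illustrative parameter containers) shows the standard LSTM update, in which sigmoid gates decide what to write into the cell, what to keep, and what to expose.

```python
# Sketch of one LSTM step (assumed NumPy; parameter layout is illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """W, U, b are dicts of input, recurrent, and bias parameters for the
    input (i), forget (f), output (o) gates and the candidate value (g)."""
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate: what to write
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate: what to keep
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate: what to expose
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate cell value
    c = f * c_prev + i * g      # cell state: the feedback loop that can remember indefinitely
    h = o * np.tanh(c)          # hidden state passed to the rest of the network
    return h, c

dim = 4
W = {k: 0.1 * np.random.randn(dim, dim) for k in "ifog"}
U = {k: 0.1 * np.random.randn(dim, dim) for k in "ifog"}
b = {k: np.zeros(dim) for k in "ifog"}
h, c = lstm_step(np.random.randn(dim), np.zeros(dim), np.zeros(dim), W, U, b)
```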
Google “Neural Machine Translation” (unfolded in time) From https://arxiv.org/pdf/1609.08144.pdf
Neural Machine Translation training • Maximum likelihood, using gradient descent on the weights • Trained on a very large corpus of parallel texts in the source (X) and target (Y) languages.
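In code, “maximum likelihood” amounts to minimizing the cross-entropy (negative log-likelihood) of the reference translation at every target position. A hedged sketch, assuming PyTorch and using random tensors as stand-ins for real decoder output:

```python
# Hedged sketch of the maximum-likelihood training objective (assumed PyTorch).
import torch
import torch.nn.functional as F

batch, tgt_len, tgt_vocab = 32, 20, 8000
logits  = torch.randn(batch, tgt_len, tgt_vocab, requires_grad=True)  # decoder scores
targets = torch.randint(0, tgt_vocab, (batch, tgt_len))               # reference words

# Negative log-likelihood of the reference words, averaged over all positions;
# gradient descent on the network weights minimizes this.
loss = F.cross_entropy(logits.reshape(-1, tgt_vocab), targets.reshape(-1))
loss.backward()
```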
How to evaluate automated translations? Human raters’ side-by-side comparisons, on a scale of 0 to 6:
• 0: “completely nonsense translation”
• 2: “the sentence preserves some of the meaning of the source sentence but misses significant parts”
• 4: “the sentence retains most of the meaning of the source sentence, but may have some grammar mistakes”
• 6: “perfect translation: the meaning of the translation is completely consistent with the source, and the grammar is correct.”
Automating Image Captioning • Training: large dataset of image/caption pairs from Flickr and other sources • Figure labels: CNN features; word embeddings of the words in the caption; softmax probability distribution over the vocabulary • Vinyals et al., “Show and Tell: A Neural Image Caption Generator”, CVPR 2015
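In the Show-and-Tell setup those figure labels fit together roughly as follows: a CNN encodes the image into a feature vector that seeds an LSTM, which is then trained to emit the caption words through a softmax over the vocabulary. The sketch below assumes PyTorch; the dimensions, class name, and the detail of feeding the image feature as the first input token are illustrative.

```python
# Minimal captioning-decoder sketch (assumed PyTorch); the image is assumed to be
# already encoded into a CNN feature vector.
import torch
import torch.nn as nn

VOCAB, EMB, HID, FEAT = 10_000, 300, 512, 2048

class CaptionDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(FEAT, EMB)        # CNN features -> LSTM input space
        self.word_embed = nn.Embedding(VOCAB, EMB)  # word embeddings for caption words
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)            # softmax scores over the vocabulary

    def forward(self, cnn_features, caption_ids):
        # Feed the projected image feature as the first "word" of the sequence.
        img_token = self.img_proj(cnn_features).unsqueeze(1)   # (batch, 1, EMB)
        words = self.word_embed(caption_ids)                   # (batch, T, EMB)
        hidden_states, _ = self.lstm(torch.cat([img_token, words], dim=1))
        return self.out(hidden_states)   # trained with cross-entropy against the caption
```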
“NeuralTalk” sample results From http://cs.stanford.edu/people/karpathy/deepimagesent/generationdemo/
Microsoft Captionbot https://www.captionbot.ai/
From Andrej Karpathy’s Blog, Oct. 22, 2012: “The State of Computer Vision and AI: We are Really, Really Far Away.” What knowledge do you need to understand this situation? http://karpathy.github.io/2012/10/22/state-of-computer-vision/
Microsoft CaptionBot.ai: “I can understand the content of any photograph and I’ll try to describe it as well as any human.”
Winograd Schema “Common Sense” Challenge I poured water from the bottle into the cup until it was full. What was full? I poured water from the bottle into the cup until it was empty. What was empty? Winograd Schemas (Levesque et al., 2011)
Winograd Schema “Common Sense” Challenge The steel ball hit the glass table and it shattered. What shattered? The glass ball hit the steel table and it shattered. What shattered? Winograd Schemas (Levesque et al., 2011)
State-of-the-art AI: ~60% (vs. 50% with random guessing) Humans: 100% (if paying attention) “When AI can’t determine what ‘it’ refers to in a sentence, it’s hard to believe that it will take over the world.” — Oren Etzioni, Allen Institute for AI
https://www.seattletimes.com/business/technology/paul-allen-invests-125-million-to-teach-computers-common-sense/ https://allenai.org/alexandria/
https://www.darpa.mil/news-events/2018-10-11 Today’s machine learning systems are more advanced than ever, capable of automating increasingly complex tasks and serving as a critical tool for human operators. Despite recent advances, however, a critical component of Artificial Intelligence (AI) remains just out of reach – machine common sense. Defined as “the basic ability to perceive, understand, and judge things that are shared by nearly all people and can be reasonably expected of nearly all people without need for debate,” common sense forms a critical foundation for how humans interact with the world around them. Possessing this essential background knowledge could significantly advance the symbiotic partnership between humans and machines. But articulating and encoding this obscure-but-pervasive capability is no easy feat. “The absence of common sense prevents an intelligent system from understanding its world, communicating naturally with people, behaving reasonably in unforeseen situations, and learning from new experiences,” said Dave Gunning, a program manager in DARPA’s Information Innovation Office (I2O). “This absence is perhaps the most significant barrier between the narrowly focused AI applications we have today and the more general AI applications we would like to create in the future.”
Allen Institute for AI Common Sense Challenge • Which factor will most likely cause a person to develop a fever? (A) a leg muscle relaxing after exercise (B) a bacterial population in the bloodstream (C) several viral particles on the skin (D) carbohydrates being digested in the stomach