
Deep Learning RUSSIR 2017 – Day 3

This recap explores how deep learning can handle sequential data such as text, video, and handwriting. The discussion covers machine translation, image captioning, question answering, and handwriting, and introduces the recurrent networks behind them. Discover the challenges and potential applications of deep learning for analyzing and generating sequential data.


Presentation Transcript


  1. Deep Learning RUSSIR 2017 – Day 3 Efstratios Gavves

  2. Recap from Day 2 • Deep Learning is good not only for classifying things • Structured prediction is also possible • Multi-task structured prediction allows for unified networks and for handling missing data • Discovering structure in data is also possible

  3. Sequential Data • So far, all tasks assumed stationary data • Not all data, and not all tasks, are stationary though

  4. Sequential Data: Text “What …”

  5. Sequential Data: Text “What about …”

  6. Sequential Data: Text What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

  7. Memory needed What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

  8. Sequential data: Video

  9. Quiz: Other sequential data?

  10. Quiz: Other sequential data? • Time series data • Stock exchange • Biological measurements • Climate measurements • Market analysis • Speech/Music • User behavior in websites • ...

  11. Applications Click to go to the video in Youtube Click to go to the website

  12. Machine Translation [Diagram: an encoder LSTM reads “Today the weather is good <EOS>”; a decoder LSTM then emits “Погода сегодня хорошая <EOS>” one word per step] • The phrase in the source language is one sequence • “Today the weather is good” • The phrase in the target language is also a sequence • “Погода сегодня хорошая” (“The weather is good today”)
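
The encoder-decoder idea above can be sketched in a few lines. This is a minimal illustrative sketch (not the presenter's code), assuming PyTorch's nn.Embedding / nn.LSTM / nn.Linear modules; all layer sizes are made-up hyperparameters.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder sketch for machine translation (illustrative sizes only)."""
    def __init__(self, src_vocab, tgt_vocab, emb=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder reads the whole source phrase; its final (h, c) state
        # summarizes "Today the weather is good <EOS>".
        _, state = self.encoder(self.src_emb(src_ids))
        # The decoder starts from that state and predicts the target phrase
        # one token per step.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # logits over the target vocabulary at each step
```

At training time the decoder input is the ground-truth target shifted by one token (teacher forcing); at test time the decoder feeds its own predictions back in until it emits <EOS>.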

  13. Image captioning [Diagram: a Convnet encodes the image, then an LSTM decoder emits “Today the weather is good <EOS>”] • An image is a thousand words, literally! • Pretty much the same as machine translation • Replace the encoder part with the output of a Convnet • E.g. use an Alexnet or a VGG16 network • Keep the decoder part to operate as a translator
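
Following the slide, the only change to the sketch above is to swap the text encoder for Convnet features. A minimal variant, again illustrative: the 4096-dimensional feature size (AlexNet/VGG16 fc7) and the choice to inject the image as the decoder's initial state are assumptions made for the example.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Image captioning sketch: Convnet features initialize the LSTM decoder state."""
    def __init__(self, vocab_size, feat_dim=4096, emb=128, hidden=256):
        super().__init__()
        self.feat_to_h0 = nn.Linear(feat_dim, hidden)  # image features -> initial memory
        self.emb = nn.Embedding(vocab_size, emb)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, img_feats, word_ids):
        h0 = torch.tanh(self.feat_to_h0(img_feats)).unsqueeze(0)  # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.emb(word_ids), (h0, c0))
        return self.out(dec_out)  # logits for the next caption word at every step
```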

  14. Demo Click to go to the video in Youtube

  15. Question answering Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter? A: John entered the living room. Q: What are the people playing? A: They play beach football • Bleeding-edge research, no real consensus • Very interesting open research problems • Again, pretty much like machine translation • Again, the Encoder-Decoder paradigm • Insert the question into the encoder part • Model the answer at the decoder part • Question answering with images is also possible • Again, bleeding-edge research • How/where to add the image? • What has worked so far is to add the image only at the beginning

  16. Demo Click to go to the website

  17. Handwriting Click to go to the website

  18. Recurrent Networks [Diagram: an NN cell with an input, an output, a cell state, and recurrent connections feeding the state back into the cell]

  19. Sequences • Next data depend on previous data • Roughly equivalent to predicting what comes next • What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

  20. Why sequences? “I think, therefore, I am!” “Everything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter.” • Parameters are reused → fewer parameters → easier modelling • Generalizes well to arbitrary lengths → not constrained to a specific length • However, often we pick a “frame” instead of an arbitrary length

  21. Quiz: What is really a sequence? • Data inside a sequence are … ? [Figure: the words “I”, “am”, “Bond”, “,”, “James”, “tired”, “!”, “McGuire” laid out as a word sequence]

  22. Quiz: What is really a sequence? • Data inside a sequence are non-i.i.d. • Not independent and identically distributed • The next “word” depends on the previous “words” • Ideally on all of them • We need context, and we need memory! • How to model context and memory? [Figure: the same word sequence as on the previous slide]

  23. One-hot vectors • Vocabulary: I, am, Bond, James, tired, “,”, McGuire, “!” • Each word becomes a one-hot vector: all zeros except for the active dimension • 12 words in a sequence → 12 one-hot vectors • After the one-hot vectors, apply an embedding • Word2Vec, GloVe
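
A minimal sketch of that step using the slide's toy vocabulary; the word-to-index assignment and the 50-dimensional embedding size are arbitrary assumptions, and in practice the embedding matrix would come from Word2Vec/GloVe or be learned, as the slide notes.

```python
import numpy as np

# Toy vocabulary from the slide; the index assignment itself is arbitrary.
vocab = ["I", "am", "Bond", ",", "James", "tired", "!", "McGuire"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """All zeros except a single 1 at the word's dimension."""
    v = np.zeros(vocab_size)
    v[word_to_idx[word]] = 1.0
    return v

sentence = "I am Bond , James Bond !".split()
one_hots = np.stack([one_hot(w) for w in sentence])   # shape: (7 words, 8 dims)

# An embedding is then just a matrix multiplied with the one-hot vector,
# i.e. a row lookup (this is what Word2Vec/GloVe tables provide).
embedding_matrix = np.random.randn(len(vocab), 50)    # 50-dim embeddings (assumed size)
embedded = one_hots @ embedding_matrix                # shape: (7, 50)
```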

  24. Quiz: Why not just indices? • Index representation (one integer per word: I, am, James, McGuire) OR one-hot representation (one binary vector per word, all zeros except a single 1)?

  25. Quiz: Why not just indices? • No, because then some words get closer together for no good reason (an artificial bias between words) • Index representation (one integer per word) vs. one-hot representation (one binary vector per word)
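
A quick numerical check of that answer, with an arbitrary (assumed) index assignment:

```python
import numpy as np

# Arbitrary index assignment, purely for illustration.
idx = {"I": 0, "am": 1, "James": 2, "McGuire": 3}

# With plain indices, "James" looks twice as far from "I" as from "am",
# purely because of the numbering we happened to choose.
print(abs(idx["James"] - idx["I"]))    # 2
print(abs(idx["James"] - idx["am"]))   # 1

# With one-hot vectors, every pair of distinct words is equally far apart.
onehot = np.eye(len(idx))
print(np.linalg.norm(onehot[idx["James"]] - onehot[idx["I"]]))   # 1.414...
print(np.linalg.norm(onehot[idx["James"]] - onehot[idx["am"]]))  # 1.414...
```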

  26. Memory • A representation of the past • A memory must project the information at a given timestep onto a latent space, using some parameters • Then, re-use that projected information at the next timestep • Memory parameters are shared for all timesteps
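
In symbols (the transcript dropped the formulas, so this notation is assumed; it matches the RNN equations used later in the deck):

  c_t = h(x_t, c_{t-1}; θ)        (project the current input x_t and the previous memory c_{t-1} onto the latent space)
  c_{t+1} = h(x_{t+1}, c_t; θ)    (the same parameters θ are re-used at the next timestep)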

  27. Memory as a Graph [Diagram: input (with input parameters) → memory (with memory parameters, looping back into itself) → output (with output parameters)] • Simplest model • Input with parameters • Memory embedding with parameters • Output with parameters

  30. Folding the memory [Diagram: the same recurrent network shown as a folded network (a single cell with a loop) and as an unrolled/unfolded network (one copy per timestep)]

  31. Recurrent Neural Network (RNN) Only two equations
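
The two equations the slide refers to are the standard RNN update and output; the weight names U, W, V below follow the usual convention and are filled in here because the transcript dropped the formulas:

  c_t = tanh(U x_t + W c_{t-1})    (memory update from the current input and the previous memory)
  y_t = softmax(V c_t)             (output distribution, e.g. over the vocabulary)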

  32. RNN Example • Vocabulary of 5 words • A memory of 3 units • A hyperparameter that we choose, like layer size • An input projection of 3 dimensions • An output projection of 10 dimensions
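
A tiny forward pass with the sizes the slide lists, using the two equations above; this is only a sketch with random weights, not the lecture's worked example.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, input_dim, memory_dim, output_dim = 5, 3, 3, 10   # sizes from the slide

U = rng.standard_normal((input_dim, vocab_size))    # input projection
W = rng.standard_normal((memory_dim, memory_dim))   # memory (recurrent) weights
V = rng.standard_normal((output_dim, memory_dim))   # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A short sequence of one-hot word vectors (word indices 2, 0, 4).
x_seq = np.eye(vocab_size)[[2, 0, 4]]

c = np.zeros(memory_dim)                  # memory starts empty
for x_t in x_seq:
    c = np.tanh(U @ x_t + W @ c)          # the same U, W are reused at every step
    y_t = softmax(V @ c)                  # output at this step
    print(y_t.shape)                      # (10,)
```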

  33. Rolled Network vs. MLP? [Diagram: a 3-gram unrolled recurrent network (“steps” 1-3 plus a final output) next to a 3-layer neural network] • What is really different? • Steps instead of layers • Step parameters are shared in the Recurrent Network • In a Multi-Layer Network the parameters are different per layer

  35. Rolled Network vs. MLP? • Sometimes the intermediate outputs are not even needed • Removing them, we almost end up with a standard Neural Network • What is really different? • Steps instead of layers • Step parameters are shared in the Recurrent Network • In a Multi-Layer Network the parameters are different per layer

  37. Training Recurrent Networks • Cross-entropy loss • Backpropagation Through Time (BPTT) • Again, chain rule • Only difference: Gradients survive over time steps
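
In symbols (notation assumed, matching the equations above): the loss is the cross-entropy summed over timesteps, and BPTT is the chain rule applied through the unrolled graph, so each timestep's loss sends gradient back through all earlier memories.

  L = Σ_t L_t,   with   L_t = - Σ_k y*_{t,k} log y_{t,k}    (y* is the target, y the prediction)
  ∂L/∂W = Σ_t ∂L_t/∂W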

  38. Training RNNs is hard • Vanishing gradients • After a few time steps the gradients become almost 0 • Exploding gradients • After a few time steps the gradients become huge • Can’t capture long-term dependencies

  39. Alternative formulation for RNNs An alternative formulation to derive conclusions and intuitions

  40. Another look at the gradients • The RNN gradient is a recursive product of short-term factors and long-term factors
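
Written out, this is the decomposition from Pascanu et al. (2013), cited later in the deck; the notation is assumed to match the RNN equations above:

  ∂L_t/∂W = Σ_{τ ≤ t} (∂L_t/∂c_t) · (∂c_t/∂c_τ) · (∂c_τ/∂W)
  ∂c_t/∂c_τ = Π_{k = τ+1 .. t} ∂c_k/∂c_{k-1}

Terms with a small time lag t - τ are the short-term factors; terms with a large lag are the long-term factors.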

  41. Exploding/Vanishing gradients [Figure: one example of a vanishing gradient and one of an exploding gradient]

  42. Vanishing gradients • The gradient of the error w.r.t. an intermediate cell

  43. Vanishing gradients • The gradient of the error w.r.t. an intermediate cell • Long-term dependencies get exponentially smaller weights
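
The “exponentially smaller weights” claim can be made precise with the standard bound from Pascanu et al. (2013); here γ bounds the derivative of the nonlinearity (γ = 1 for tanh) and the notation is assumed:

  ‖∂c_t/∂c_τ‖ ≤ Π_{k = τ+1 .. t} γ ‖W‖ = (γ ‖W‖)^(t-τ)

If γ‖W‖ < 1, this factor shrinks exponentially with the time lag t - τ, so long-term dependencies barely contribute to the gradient.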

  44. Gradient clipping for exploding gradients • Scale the gradients to a threshold • Pseudocode: 1. compute the norm of the gradient 2. if the norm exceeds the threshold, rescale the gradient down to that threshold; else do nothing • Simple, but works!
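
A runnable version of that pseudocode (norm-based clipping as in Pascanu et al.; the threshold value is just an example):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient so its norm never exceeds `threshold`."""
    norm = np.linalg.norm(grad)
    if norm >= threshold:
        grad = grad * (threshold / norm)   # same direction, capped magnitude
    else:
        print("Do nothing")                # small gradients pass through unchanged
    return grad

g = np.array([30.0, -40.0])                # norm 50 -> exploding
print(clip_gradient(g))                    # rescaled to norm 5: [ 3. -4.]
```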

  45. Rescaling vanishing gradients? • Not a good solution • Weights are shared between timesteps, and the loss is summed over timesteps • Rescaling the gradient for one timestep affects all timesteps • The rescaling factor for one timestep does not work for another

  46. More intuitively [The slide writes the total gradient as a sum of a short-term term and a long-term term]

  47. More intuitively • The total gradient is a sum of a (large) short-term term and a (much smaller) long-term term • If the sum is rescaled to 1 → the longer-term dependencies become negligible → weak recurrent modelling → learning focuses on the short-term only
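
A concrete pair of made-up numbers illustrates the point, say a short-term factor of 10 and a long-term factor of 0.01:

  10 + 0.01 = 10.01   →   rescaled so the sum is 1:   0.999 + 0.001

The long-term part now contributes about 0.1% of the update, so learning is driven almost entirely by the short-term behaviour.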

  48. Recurrent networks can behave like chaotic systems

  49. Fixing vanishing gradients Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013 • Regularization on the recurrent weights • Force the error signal not to vanish • Advanced recurrent modules • Long Short-Term Memory (LSTM) module • Gated Recurrent Unit (GRU) module

  50. Advanced Recurrent Nets [Diagram: an advanced recurrent cell between input and output]
