
Multimodal Deep Learning

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng. Stanford University.


Presentation Transcript


  1. Multimodal Deep Learning. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng, Stanford University

  2. McGurk Effect

  3. Audio-Visual Speech Recognition

  4. Feature Challenge. [Diagram: features feeding a classifier (e.g. SVM)]

  5. Representing Lips • Can we learn better representations for audio/visual speech recognition? • How can multimodal data (multiple sources of input) be used to find better features?

  6. Unsupervised Feature Learning. [Figure: raw input mapped to a learned feature vector]

  7. Unsupervised Feature Learning. [Figure: raw input mapped to a learned feature vector]

  8. Multimodal Features. [Figure: audio and video inputs mapped to a joint feature vector]

  9. Cross-Modality Feature Learning. [Figure: a single modality mapped to a learned feature vector]

  10. Feature Learning Models

  11. Feature Learning with Autoencoders. [Diagram: two separate autoencoders, Audio Input to Audio Reconstruction and Video Input to Video Reconstruction]
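
The per-modality setup on this slide can be illustrated with a minimal autoencoder sketch. This is a hedged example in PyTorch, not the authors' implementation; the layer sizes, sigmoid activation, and MSE loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnimodalAutoencoder(nn.Module):
    """Autoencoder for a single modality (audio or video). Sketch only:
    layer sizes and activations are illustrative, not the paper's."""
    def __init__(self, input_dim, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        h = self.encoder(x)           # learned features
        return self.decoder(h), h     # reconstruction and features

# One model per modality, each trained to reconstruct its own input.
audio_ae = UnimodalAutoencoder(input_dim=100)    # e.g. a spectrogram frame
video_ae = UnimodalAutoencoder(input_dim=4800)   # e.g. a 60x80 lip region
x_audio = torch.randn(32, 100)
recon, features = audio_ae(x_audio)
loss = nn.functional.mse_loss(recon, x_audio)
```

The hidden activations, rather than the raw input, then serve as features for the downstream classifier.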

  12. Bimodal Autoencoder. [Diagram: Audio Input and Video Input feeding a shared Hidden Representation that reconstructs both Audio and Video]

  13. Bimodal Autoencoder. [Diagram: Audio Input and Video Input feeding a shared Hidden Representation that reconstructs both Audio and Video]
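
The bimodal version fuses both modalities into one hidden representation that must reconstruct both inputs. A minimal sketch, again in PyTorch, with assumed dimensions (e.g. 100-dim audio frames, 60x80 = 4800-dim lip images) rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Shared hidden representation over audio + video (illustrative sketch)."""
    def __init__(self, audio_dim=100, video_dim=4800, hidden_dim=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, 256), nn.Sigmoid())
        self.shared = nn.Sequential(nn.Linear(512, hidden_dim), nn.Sigmoid())
        self.audio_dec = nn.Linear(hidden_dim, audio_dim)
        self.video_dec = nn.Linear(hidden_dim, video_dim)

    def forward(self, audio, video):
        # Encode each modality, then fuse into one shared hidden layer.
        h = self.shared(torch.cat([self.audio_enc(audio),
                                   self.video_enc(video)], dim=1))
        # Decode BOTH modalities from the same shared representation.
        return self.audio_dec(h), self.video_dec(h), h
```

Because both decoder heads read the same hidden layer, the representation is pushed to capture audio-video correlations; zeroing out one modality's input at test time gives the single-modality (cross-modality) use shown on the following slides.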

  14. Shallow Learning • Mostly unimodal features learned. [Diagram: a single layer of Hidden Units over Audio and Video Inputs]

  15. Bimodal Autoencoder. [Diagram: Audio Input and Video Input feeding a shared Hidden Representation that reconstructs both Audio and Video]

  16. Bimodal Autoencoder. [Diagram: Video Input alone feeding the Hidden Representation, which still reconstructs both Audio and Video] Cross-modality Learning: learn better video features by using audio as a cue

  17. Cross-modality Deep Autoencoder. [Diagram: Video Input passing through several hidden layers to a Learned Representation that reconstructs both Audio and Video]

  18. Cross-modality Deep Autoencoder. [Diagram: Audio Input passing through several hidden layers to a Learned Representation that reconstructs both Audio and Video]

  19. Bimodal Deep Autoencoders. [Diagram: Video Input ("Visemes", mouth shapes) and Audio Input ("Phonemes") passing through modality-specific layers into a Shared Representation that reconstructs both modalities]

  20. Bimodal Deep Autoencoders. [Diagram: Video Input ("Visemes", mouth shapes) alone driving the shared representation and both reconstructions]

  21. Bimodal Deep Autoencoders. [Diagram: Audio Input ("Phonemes") alone driving the shared representation and both reconstructions]

  22. Bimodal Deep Autoencoders. [Diagram: Video Input ("Visemes", mouth shapes) and Audio Input ("Phonemes") passing through modality-specific layers into a Shared Representation that reconstructs both modalities]

  23. Training the Bimodal Deep Autoencoder. [Diagram: three training configurations (both modalities, audio only, video only), each reconstructing both Audio and Video] • Train a single model to perform all 3 tasks • Similar in spirit to denoising autoencoders (Vincent et al., 2008); a training sketch follows below
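
The three training tasks can be sketched as input masking: each update presents both modalities, audio only, or video only, but always asks for both reconstructions, much as a denoising autoencoder reconstructs clean data from corrupted input. A minimal PyTorch sketch with assumed sizes; the video-only mode is the cross-modality case from slide 16:

```python
import torch
import torch.nn as nn

# Tiny bimodal autoencoder (illustrative sizes, not the paper's).
audio_dim, video_dim, hid = 100, 4800, 512
enc = nn.Sequential(nn.Linear(audio_dim + video_dim, hid), nn.Sigmoid())
dec_a = nn.Linear(hid, audio_dim)
dec_v = nn.Linear(hid, video_dim)
params = list(enc.parameters()) + list(dec_a.parameters()) + list(dec_v.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def step(audio, video, mask_mode):
    """One update; mask_mode is 'both', 'audio_only', or 'video_only'."""
    a_in = audio if mask_mode != 'video_only' else torch.zeros_like(audio)
    v_in = video if mask_mode != 'audio_only' else torch.zeros_like(video)
    h = enc(torch.cat([a_in, v_in], dim=1))
    # Reconstruct BOTH modalities regardless of which inputs were masked.
    loss = nn.functional.mse_loss(dec_a(h), audio) + \
           nn.functional.mse_loss(dec_v(h), video)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

audio = torch.randn(32, audio_dim); video = torch.randn(32, video_dim)
for mode in ['both', 'audio_only', 'video_only']:  # the 3 training tasks
    step(audio, video, mode)
```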

  24. Evaluations

  25. Visualizations of Learned Features. [Figure: audio (spectrogram) and video features learned over 100 ms windows, shown at 0, 33, 67, and 100 ms]
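
For context on the audio side, spectrogram features over a roughly 100 ms window can be computed with a short-time Fourier transform. A numpy sketch assuming 16 kHz audio and 33 ms hops; the paper's actual preprocessing is not reproduced here:

```python
import numpy as np

def spectrogram_window(signal, sr=16000, frame_ms=25, hop_ms=33, n_frames=4):
    """Stack log-magnitude FFT frames covering ~100 ms of audio (sketch)."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = []
    for i in range(n_frames):  # offsets at 0, 33, 67, 100 ms
        start = i * hop
        chunk = signal[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(chunk))
        frames.append(np.log(mag + 1e-8))
    return np.concatenate(frames)  # one audio feature vector

audio = np.random.randn(16000)      # 1 s of stand-in 16 kHz audio
feat = spectrogram_window(audio)
print(feat.shape)
```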

  26. Lip-reading with AVLetters. [Diagram: cross-modality deep autoencoder applied to Video Input] • AVLetters: 26-way letter classification, 10 speakers, 60x80-pixel lip regions • Cross-modality learning

  27. Lip-reading with AVLetters

  28. Lip-reading with AVLetters

  29. Lip-reading with AVLetters

  30. Lip-reading with CUAVE. [Diagram: cross-modality deep autoencoder applied to Video Input] • CUAVE: 10-way digit classification, 36 speakers • Cross-modality learning

  31. Lip-reading with CUAVE

  32. Lip-reading with CUAVE

  33. Lip-reading with CUAVE

  34. Multimodal Recognition. [Diagram: bimodal deep autoencoder with a Shared Representation over Audio and Video Inputs] • CUAVE: 10-way digit classification, 36 speakers • Evaluate in clean and noisy audio scenarios (a noise-mixing sketch follows below) • In the clean audio scenario, audio alone performs extremely well
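
The noisy-audio scenario can be simulated by mixing white Gaussian noise into the clean signal at a chosen signal-to-noise ratio. A numpy sketch; the noise type and SNR levels used in the paper are not reproduced here:

```python
import numpy as np

def add_noise(signal, snr_db):
    """Mix white Gaussian noise into `signal` at the given SNR in dB."""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = np.random.randn(len(signal)) * np.sqrt(noise_power)
    return signal + noise

clean = np.random.randn(16000)        # stand-in for a clean utterance
noisy = add_noise(clean, snr_db=0)    # 0 dB: noise as strong as the signal
```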

  35. Multimodal Recognition

  36. Multimodal Recognition

  37. Multimodal Recognition

  38. Shared Representation Evaluation. [Diagram: supervised testing with a Linear Classifier on the Shared Representation, trained on Audio and tested on Video]

  39. Shared Representation Evaluation. [Diagram: the same cross-modal training/testing setup] Method: Learned Features + Canonical Correlation Analysis
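
Canonical Correlation Analysis finds linear projections of the audio and video feature sets that are maximally correlated, giving a shared space into which either modality alone can be mapped at test time. A sketch with scikit-learn's CCA on made-up feature matrices:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Paired audio and video feature matrices (same examples; dims are made up).
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((500, 64))
video_feats = rng.standard_normal((500, 128))

# Project both modalities into a common 20-dim maximally correlated space.
cca = CCA(n_components=20)
audio_c, video_c = cca.fit_transform(audio_feats, video_feats)

# At test time, either modality alone can be mapped into the shared space,
# so a classifier trained on audio projections can be tested on video ones.
audio_test = cca.transform(rng.standard_normal((10, 64)))
```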

  40. McGurk Effect. A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

  41. McGurk Effect. A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

  42. Conclusion. [Diagram: bimodal deep autoencoder and cross-modality deep autoencoder side by side] • Applied deep autoencoders to discover features in multimodal data • Cross-modality learning: obtained better video features (for lip-reading) using audio as a cue • Multimodal feature learning: learned representations that relate audio and video data


  45. Bimodal Learning with RBMs. [Diagram: one layer of Hidden Units over concatenated Audio and Video Inputs]
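
A restricted Boltzmann machine over the concatenated inputs can be trained with contrastive divergence. A minimal numpy CD-1 sketch assuming binary visible units (real-valued audio/video features would call for Gaussian visible units); sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 100 + 4800, 512   # concatenated audio+video; hidden units
W = rng.standard_normal((n_vis, n_hid)) * 0.01
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, lr=0.01):
    """One CD-1 step on a batch of binary visible vectors (sketch)."""
    global W, b_v, b_h
    ph0 = sigmoid(v0 @ W + b_h)                      # hidden probs given data
    h0 = (rng.random(ph0.shape) < ph0).astype(float) # sampled hidden states
    v1 = sigmoid(h0 @ W.T + b_v)                     # mean-field reconstruction
    ph1 = sigmoid(v1 @ W + b_h)
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)    # positive - negative stats
    b_v += lr * (v0 - v1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)

# Fake binary batch: audio part and video part concatenated per example.
batch = (rng.random((32, n_vis)) < 0.1).astype(float)
cd1_update(batch)
```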
