Multimodal Deep Learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng
Stanford University
McGurk Effect
Audio-Visual Speech Recognition
Feature Challenge
(diagram: features feeding a classifier, e.g. an SVM)
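The pipeline this slide sets up (raw input, then a feature stage, then a standard classifier) can be sketched end to end. The sketch below is a toy stand-in, not the talk's setup: random data, a fixed random-projection feature map in place of learned features, and a one-vs-all least-squares linear classifier in place of the SVM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 100 examples, 20 raw dimensions, 3 classes.
X = rng.normal(size=(100, 20))
y = rng.integers(0, 3, size=100)

# Feature stage: a fixed random projection + ReLU
# (a placeholder for the learned features discussed in the talk).
W = rng.normal(size=(20, 50))

def features(x):
    return np.maximum(x @ W, 0.0)

# Classifier stage: one-vs-all linear classifier fit by least squares,
# a simple stand-in for the SVM named on the slide.
Y = np.eye(3)[y]                      # one-hot labels, shape (100, 3)
F = features(X)
coef, *_ = np.linalg.lstsq(F, Y, rcond=None)

pred = np.argmax(features(X) @ coef, axis=1)
print("training accuracy:", np.mean(pred == y))
```

The point of the slide is that the classifier stage is generic; the leverage is in the feature stage, which the rest of the talk replaces with learned features.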
Representing Lips
• Can we learn better representations for audio/visual speech recognition?
• How can multimodal data (multiple sources of input) be used to find better features?
Unsupervised Feature Learning
(figure: raw inputs mapped to learned feature vectors)
Multimodal Features
(figure: audio and video inputs mapped to a single joint feature vector)
Cross-Modality Feature Learning
(figure: features for one modality learned with the help of the other)
Feature Learning Models
Feature Learning with Autoencoders
(diagram: separate audio and video autoencoders, each reconstructing its own input from its own hidden layer)
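As a concrete sketch of the single-modality case, here is a minimal one-hidden-layer autoencoder in plain NumPy, trained by gradient descent on squared reconstruction error. The data, dimensions, and learning rate are illustrative toys, not the configuration used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))        # toy stand-in for audio frames, 30-dim

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden = 10
W1 = rng.normal(scale=0.1, size=(30, n_hidden))   # encoder weights
W2 = rng.normal(scale=0.1, size=(n_hidden, 30))   # decoder weights

lr = 0.1
for _ in range(500):
    H = sigmoid(X @ W1)               # hidden code: the learned features
    R = H @ W2                        # linear reconstruction of the input
    err = R - X
    # Backpropagate the mean squared reconstruction error.
    gW2 = H.T @ err / len(X)
    gH = err @ W2.T * H * (1.0 - H)
    gW1 = X.T @ gH / len(X)
    W1 -= lr * gW1
    W2 -= lr * gW2

mse = np.mean((sigmoid(X @ W1) @ W2 - X) ** 2)
print("final reconstruction MSE:", mse)
```

After training, the hidden activations H (not the reconstruction) serve as features for a downstream classifier.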
Bimodal Autoencoder
(diagram: concatenated audio and video inputs, a single hidden representation, and reconstructions of both modalities)
Shallow Learning
(diagram: one layer of hidden units over the audio and video inputs)
• Mostly unimodal features learned
Bimodal Autoencoder
(diagram: video input only, still reconstructing both audio and video)
• Cross-modality learning: learn better video features by using audio as a cue
Cross-modality Deep Autoencoder
(diagram: a single modality, video or audio, passes through several hidden layers to a learned representation that reconstructs both modalities)
Bimodal Deep Autoencoders
(diagram: audio and video inputs, modality-specific deep layers, a shared representation, and reconstructions of both modalities)
• The video pathway models “visemes” (mouth shapes); the audio pathway models “phonemes”
Bimodal Deep Autoencoders
(diagram variants: the same network driven by video input alone, “visemes”, or by audio input alone, “phonemes”)
Training the Bimodal Deep Autoencoder
(diagram: the same network trained with both inputs, with audio input alone, and with video input alone, always reconstructing both modalities)
• Train a single model to perform all 3 tasks
• Similar in spirit to denoising autoencoders (Vincent et al., 2008)
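The 3-task training scheme above can be sketched with a shallow version of the model: one shared hidden layer over the concatenated modalities, with the input of one modality zeroed out on two of the three tasks while the target always includes both. Everything here is a toy assumption (synthetic correlated data, small dimensions, plain gradient descent), not the talk's deep architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Toy correlated modalities: both views are generated from shared latent factors.
Z = rng.normal(size=(n, 5))
A = Z @ rng.normal(size=(5, 20))      # "audio", 20-dim
V = Z @ rng.normal(size=(5, 40))      # "video", 40-dim
target = np.hstack([A, V])            # the target is ALWAYS both modalities

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(scale=0.1, size=(60, 15))   # encoder to the shared code
W2 = rng.normal(scale=0.1, size=(15, 60))   # decoder back to both modalities

lr = 0.02
for step in range(600):
    # Cycle through the 3 tasks on the slide:
    # both inputs, audio only (video zeroed), video only (audio zeroed).
    X = target.copy()
    if step % 3 == 1:
        X[:, 20:] = 0.0
    elif step % 3 == 2:
        X[:, :20] = 0.0
    H = sigmoid(X @ W1)               # shared representation
    err = H @ W2 - target
    gW2 = H.T @ err / n
    gH = err @ W2.T * H * (1.0 - H)
    W1 -= lr * (X.T @ gH) / n
    W2 -= lr * gW2
```

Zeroing a whole modality while still demanding its reconstruction is what makes this similar in spirit to denoising autoencoders: the corruption is modality-sized rather than per-unit.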
Evaluations
Visualizations of Learned Features
(figure: audio (spectrogram) and video features learned over 100 ms windows, with frames at 0, 33, 67 and 100 ms)
Lip-reading with AVLetters
(diagram: cross-modality deep autoencoder with video input)
• AVLetters: 26-way letter classification, 10 speakers, 60x80-pixel lip regions
• Setting: cross-modality learning
Lip-reading with CUAVE
(diagram: cross-modality deep autoencoder with video input)
• CUAVE: 10-way digit classification, 36 speakers
• Setting: cross-modality learning
Multimodal Recognition
(diagram: bimodal deep autoencoder with audio and video inputs and a shared representation)
• CUAVE: 10-way digit classification, 36 speakers
• Evaluate in clean and noisy audio scenarios
• In the clean audio scenario, audio alone performs extremely well
Shared Representation Evaluation
(diagram: supervised training and testing through the shared representation, with audio on one side and video on the other)
• Train a linear classifier on the shared representation computed from one modality; test it on the shared representation computed from the other
• Method: learned features + Canonical Correlation Analysis
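The Canonical Correlation Analysis step named on this slide can be sketched directly in NumPy: center and whiten each view, then take the SVD of the whitened cross-covariance; the singular values are the canonical correlations. The toy data below, two views driven by a shared latent factor, merely stands in for the learned audio and video features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 3))           # shared latent factors
A = Z @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(n, 10))   # "audio" view
V = Z @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(n, 12))   # "video" view

def cca(X, Y, k):
    """Top-k canonical directions via whitening + SVD of the cross-covariance."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / n + 1e-6 * np.eye(X.shape[1])   # small ridge for stability
    Cyy = Y.T @ Y / n + 1e-6 * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):                  # C^(-1/2) via an eigendecomposition
        w, U = np.linalg.eigh(C)
        return U @ np.diag(1.0 / np.sqrt(w)) @ U.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    # Per-view projection matrices, plus the canonical correlations.
    return Wx @ U[:, :k], Wy @ Vt[:k].T, s[:k]

Wa, Wv, corrs = cca(A, V, k=3)
print("top canonical correlations:", np.round(corrs, 3))
```

Projecting both views with Wa and Wv places audio and video in a common space, which is what lets a classifier trained on one modality be tested on the other.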
McGurk Effect
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.
Conclusion
(diagrams: the cross-modality and bimodal deep autoencoders)
• Applied deep autoencoders to discover features in multimodal data
• Cross-modality learning: obtained better video features (for lip-reading) by using audio as a cue
• Multimodal feature learning: learned representations that relate audio and video data
Bimodal Learning with RBMs
(diagram: a layer of hidden units connected to both the audio and video inputs)
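A minimal sketch of this idea: a restricted Boltzmann machine whose visible layer concatenates audio and video units, trained with one step of contrastive divergence (CD-1). The binary toy data simply duplicates one pattern across both halves so the modalities are perfectly correlated; the sizes and training schedule are assumptions, not the talk's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary bimodal data: 8 "audio" + 8 "video" visible units,
# with the video half an exact copy of the audio half.
pattern = (rng.random((100, 8)) < 0.3).astype(float)
data = np.hstack([pattern, pattern])

n_vis, n_hid = 16, 8
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b_v = np.zeros(n_vis)                 # visible biases
b_h = np.zeros(n_hid)                 # hidden biases

lr = 0.1
for _ in range(100):
    # Positive phase: hidden probabilities and a sample given the data.
    ph = sigmoid(data @ W + b_h)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: one mean-field reconstruction step (CD-1).
    pv = sigmoid(h @ W.T + b_v)
    ph2 = sigmoid(pv @ W + b_h)
    W += lr * (data.T @ ph - pv.T @ ph2) / len(data)
    b_v += lr * (data - pv).mean(axis=0)
    b_h += lr * (ph - ph2).mean(axis=0)

recon = sigmoid(sigmoid(data @ W + b_h) @ W.T + b_v)
print("mean reconstruction error:", np.mean((recon - data) ** 2))
```

Because the hidden units see both halves of the visible layer, they can capture correlations between the modalities, which is the motivation for the bimodal RBM on this slide.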