
Computer-based Lip-reading using Motion Templates



  1. Computer-based Lip-reading using Motion Templates Wai Chee Yau, PhD graduate, School of Electrical and Computer Engineering, RMIT University, waichee@ieee.org PhD supervisor: A/Prof. Dinesh K. Kumar

  2. Outline • Overview of speech recognition • Non-audio speech modalities • Motivation for visual speech recognition • Related work • Contributions of this research • Mouth motion segmentation • Feature extraction • Classification • Temporal segmentation of utterances • Experiments • Discussion • Conclusions

  3. Overview of Speech Recognition Speech recognition: spoken words → computer inputs/commands. Audio signals (voice) are the most commonly used input for speech recognition systems. Audio speech recognition applications: • speech-to-text • voice dialing of mobile phones • call routing • aircraft control inside the pilot cockpit

  4. Overview of Speech Recognition Advantages: • Natural and convenient communication method • Suitable for users with disabilities who cannot use their hands to operate a computer • Useful for 'hands-eyes-busy' situations • Control of the car radio or navigation system while driving • Control of heavy machinery in factories Problems of audio speech recognition: • Affected by environmental noise • Inside a moving vehicle, noisy factories, pilot cockpits • Sensitive to speaking styles • Whispering and speaking loudly

  5. Overview of Speech Recognition Possible solutions: • Using noise-robust techniques such as • microphone arrays (Brandstein and Ward, 2001; Sullivan and Stern, 1993) • noise adaptation algorithms (Gales and Young, 1992; Stern et al., 1996) • Using non-audio speech modalities

  6. Non-Audio Speech Modalities Visual: videos and images • Infra-red camera mounted on a headset • In-car control system (G. Potamianos, "Audio-visual speech processing: Progress and challenges", VisHCI, Canberra, Australia, Nov. 2006)

  7. Non-Audio Speech Modalities Muscle activity signals • Control of a Mars rover, NASA Ames Research Lab • Lip-reading mobile phone, NTT DoCoMo ("Voiceless: Recognition Without the Voice", CIO Magazine, May 1, 2004; C. Jorgensen, D. D. Lee and S. Agabon, "Sub auditory speech recognition based on EMG/EPG signals", 2003)

  8. Non-Audio Speech Modalities Brain signals • Facial EMG signals for English and German vowel recognition (S. P. Arjunan, D. K. Kumar, W. C. Yau, H. Weghorn, "Unspoken Vowel Recognition Using Facial Electromyogram", EMBC, New York, 2006) http://www.eng.nus.edu.sg/EResnews/0202/rd/rd_4.html

  9. Non-Audio Speech Modalities Advantages of voice-less speech recognition: • The user can control the system without making a sound: • Uttering the PIN code of a security system • Defence or military applications • Control of computers/machines for users with speech impairments • When the audio signal is greatly affected by noise: • Control of the car radio in a vehicle

  10. Motivation for Using Visual Speech Recognition Advantages of the visual approach • Non-invasive • No sensors need to be placed on the user • Cameras are commonly available

  11. Motivation for Using Visual Speech Recognition Is he saying Ba? Ga? Da? http://www.media.uio.no/personer/arntm/McGurk_english.html

  12. Motivation for Visual Speech Recognition • McGurk effect (McGurk 1976): we combine what we 'see' with what we 'hear' when understanding speech • Sound /ba/ + lip movement /ga/ → 98% of us perceive /da/ • People with hearing impairment can lip-read by looking at the mouth of the speaker

  13. Block diagram of a lip-reading system

  14. Related Work Visual speech recognition techniques reported in the literature can be broadly divided into: • Appearance-based (Potamianos et al. 2004) • Uses image pixels in the surrounding mouth region • Shape-based (Petajan 1984, Adjoudani et al. 1996) • Uses the shape information of the mouth/lips (e.g. mouth height and width) • Motion-based (Mase & Pentland 1991) • Describes the mouth movements [Figure: appearance-based mouth region vs. shape-based height/width measurements]

  15. Related Work Few studies have focused on motion features, yet human perceptual studies indicate that dynamic information is important for visual speech perception (Rosenblum & Saldaña, 1998)

  16. Contributions of this Research This research: • proposes new motion features for computer-based lip-reading using motion templates (MT) • investigates the use of Zernike moments to derive rotation-invariant features • compares the performance of hidden Markov models (HMM) and support vector machines (SVM) for classification of the motion features

  17. Mouth Movement Segmentation Motion templates (MT) are 2-D grayscale images (Bobick et al. 2001) where the: • Intensity values indicate 'when' the movements occurred • Pixel locations indicate 'where' the movements happened Step 1: Compute the difference of frames (DOF) Step 2: Convert the DOFs into binary images Step 3: Temporally integrate the binary images with a linear ramp of time
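A minimal NumPy sketch of these three steps (assuming `frames` is a list of grayscale frames of a single utterance; the binarisation threshold of 15 is an illustrative value, not the one used in the thesis):

```python
import numpy as np

def motion_template(frames, threshold=15):
    """Build a motion template from a list of grayscale frames.
    Later movements overwrite earlier ones with brighter values, so
    intensity encodes 'when' and pixel position encodes 'where'."""
    n = len(frames)
    mt = np.zeros(frames[0].shape, dtype=np.float64)
    for t in range(1, n):
        # Step 1: difference of frames (DOF)
        dof = np.abs(frames[t].astype(np.int16) - frames[t - 1].astype(np.int16))
        # Step 2: binarise the DOF
        moving = dof > threshold
        # Step 3: temporal integration with a linear ramp of time
        mt[moving] = t / (n - 1)
    # Scale to 8-bit so MTs of utterances spoken at different speeds
    # are directly comparable (the normalisation on the next slide)
    return (255 * mt).astype(np.uint8)
```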

  18. Mouth Movement Segmentation • Removes the static elements and preserves the short-duration facial movements • Invariant, within limits, to skin colour • The intensity values of the MT are normalized to reduce the effect of variations in speaking speed • Histogram equalization is applied to the MT to minimize global changes in lighting conditions
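A small NumPy sketch of the histogram equalization step, assuming an 8-bit MT as produced above:

```python
import numpy as np

def hist_equalize(mt):
    """Histogram-equalize an 8-bit motion template so that global
    lighting changes have less effect on the intensity distribution."""
    hist = np.bincount(mt.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # Standard equalization lookup table; the max() guards against
    # division by zero for a constant image
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255)
    return lut.clip(0, 255).astype(np.uint8)[mt]
```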

  19. Feature Extraction • 2 types of features investigated: DCT coefficients • commonly used as appearance-based features in visual speech recognition Zernike moments • a type of image moment • novel features for visual speech recognition

  20. Zernike Moments • Advantages of Zernike moments: • Selected as one of the robust shape descriptors in MPEG-7 (Jeannin 2000) • Rotation invariant • Robust, with good image representation properties (Teh 1988) • Computed by projecting the image function onto the orthogonal Zernike polynomials • Before computing ZM from an MT, the MT needs to be mapped to a unit circle

  21. Zernike Moments Mapping of the MT to a unit circle

  22. Zernike Moments Zernike moment of order n with repetition m:
Z_{nm} = \frac{n+1}{\pi} \iint_{x^2+y^2 \le 1} f(x,y)\, V_{nm}^{*}(x,y)\, dx\, dy
Zernike polynomial:
V_{nm}(\rho,\theta) = R_{nm}(\rho)\, e^{jm\theta}
Normalizing constant: (n+1)/\pi
Radial polynomials (n \ge 0, |m| \le n, n-|m| even):
R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} \frac{(-1)^s\,(n-s)!}{s!\left(\frac{n+|m|}{2}-s\right)!\left(\frac{n-|m|}{2}-s\right)!}\, \rho^{\,n-2s}

  23. Zernike Moments • Zernike moments are computed from the image function mapped onto the unit circle • The magnitude of ZM is rotation invariant: rotating the image by an angle \phi only shifts the phase, Z'_{nm} = Z_{nm} e^{-jm\phi}, so |Z'_{nm}| = |Z_{nm}|
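A direct discrete approximation of the integral above, as a sketch: it assumes a square, pre-cropped MT, approximates the integral by a pixel sum, and ignores the normalisation subtleties that discrete ZM implementations differ on; the exact (n, m) orders used in the thesis are not stated on the slides.

```python
import numpy as np
from math import factorial

def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho); requires |m| <= n, n - |m| even."""
    m = abs(m)
    out = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s) /
             (factorial(s) * factorial((n + m) // 2 - s)
                           * factorial((n - m) // 2 - s)))
        out += c * rho ** (n - 2 * s)
    return out

def zernike_magnitude(img, n, m):
    """|Z_nm| of a grayscale image whose pixels are mapped onto the unit circle."""
    h, w = img.shape
    y, x = np.mgrid[-1:1:h * 1j, -1:1:w * 1j]    # map the MT into [-1, 1]^2
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    f = img.astype(np.float64) * (rho <= 1.0)     # discard pixels outside the circle
    basis = radial_poly(n, m, rho) * np.exp(-1j * m * theta)  # conjugate of V_nm
    z = (n + 1) / np.pi * np.sum(f * basis) * (2.0 / h) * (2.0 / w)
    return np.abs(z)                              # magnitude: rotation invariant
```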

  24. Zernike Moments

  25. DCT Features • The 2-D DCT produces a compact energy representation of an image • It concentrates the energy in the top-left (low-frequency) corner of the coefficient matrix

  26. DCT Features For an M × N image f(x, y), the DCT coefficients are:
C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y) \cos\!\left[\frac{(2x+1)u\pi}{2M}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right]
where \alpha(u) = \sqrt{1/M} for u = 0 and \sqrt{2/M} otherwise (and \alpha(v) analogously with N). • DCT features have been shown to outperform DWT and PCA (Potamianos 2000) for visual speech recognition
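A brief SciPy sketch of extracting a compact DCT feature vector from an MT (keeping an 8 × 8 top-left block gives the 64 coefficients mentioned on slide 41; whether the thesis uses a square block or a zigzag scan is an assumption):

```python
import numpy as np
from scipy.fft import dctn

def dct_features(mt, k=8):
    """2-D DCT of a motion template; keep the k x k low-frequency
    block from the top-left corner, where the energy concentrates."""
    coeffs = dctn(mt.astype(np.float64), type=2, norm='ortho')
    return coeffs[:k, :k].ravel()   # k = 8 -> 64 features
```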

  27. Classification • Assigning a new feature vector to one of the pre-defined utterances • Two types of classifiers evaluated: • Generative model: hidden Markov models (HMM) • Discriminative classifier: support vector machines (SVM)

  28. SVM Classifier • Supervised classifiers trained using a learning algorithm from statistical learning theory • Successfully implemented for a range of image object recognition tasks • Can find the optimal hyperplane between classes in sparse, high-dimensional spaces with relatively little training data • SVMs with an RBF kernel are used in the experiments for classifying the motion features (Zernike moments and DCT coefficients)
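A short scikit-learn sketch of this setup (scikit-learn's SVC wraps LIBSVM, the library used in the experiments; the synthetic data and parameter grid here are placeholders, not thesis values):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(140, 64))      # stand-in for 64-dim MT feature vectors
y = rng.integers(0, 14, size=140)   # stand-in for 14 viseme labels

# RBF-kernel SVM with C and gamma chosen by 5-fold cross-validation,
# mirroring the parameter selection described in Experiment 1
grid = {'C': [1, 10, 100, 1000], 'gamma': [1e-3, 1e-2, 1e-1, 'scale']}
clf = GridSearchCV(SVC(kernel='rbf'), grid, cv=5)
clf.fit(X, y)
print(clf.best_params_, clf.best_score_)
```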

  29. HMM Classifier • Assumes that speech signals contain short-time segments that are stationary • Models these short periods where the signals are steady • Changes between segments are represented as state transitions in the HMM • Temporal variations within each segment are modelled statistically

  30. HMM Classifier • The motion features are assumed to be Gaussian distributed and modelled as continuous observation densities • Each phone is modelled as a left-right HMM with 3 states and a diagonal covariance matrix; this HMM structure has been demonstrated to be suitable for modelling English phonemes (Foo & Dong 2002) • The Baum-Welch algorithm is used during HMM training to re-estimate the HMM parameters

  31. HMM Classifier • Recognition phase: • Compute the likelihood of each HMM producing the test sample • The test sample is classified as the phoneme whose HMM produces the highest likelihood
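A hedged sketch of this train-and-score scheme using the hmmlearn package (a stand-in for the MATLAB HMM toolbox cited on slide 39; how the MT features are presented to the HMM as a sequence is not specified on the slides, and `seqs`/`test_seq` are hypothetical names):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def left_right_hmm(n_states=3):
    """3-state left-right HMM with diagonal covariances (slide 30).
    Fixing startprob_/transmat_ and updating only means ('m') and
    covariances ('c') preserves the left-right structure during
    Baum-Welch (EM) re-estimation."""
    hmm = GaussianHMM(n_components=n_states, covariance_type='diag',
                      init_params='mc', params='mc', n_iter=20)
    hmm.startprob_ = np.array([1.0, 0.0, 0.0])
    hmm.transmat_ = np.array([[0.5, 0.5, 0.0],
                              [0.0, 0.5, 0.5],
                              [0.0, 0.0, 1.0]])
    return hmm

# Training: one HMM per viseme, where seqs[v] is a list of (T_i, D)
# feature arrays for viseme v; fit() runs Baum-Welch (EM):
#   models = {v: left_right_hmm().fit(np.vstack(seqs[v]),
#                                     [len(s) for s in seqs[v]])
#             for v in seqs}
# Recognition: classify by the highest log-likelihood:
#   predicted = max(models, key=lambda v: models[v].score(test_seq))
```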

  32. Temporal Segmentation of Utterances Temporal segmentation of utterances is usually achieved using audio signals.

  33. Temporal Segmentation of Utterances The proposed method combines motion and mouth appearance information. Motion information: • 3-frame MTs are computed over the image sequence • The average energy of each 3-frame MT represents the magnitude of movement

  34. Visual Utterance Segmentation Mouth appearance information: • A kNN classifier (k = 3) is trained to recognize 2 classes: • mouth appearance while uttering a phoneme (speaking) • mouth appearance during silence • Trained using mouth images captured while the talker is speaking and while he/she remains silent [Figure: examples of 'silence' and 'speaking' mouth images]

  35. Utterance Segmentation Algorithm (reconstructed from the flowchart): starting from START, compute the next 3-frame MT and test whether mouth movement is present. If movement is present, the previous frames were silence, and the following frames are classified as speaking, mark the start of an utterance. If movement has stopped, the previous frames were speaking, and the following frames are classified as silence, mark the end of the utterance. Then loop back for the next 3-frame MT.
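A rough Python sketch of this loop (the energy threshold, the frame indexing, and using the kNN decision as a stand-in for the 'following frames' checks are all assumptions, not the thesis algorithm verbatim; `knn` is a scikit-learn KNeighborsClassifier with k = 3, already fitted on flattened mouth images labelled 1 = speaking, 0 = silence):

```python
import numpy as np

def mt3_energy(f0, f1, f2, threshold=15):
    """Average energy of a 3-frame MT: the magnitude of mouth movement."""
    d1 = np.abs(f1.astype(np.int16) - f0.astype(np.int16)) > threshold
    d2 = np.abs(f2.astype(np.int16) - f1.astype(np.int16)) > threshold
    return (0.5 * d1 + 1.0 * d2).mean()   # linear ramp over the two steps

def segment(frames, knn, energy_thresh=0.05):
    """Return (start, end) frame indices of detected utterances."""
    moving = [mt3_energy(*frames[t:t + 3]) > energy_thresh
              for t in range(len(frames) - 2)]
    speaking = knn.predict([f.ravel() for f in frames[:-2]])
    bounds, start = [], None
    for t in range(1, len(moving)):
        if moving[t] and not moving[t - 1] and speaking[t]:
            start = t                                 # silence -> movement
        elif start is not None and moving[t - 1] and not moving[t] \
                and not speaking[t]:
            bounds.append((start, t))                 # movement -> silence
            start = None
    return bounds
```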

  36. Experiments • Experiment 1: Compare the performance of Zernike moments and DCT features • Experiment 2: Compare the performance of HMM and SVM • Experiment 3: Evaluate the performance of the proposed temporal segmentation approach

  37. Vocabulary • Recognition units: visemes (the basic unit of facial movement accompanying the articulation of a phoneme) • The visemes defined in the MPEG-4 standard are used

  38. Experimental Setup Video recording and processing: • Recorded using a web camera in an office environment • Frontal view of the mouth of 10 speakers (5 male and 5 female), with a constant view angle • A total of 2800 utterances were recorded as AVI files at 320 × 240 resolution and a frame rate of 30 frames/sec

  39. Experimental Setup • One MT was generated from the grayscale images of each phoneme • Histogram equalization was applied to the images to reduce the effects of illumination variations • The images were analysed and processed using MATLAB 7 • The LIBSVM toolbox (Chang and Lin 2001) was used to create the SVM classifier, and the HMM toolbox for MATLAB (Murphy 1998) was used to design the HMM classifier

  40. MT of 14 visemes

  41. Experiments • Zernike moments and DCT coefficients are computed from the MTs • 64 Zernike moments are used to represent each MT; the same number of DCT coefficients is used to form the DCT feature vectors

  42. Experiment 1: Comparison of ZM and DCT Features • The features are classified using SVM with an RBF kernel • The SVM parameters are determined through 5-fold cross-validation on the training data • Leave-one-out method for testing • Average recognition rates: DCT 99%, ZM 97.4%

  43. Experiment 1: Comparison of ZM and DCT Features • Sensitivity to illumination variations: trained on images with the original lighting and tested with brightness reduced/increased by 30% • Average recognition rates: DCT = 100%, Zernike moments = 100%

  44. Experiment 1: Comparison of ZM and DCT Features • Recognition rates of ZM and DCT features under rotational changes

  45. Experiment 1: Comparison of ZM and DCT Features • Sensitivity analysis of ZM and DCT to image noise

  46. Experiment 2: Comparison of SVM and HMM Classifiers • A single-stream, left-right HMM with 3 states is used to model each viseme • SVM and HMM are trained and tested using the ZM and DCT features of Participant 1 (280 utterances were used in this experiment) • Leave-one-out method • Average recognition rates: HMM 95.0%, SVM 99.5%

  47. Experiment 3: Results for Temporal Segmentation of Utterances • 98.6% accuracy • 276 out of 280 phonemes were correctly segmented (4 errors)

  48. Discussion • The results demonstrate the efficacy of the proposed motion features in visual speech recognition • Both DCT and Zernike moments produce high accuracy in classifying the 14 visemes using SVM • The proposed technique is demonstrated to be invariant to global changes in illumination • Zernike moments are demonstrated to be invariant to rotational changes of up to 20 degrees, whereas DCT features are sensitive to such rotations

  49. Discussion • DCT features have better tolerance to image noise than ZM features • Possible reasons for SVM outperforming HMM in classifying the motion features: • The motion template (a spatio-temporal template, STT) eliminates the need for temporal modelling • The training dataset is not large • One possible reason for misclassification is occlusion of the articulators' movements • The accuracy is higher than the 88.5% reported by Foo et al. (2002) using static features on the same vocabulary

  50. Conclusions • This research evaluated a novel approach for visual speech recognition using motion templates; the proposed technique is demonstrated to be useful for English phoneme recognition • DCT features are found to be sensitive to rotational changes, whereas Zernike moments are rotation invariant • The motion segmentation technique eliminates the need for temporal modelling of the features for phoneme classification; hence SVM can be used to recognize the motion features, and it outperforms HMM • The efficacy of the proposed temporal segmentation approach using motion and appearance information is demonstrated
