Computer-based Lip-reading using Motion Templates Wai Chee Yau, PhD graduate, School of Electrical and Computer Engineering, RMIT University, waichee@ieee.org PhD supervisor: A. Prof. Dinesh K. Kumar
Outline • Overview of speech recognition • Non audio speech modalities • Motivation for visual speech recognition • Related work • Contributions of this research • Mouth Motion Segmentation • Feature Extraction • Classification • Temporal segmentation of utterances • Experiments • Discussion • Conclusions 16th Feb 2009
Overview of Speech Recognition Speech recognition: converting spoken words into computer inputs/commands Audio signals (voice) are the most commonly used input for speech recognition systems Audio speech recognition applications: • speech-to-text applications • voice dialing of mobile phones • call routing • aircraft control inside the pilot cockpit 16th Feb 2009
Overview of Speech Recognition Advantages: A natural and convenient communication method Suitable for disabled users who cannot use their hands to control a computer Useful in ‘hands-eyes-busy’ situations • Controlling the car radio or navigation system while driving • Controlling heavy machinery in factories Problems of audio speech recognition: Affected by environmental noise • Inside a moving vehicle, noisy factories, pilot cockpits Sensitive to speaking styles • Whispering vs. speaking loudly 16th Feb 2009
Overview of Speech Recognition Possible solutions: • using noise-robust techniques such as • microphone arrays (Brandstein and Ward, 2001; Sullivan and Stern, 1993) • noise adaptation algorithms (Gales and Young, 1992; Stern et al., 1996) • using non-audio speech modalities 16th Feb 2009
Non Audio Speech Modalities Visual: videos and images Infra-red camera mounted on a headset In-car control system • G. Potamianos, “Audio-visual speech processing: Progress and Challenges”, VisHCI, Canberra, Australia, Nov. 2006 16th Feb 2009
Non Audio Speech Modalities Muscle activity signals Control of a Mars Rover, NASA Ames Research Lab Lip-reading mobile phone, NTT DoCoMo • “Voiceless Recognition Without the Voice”, May 1, 2004 issue of CIO Magazine C. Jorgensen, D. D. Lee and S. Agabon, “Sub auditory recognition based on EMG/EPG signals”, 2003. 16th Feb 2009
Non Audio Speech Modalities Brain signals Facial EMG signals for English and German vowel recognition S.P. Arjunan, D.K. Kumar, W.C. Yau, H. Weghorn, “Unspoken Vowel Recognition Using Facial Electromyogram”, EMBC, New York, 2006. http://www.eng.nus.edu.sg/EResnews/0202/rd/rd_4.html 16th Feb 2009
Non Audio Speech Modalities Advantages of Voice-less Speech Recognition: The user can control a system without making a sound: • Uttering the PIN code of a security system • Defence or military applications • Control of computers/machines for disabled users with speech impairments When the audio signal is greatly affected by noise: • Control of the car radio in a vehicle 16th Feb 2009
Motivation for Using Visual Speech Recognition Advantages of the visual approach • Non-invasive • No sensors need to be placed on the user • Cameras are commonly available 16th Feb 2009
Motivation for Using Visual Speech Recognition Is he saying Ba? Ga? Da? http://www.media.uio.no/personer/arntm/McGurk_english.html 16th Feb 2009
Motivation for Visual Speech Recognition • McGurk effect (McGurk 1976): we combine what we ‘see’ with what we ‘hear’ when understanding speech The sound /ba/ + the lip movement /ga/ is perceived by 98% of us as /da/ • People with hearing impairment can lip-read by looking at the mouth of the speaker 16th Feb 2009
Block diagram of a lip-reading system 16th Feb 2009
Related Work Visual speech recognition techniques reported in the literature can be broadly divided into: • Appearance-based (Potamianos et al. 2004) • Uses image pixels in the region surrounding the mouth • Shape-based (Petajan 1984, Adjoudani et al. 1996) • Uses the shape information of the mouth/lips (e.g. height and width) • Motion-based (Mase & Pentland 1991) • Describes the mouth movements 16th Feb 2009
Related Work Few studies have focused on motion features. Human perceptual studies indicate that dynamic information is important for visual speech perception (Rosenblum & Saldaña, 1998) 16th Feb 2009
Contributions of this research This research: • proposes new motion features for computer-based lip reading using motion templates (MT) • investigates the use of Zernike moments to derive rotation-invariant features • compares the performance of hidden Markov models (HMM) and support vector machines (SVM) for classification of the motion features 16th Feb 2009
Mouth movement segmentation Motion templates (MT) are 2D grayscale images (Bobick et al. 2001) where the: • Intensity values indicate ‘when’ the movements occurred • Pixel locations indicate ‘where’ the movements happened Step 1 : Compute the difference of frames (DOF) Step 2 : Convert the DOFs into binary images Step 3 : Temporally integrate the binary images with a linear ramp of time 16th Feb 2009
Mouth movement segmentation • Removes the static elements & preserves the short-duration facial movements • Invariant, within limits, to skin colour • The intensity values of the MT are normalized to reduce the effect of changes in speaking speed • Histogram equalization is applied to the MT to minimize global changes in lighting conditions (a code sketch of these steps is given below) 16th Feb 2009
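A minimal sketch of the three MT-construction steps and the subsequent normalization/equalization, written in Python/OpenCV (the original work used MATLAB 7); the binarisation threshold and the frame-handling details are illustrative assumptions.

```python
import cv2
import numpy as np

def motion_template(frames, diff_threshold=25):
    """Build a motion template (MT) from a list of grayscale frames.

    Later movements receive higher intensities (linear ramp of time),
    so the MT encodes 'when' (intensity) and 'where' (pixel location)
    motion occurred.  diff_threshold is an assumed value.
    """
    n = len(frames)
    mt = np.zeros(frames[0].shape, dtype=np.float32)
    for t in range(1, n):
        # Step 1: difference of frames (DOF)
        dof = cv2.absdiff(frames[t], frames[t - 1])
        # Step 2: convert the DOF into a binary image
        _, binary = cv2.threshold(dof, diff_threshold, 1, cv2.THRESH_BINARY)
        # Step 3: temporal integration with a linear ramp of time
        mt[binary > 0] = t / float(n - 1)
    # Normalise intensities to reduce the effect of speaking-speed changes,
    # then apply histogram equalisation against global lighting changes
    mt = cv2.normalize(mt, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.equalizeHist(mt)
```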
Feature extraction • 2 types of features investigated: DCT coefficients • commonly used as appearance-based features in visual speech recognition Zernike moments • a type of image moment • novel features for visual speech recognition 16th Feb 2009
Zernike moments • Advantages of Zernike moments : • Selected as one of the robust feature descriptors in MPEG-7 (Jeannin 2000) • Rotation invariant • Robust and good image representation (Teh 1988) • Computed by projecting the image function onto the orthogonal Zernike polynomials • Before computing ZM from an MT, the MT needs to be mapped to a unit circle 16th Feb 2009
Zernike moments Mapping of MT to a unit circle 16th Feb 2009
Zernike moments Zernike moments: $Z_{nm} = \frac{n+1}{\pi} \sum_{x}\sum_{y} f(x,y)\, V_{nm}^{*}(\rho,\theta)$, for $x^{2}+y^{2} \le 1$ Zernike polynomial: $V_{nm}(\rho,\theta) = R_{nm}(\rho)\, e^{jm\theta}$ Normalizing constant: $\frac{n+1}{\pi}$ Radial polynomials: $R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} \frac{(-1)^{s}(n-s)!}{s!\left(\frac{n+|m|}{2}-s\right)!\left(\frac{n-|m|}{2}-s\right)!}\, \rho^{\,n-2s}$ 16th Feb 2009
Zernike moments • Zernike moments are computed from an image function defined over the unit circle • The magnitude of ZM is rotation invariant: rotating the image by an angle $\phi$ only changes the phase ($Z_{nm}' = Z_{nm} e^{-jm\phi}$), so $|Z_{nm}'| = |Z_{nm}|$ (see the sketch below) 16th Feb 2009
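A rough Python/NumPy sketch of how the Zernike moment magnitudes could be computed from an MT mapped onto the unit circle, following the standard formulas above; the original implementation was in MATLAB, so the coding details here are assumptions.

```python
import numpy as np
from math import factorial

def radial_poly(rho, n, m):
    """Zernike radial polynomial R_nm on an array of radii (n - |m| even, >= 0)."""
    m = abs(m)
    R = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s) /
             (factorial(s) * factorial((n + m) // 2 - s)
              * factorial((n - m) // 2 - s)))
        R += c * rho ** (n - 2 * s)
    return R

def zernike_magnitude(mt, n, m):
    """|Z_nm| of a motion template mapped onto the unit circle."""
    h, w = mt.shape
    y, x = np.mgrid[-1:1:h * 1j, -1:1:w * 1j]   # map the MT onto [-1, 1] x [-1, 1]
    rho = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0                          # pixels outside the unit circle are discarded
    V_conj = radial_poly(rho, n, m) * np.exp(-1j * m * theta)   # conjugate of V_nm
    # project the image onto V_nm; (2/h)*(2/w) is the pixel area element
    Z = (n + 1) / np.pi * np.sum(mt[inside] * V_conj[inside]) * (2.0 / h) * (2.0 / w)
    return np.abs(Z)                             # the magnitude is rotation invariant
```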
DCT features • The 2-D DCT produces a compact energy representation of an image • Concentrates the energy in the top-left (low-frequency) corner of the transformed image 16th Feb 2009
DCT features • For an M x N image f(x, y), the DCT coefficients are: $C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{M-1}\sum_{y=0}^{N-1} f(x,y) \cos\!\left[\frac{(2x+1)u\pi}{2M}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right]$, where $\alpha(0)=\sqrt{1/M}$ and $\alpha(u)=\sqrt{2/M}$ for $u>0$ (similarly for $\alpha(v)$ with $N$) • DCT features are shown to outperform DWT and PCA (Potamianos 2000) for visual speech recognition 16th Feb 2009
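As an illustration, a Python/SciPy sketch of extracting low-frequency 2-D DCT coefficients from an MT; keeping an 8x8 top-left block (64 coefficients, the number used later in the experiments) is an assumption, since the slides do not specify the coefficient selection scheme.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(mt, num_coeffs=64):
    """Keep the low-frequency (top-left) 2-D DCT coefficients of an MT."""
    # Separable 2-D DCT-II with orthonormal scaling
    coeffs = dct(dct(mt.astype(float), axis=0, norm='ortho'), axis=1, norm='ortho')
    # Most of the energy sits in the top-left corner, so take a square
    # block of low-frequency coefficients and flatten it (8 * 8 = 64)
    k = int(np.sqrt(num_coeffs))
    return coeffs[:k, :k].flatten()
```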
Classification • Assigning new feature vectors to one of the pre-defined utterances • Two types of classifiers evaluated : • Generative models : hidden Markov models (HMM) • Discriminative classifier : support vector machines (SVM) 16th Feb 2009
SVM classifier • Supervised classifiers trained using a learning algorithm from statistical learning theory • Successfully applied to various image object recognition tasks • Can find the optimal hyperplane between classes in sparse, high-dimensional spaces with relatively little training data • SVMs with an RBF kernel are used in the experiments for classifying the motion features (Zernike moments and DCT coefficients) 16th Feb 2009
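A sketch of this classification step using scikit-learn; the original experiments used LIBSVM under MATLAB, so the library, the parameter grid and the helper name train_svm are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm(features, labels):
    """Train an RBF-kernel SVM on motion-template features (ZM or DCT).

    C and gamma are selected by 5-fold cross-validation on the training
    data, mirroring the parameter search described in the experiments.
    """
    grid = GridSearchCV(
        SVC(kernel='rbf'),
        param_grid={'C': [1, 10, 100], 'gamma': ['scale', 0.01, 0.001]},
        cv=5,
    )
    grid.fit(np.asarray(features), np.asarray(labels))
    return grid.best_estimator_

# usage: clf = train_svm(train_vectors, viseme_labels); clf.predict(test_vectors)
```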
HMM classifier • assumes that speech signals consist of short time segments that are stationary • models these short periods, where the signals are steady, as states • the changes between segments are represented as state transitions in the HMM • the temporal variations within each segment are modelled statistically 16th Feb 2009
HMM Classifier • The motion features are assumed to be Gaussian distributed and are modelled as continuous observation densities • Each phone is modelled as a left-right HMM with 3 states and diagonal covariance matrices. This HMM structure has been demonstrated to be suitable for modelling English phonemes (Foo & Dong 2002) • The Baum-Welch algorithm is used during HMM training to re-estimate the HMM parameters 16th Feb 2009
HMM Classifier • Recognition phase : • compute the likelihood of the test sample under each HMM • the test sample is classified as the phoneme whose HMM produces the highest likelihood 16th Feb 2009
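A sketch of the per-viseme HMM training and maximum-likelihood classification using the hmmlearn package; the original work used the HMM toolbox for MATLAB (Murphy 1998), so everything beyond the 3-state left-right structure with diagonal covariances is an assumption.

```python
import numpy as np
from hmmlearn import hmm

def train_left_right_hmm(sequences):
    """Train a 3-state left-right Gaussian HMM on feature sequences of one viseme."""
    model = hmm.GaussianHMM(n_components=3, covariance_type='diag',
                            n_iter=20, init_params='mc', params='mct')
    # Left-right structure: start in state 0, only self-loops and forward moves.
    # Baum-Welch re-estimation preserves the zero entries, so the structure is kept.
    model.startprob_ = np.array([1.0, 0.0, 0.0])
    model.transmat_ = np.array([[0.5, 0.5, 0.0],
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])
    X = np.concatenate(sequences)
    model.fit(X, lengths=[len(s) for s in sequences])
    return model

def classify(models, test_sequence):
    """Assign the test sample to the viseme whose HMM gives the highest log-likelihood."""
    scores = {name: m.score(test_sequence) for name, m in models.items()}
    return max(scores, key=scores.get)
```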
Temporal Segmentation of Utterances Temporal segmentation of utterances is usually achieved using audio signals.
Temporal Segmentation of Utterances This method combines motion and mouth appearance information Motion information : • 3-frame MTs are computed for a sequence of images • The average energy of a 3-frame MT represents the magnitude of movement
Visual Utterance Segmentation Mouth appearance information : • A kNN classifier (k=3) is trained to recognize 2 classes: • the mouth appearance while uttering a phoneme (speaking) • the mouth appearance during silence • Trained using mouth images of the talker while speaking and while maintaining silence. Examples of ‘silence’ images Examples of ‘speaking’ images
Utterance Segmentation Algorithm (flowchart) For each new 3-frame MT, check whether mouth movement is present. If movement is present, the previous frames were silence and the following frames are speaking, mark the start of an utterance. If no movement is present, the previous frames were speaking and the following frames are silence, mark the end of the utterance. Otherwise, continue with the next 3-frame MT.
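A simplified state-machine sketch of this segmentation loop in Python; has_motion (average 3-frame MT energy against a threshold) and knn_is_speaking (the k=3 appearance classifier) are hypothetical helpers, and the flowchart's check of the preceding/following frames is collapsed into a single per-frame decision.

```python
def segment_utterances(frames, has_motion, knn_is_speaking):
    """Return (start, end) frame indices of detected utterances.

    has_motion(frames, t): True if the 3-frame MT ending at frame t has enough energy.
    knn_is_speaking(frame): True if the kNN classifier labels the mouth as 'speaking'.
    """
    utterances, start, speaking = [], None, False
    for t in range(2, len(frames)):
        moving = has_motion(frames, t)
        if not speaking and moving and knn_is_speaking(frames[t]):
            start, speaking = t - 2, True        # start of utterance
        elif speaking and not moving and not knn_is_speaking(frames[t]):
            utterances.append((start, t))        # end of utterance
            speaking = False
    return utterances
```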
Experiments • Experiment 1 : Compare the performance of Zernike moments and DCT features • Experiment 2 : Compare the performance of HMM and SVM • Experiment 3 : Evaluate the performance of the proposed temporal segmentation approach 16th Feb 2009
Vocabulary • Recognition units : visemes (the basic unit of facial movement during the articulation of a phoneme) • The visemes defined in the MPEG-4 standard are used 16th Feb 2009
Experimental Setup Video recording and processing: • Recorded using a web camera in an office environment • Frontal view of the mouth of 10 speakers (5 males and 5 females). Constant view angle. • A total of 2800 utterances were recorded as AVI files of 320x240 pixels. Frame rate : 30 frames/sec 16th Feb 2009
Experimental Setup • 1 MT was generated from grayscale images of each phoneme • Histogram equalization was applied on the images to reduce the effects of illumination variations • The images were analysed and processed using MATLAB 7 • LIBSVM toolbox (Chang and Lin 2001) was used to create the SVM classifier and HMM toolbox for Matlab (Murphy 1998) was used to design the HMM classifier. 16th Feb 2009
MT of 14 visemes 16th Feb 2009
Experiments • Zernike moments and DCT coefficients are computed from the MTs • 64 Zernike moments are used to represent each MT. The same number of DCT coefficients is used to form the feature vectors 16th Feb 2009
Experiment 1 : Comparison of ZM and DCT features • The features are classified using an SVM with an RBF kernel. • The SVM parameters are determined through 5-fold cross validation on the training data • Leave-one-out method for testing Average recognition rates DCT : 99% ZM : 97.4% 16th Feb 2009
Experiment 1 : Comparison of ZM and DCT features • Sensitivity to illumination variations • Trained with images under the original lighting and tested with brightness reduced/increased by 30% • Average recognition rates • DCT = 100% • Zernike moments = 100% 16th Feb 2009
Experiment 1 : Comparison of ZM and DCT features • Recognition rates of ZM and DCT features to rotational changes 16th Feb 2009
Experiment 1 : Comparison of ZM and DCT features • Sensitivity analysis of ZM and DCT to image noise 16th Feb 2009
Experiment 2 : Comparison of SVM and HMM classifier • A single-stream, left-right HMM with 3 states is used to model each viseme • SVM and HMM trained and tested using the ZM and DCT features of Participant 1 (280 utterances were used in this experiment) • Leave-one-out method • Average recognition rates: • HMM : 95.0% • SVM : 99.5% 16th Feb 2009
Experiment 3 : Results for Temporal Segmentation of Utterances • 98.6% accuracy • 276 out of a total of 280 phonemes were correctly segmented (4 errors)
Discussion • The results demonstrate the efficacy of the proposed motion features in visual speech recognition • DCT and Zernike moments produce high accuracy in classifying 14 visemes using SVM. • The proposed technique is demonstrated to be invariant to global changes in illumination • Zernike moments are demonstrated to be invariant to rotational changes up to 20 degrees whereas DCT is sensitive to such rotation. 16th Feb 2009
Discussion • DCT features have better tolerance towards image noise than ZM features • Possible reasons for SVM outperforming HMM in classifying the motion features : • the motion template (spatio-temporal template, STT) eliminates the need for temporal modelling • the training dataset is not large • One possible reason for misclassification is the occlusion of the articulators’ movements. • The accuracy is higher than the results reported by Foo et al. (2002) (88.5%) using static features tested on the same vocabulary. 16th Feb 2009
Conclusions • This research evaluated a novel approach for visual speech recognition using motion templates. The proposed technique is demonstrated to be useful for English phoneme recognition • DCT features are found to be sensitive to rotational changes whereas Zernike moments are rotation invariant • The motion segmentation technique used in the proposed approach eliminates the need for temporal modelling of the features for phoneme classification. Hence, SVM can be used to recognize the motion features, and it outperforms HMM • The efficacy of the proposed temporal segmentation approach using motion and appearance information is demonstrated. 16th Feb 2009