An Automatic Lip-reading Method Based on Polynomial Fitting

Meng LI. Supervisor: Dr. Yiu-ming CHEUNG. Department of Computer Science, Hong Kong Baptist University.

Presentation Transcript


  1. An Automatic Lip-reading Method Based on Polynomial Fitting. Meng LI. Supervisor: Dr. Yiu-ming CHEUNG. Department of Computer Science, Hong Kong Baptist University.

  2. Content: Introduction; Lip segmentation; Visual speech recognition; Experiment; Conclusion and future work.

  3. Introduction Speech perception is multimodal: it involves information from at least two sensory modalities. The audio channel and the video channel both contribute to perception.

  4. Introduction Silent environment: visual only 73%, audio only 91%, visual-audio 97%. Noisy environment: visual only 73%, audio only 47%, visual-audio 87%.

  5. Introduction The hottest research direction in lip-reading is visual-speech recognition (with audio information, or visual only). [Chart of research directions. Categories: speech recognition in noisy environment, visual-only speech recognition, identification, others. Values shown: 63%, 31%, 1%, 5%.]

  6. Introduction The basic structure of a typical AVSR (Automatic Visual-Speech Recognition) system. Stages: preprocessing, feature extraction, AV fusion. Audio: acoustic processing, then audio feature extraction. Video: lip capturing, then visual feature extraction. Both feature streams feed fusion and recognition.

  7. Introduction Visual feature extraction approaches: • Pixel based: uses all pixels in the lip region as the feature. • Motion based: captures the motion of all or part of the lip during pronunciation. • Shape based: extracts the lip boundary as the feature. • Model based: assumes a lip model, matches the lip shape to the model, and uses a few parameters to represent the lip shape.

  9. Introduction Advantages: • All information is utilized. • Highest recognition rate under ideal illumination conditions. Disadvantages: • Sensitive to the illumination condition. • Sensitive to rotation and scale transforms. • Speaker dependence. • High dimension of feature data.

  10. Introduction Visual feature extraction approaches: • Pixel based: uses all pixels in the lip region as the feature. • Motion based: captures the motion of all or part of the lip during pronunciation. • Shape based: extracts the lip boundary as the feature. • Model based: assumes a lip model, matches the lip shape to the model, and uses a few parameters to represent the lip shape.

  11. Introduction Advantage: • Represents the motion of the lip directly and completely. Disadvantages: • Sensitive to the illumination condition. • Sensitive to rotation and scale transforms. • Speaker dependence. • High dimension of feature data.

  12. Introduction Visual feature extraction approaches: • Pixel based: uses all pixels in the lip region as the feature. • Motion based: captures the motion of all or part of the lip during pronunciation. • Shape based: extracts the lip boundary as the feature. • Model based: assumes a lip model, matches the lip shape to the model, and uses a few parameters to represent the lip shape.

  13. Introduction Tip: so far, model-based feature extraction is the most common method. Advantages: • Low dimension of feature data. • Robust to rotation and scale transforms. • If the model is appropriate, speaker independence can be achieved. • Convenient to match with classical methods (e.g. HMM). Disadvantage: • High computational complexity.

  14. Introduction

  15. Introduction The rest of this presentation. Lip segmentation in gray-level images • Based on the gray-level image. • Locates the minimum enclosing rectangle of the mouth. • High processing speed. • Low computational complexity.

  16. Introduction The rest of this presentation. Lip segmentation in colour space • Based on the RGB, HSV and L*a*b* colour spaces. • Can extract the outer boundary of the lip. • High accuracy. • High computational complexity.

  17. Introduction The rest of this presentation. Visual-only speech recognition • Based on polynomial fitting. • High processing speed; suitable for real-time systems. • Performs well with a limited training set.

  18. Lip segmentation (1)

  19. Lip segmentation (1)

  20. Lip segmentation (1)

  21. Lip segmentation (2) First, we transform the source image from RGB color space into L*a*b* space. In the a* channel, negative values indicate green while positive values indicate magenta, so this channel helps to separate the lip region from the skin.
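The colour-space step on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: it assumes an sRGB input scaled to [0, 1] and the D65 white point, and the function name `srgb_to_lab` is hypothetical.

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert an (H, W, 3) sRGB image with values in [0, 1] to CIE L*a*b* (D65)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # Undo the sRGB gamma to obtain linear RGB.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> CIE XYZ (sRGB primaries, D65 white point).
    m = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = linear @ m.T
    # Normalise by the D65 reference white.
    xyz = xyz / np.array([0.95047, 1.0, 1.08883])
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])   # green (negative) to magenta (positive)
    b = 200.0 * (f[..., 1] - f[..., 2])   # blue (negative) to yellow (positive)
    return np.stack([L, a, b], axis=-1)

# A reddish "lip" pixel next to a paler "skin" pixel (made-up colours).
patch = np.array([[[0.80, 0.30, 0.35], [0.90, 0.70, 0.60]]])
a_channel = srgb_to_lab(patch)[..., 1]
```

In this toy patch the reddish pixel gets the larger a* value, which is the property the slide relies on to highlight the lip against the skin.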

  22. Lip segmentation (2)

  23. Lip segmentation (2) In the source image, we take the pixels located in the non-black area and transform them into HSV color space. Then we get a vector as follows: We assume the data follow a normal distribution, and estimate the mean and variance via maximum likelihood (ML):
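The slide's equations did not survive in this transcript, so the following is only a sketch of a standard ML Gaussian fit under stated assumptions: each sample is taken to be a full (H, S, V) vector, the "non-black area" is supplied as a boolean mask, and a full covariance matrix is estimated. The function names `hsv_samples` and `fit_gaussian_ml` are hypothetical.

```python
import colorsys
import numpy as np

def hsv_samples(rgb_image, mask):
    """Return an (N, 3) array of HSV vectors for the pixels where mask is True."""
    pixels = np.asarray(rgb_image, dtype=np.float64)[np.asarray(mask, dtype=bool)]
    return np.array([colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels])

def fit_gaussian_ml(samples):
    """Maximum-likelihood mean and covariance (normalised by N, not N - 1)."""
    samples = np.asarray(samples, dtype=np.float64)
    mean = samples.mean(axis=0)
    centered = samples - mean
    cov = centered.T @ centered / len(samples)
    return mean, cov

# Toy image: one red pixel and one black pixel; only the non-black pixel is sampled.
image = np.array([[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]])
non_black = image.sum(axis=-1) > 0
samples = hsv_samples(image, non_black)
```

Dividing by N rather than N - 1 is what distinguishes the ML estimate from the unbiased sample covariance; with many lip pixels the difference is negligible.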

  24. Lip segmentation (2) We transform the source image into HSV color space and get the vector as follows: Then we get a new image: A lighter pixel means it is more similar to the lip region in color space.
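The formula for the new image is missing from this transcript. One plausible way to build such a similarity image, shown here purely as an assumption, is to score each HSV pixel by an unnormalised Gaussian likelihood around an estimated lip colour. The function name `similarity_map` is hypothetical, and the circular nature of hue is ignored for brevity.

```python
import numpy as np

def similarity_map(hsv_image, mean, cov):
    """Per-pixel similarity in (0, 1]: exp(-0.5 * squared Mahalanobis distance)."""
    hsv = np.asarray(hsv_image, dtype=np.float64)
    diff = hsv - np.asarray(mean, dtype=np.float64)
    inv_cov = np.linalg.inv(np.asarray(cov, dtype=np.float64))
    # Squared Mahalanobis distance for every pixel at once.
    d2 = np.einsum("...i,ij,...j->...", diff, inv_cov, diff)
    return np.exp(-0.5 * d2)

# Toy example: the first pixel equals the mean, the second is one standard
# deviation away along the hue axis.
mean = np.array([0.5, 0.5, 0.5])
cov = 0.01 * np.eye(3)
hsv = np.array([[[0.5, 0.5, 0.5], [0.6, 0.5, 0.5]]])
sim = similarity_map(hsv, mean, cov)
```

Pixels close to the estimated lip colour map to values near 1 (lighter), and values fall toward 0 as the colour moves away, which matches the slide's description of the new image.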

  25. Lip segmentation (2) We select the block that contains the “gravity center” as the lip region.
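A minimal sketch of this selection step, assuming the similarity image has already been thresholded into a binary mask, that the "gravity center" is the centroid of the foreground pixels, and that blocks are 4-connected regions. None of these details are given on the slide, and `block_at_gravity_center` is a hypothetical name.

```python
from collections import deque
import numpy as np

def block_at_gravity_center(binary):
    """Return a mask of the 4-connected block containing the foreground centroid."""
    binary = np.asarray(binary, dtype=bool)
    out = np.zeros_like(binary)
    if not binary.any():
        return out
    rows, cols = np.nonzero(binary)
    seed = (int(round(rows.mean())), int(round(cols.mean())))
    if not binary[seed]:
        return out  # the centroid lies on background, so no block contains it
    # Breadth-first flood fill starting from the centroid pixel.
    queue = deque([seed])
    out[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < binary.shape[0] and 0 <= nc < binary.shape[1]
                    and binary[nr, nc] and not out[nr, nc]):
                out[nr, nc] = True
                queue.append((nr, nc))
    return out

# Toy mask: a 3x4 "lip" block plus one isolated noise pixel in the corner.
mask = np.zeros((5, 7), dtype=bool)
mask[1:4, 1:5] = True
mask[0, 6] = True
lip = block_at_gravity_center(mask)
```

The isolated noise pixel shifts the centroid slightly but does not belong to the block that contains it, so it is discarded.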

  26. Visual speech recognition

  27. Visual speech recognition For each utterance, we get two curves corresponding to the change in lip width and lip height over time, respectively. We employ least squares (LSE) to construct two polynomials that fit the two curves.
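The fitting step can be sketched with NumPy's least-squares polynomial fit. This assumes the per-frame widths and heights are already measured, that time is normalised to [0, 1] over the utterance, and that the n = 3 mentioned on the next slide is the polynomial degree. The function name `fit_lip_curves` is hypothetical.

```python
import numpy as np

def fit_lip_curves(widths, heights, degree=3):
    """Least-squares polynomial fits of the width and height curves of one utterance."""
    widths = np.asarray(widths, dtype=np.float64)
    heights = np.asarray(heights, dtype=np.float64)
    # Normalised time axis, so utterances of different lengths are comparable.
    t = np.linspace(0.0, 1.0, len(widths))
    # np.polyfit returns coefficients from the highest power down to the constant.
    return np.polyfit(t, widths, degree), np.polyfit(t, heights, degree)

# Synthetic 30-frame utterance whose curves are exact cubics.
t = np.linspace(0.0, 1.0, 30)
w_coeffs, h_coeffs = fit_lip_curves(2 * t**3 - 3 * t**2 + t + 5, -t**3 + 4)
```

Each utterance is then summarised by two short coefficient vectors instead of two per-frame sequences, which is what keeps the later feature vector low-dimensional.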

  28. Visual speech recognition In this work, we set n = 3. The maximum, the minimum and the rightmost point are recorded as the feature vector. Each utterance is assigned a label “j”, and we use the following equations to train: We use the following equations to test (F is the input feature vector, and T is the trained template feature vector):
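The training and testing equations are missing from this transcript, so the sketch below substitutes a simple stand-in and should not be read as the presenter's method. It assumes the maximum and minimum are the local extrema of the fitted cubic, that a template T is the mean feature vector of a label, and that testing picks the template nearest to F in Euclidean distance. All function names are hypothetical.

```python
import numpy as np

def cubic_features(coeffs, t_end=1.0):
    """Feature vector [t_max, y_max, t_min, y_min, t_end, y_end] of a fitted cubic."""
    p = np.poly1d(coeffs)
    crit = p.deriv().roots
    crit = np.sort(crit[np.isreal(crit)].real)
    curv = p.deriv(2)(crit) if len(crit) else np.array([])
    if not ((curv < 0).any() and (curv > 0).any()):
        raise ValueError("cubic has no distinct local maximum and minimum")
    t_max, t_min = crit[curv < 0][0], crit[curv > 0][0]
    return np.array([t_max, p(t_max), t_min, p(t_min), t_end, p(t_end)])

def train_templates(features, labels):
    """Assumed training rule: one template per label, the mean of its feature vectors."""
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    return {j: features[labels == j].mean(axis=0) for j in np.unique(labels)}

def classify(F, templates):
    """Assumed test rule: label of the template T nearest to F (Euclidean distance)."""
    F = np.asarray(F, dtype=np.float64)
    return min(templates, key=lambda j: np.linalg.norm(F - templates[j]))

# 2t^3 - 3t^2 + 1 has a local maximum at t = 0 and a local minimum at t = 1.
feats = cubic_features([2.0, -3.0, 0.0, 1.0])
templates = train_templates([[0, 0], [0, 2], [10, 10], [10, 12]], [0, 0, 1, 1])
```

In practice the features of the width curve and the height curve would presumably be concatenated into one vector per utterance before training and testing.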

  29. Experiment The illumination source is an 18 W fluorescent lamp, the camera resolution is 320×240 at 30 FPS, and the entire environment is shown below. Our task is to recognize 10 isolated digits (0 to 9) in Mandarin Chinese. Five speakers (4 males and 1 female) took part in the experiment. For each digit, each speaker was asked to repeat it 10 times to train the system and 50 times to test.

  30. Experiment The experimental results are shown below:

  31. Experiment Comparison with some existing approaches that also use the width and height of the lip as the visual feature: 1, 2 and 3: S. L. Wang, W. H. Lau, A. W. C. Liew, and S. H. Leung. Automatic lipreading with limited training data. In Proc. ICPR 2006, pp. 881-884, 2006. 4: A. R. Baig, R. Seguier, and G. Vaucher. Image sequence analysis using a spatio-temporal coding for automatic lipreading. In Proc. ICIAP 1999, pp. 544-549, 1999.

  32. Experiment

  33. Conclusion & Future work In this paper, we have proposed a new approach to automatic lip reading based on polynomial fitting. The feature vector of our approach has a low dimension, and the approach needs only a small training data set. Experiments have shown promising results for the proposed approach in comparison with existing methods. However, for more difficult tasks, e.g. recognizing words or sentences, an appropriate model is required. This is the focus of the next stage of research.

  34. Thank you! 31-08-2009
