A Temporal Network of Support Vector Machines for the Recognition of Visual Speech
Mihaela Gordan*, Constantine Kotropoulos**, Ioannis Pitas**
*Faculty of Electronics and Telecommunications, Technical University of Cluj-Napoca, 15 C. Daicoviciu, 3400 Cluj-Napoca, Romania
**Artificial Intelligence and Information Analysis Laboratory, Department of Informatics, Aristotle University of Thessaloniki, Box 451, GR-54006 Thessaloniki, Greece
This work was supported by the European Union Research Training Network "Multi-modal Human-Computer Interaction" (HPRN-CT-2000-00111).
Department of Informatics, Aristotle University of Thessaloniki
Brief Overview
• Visual speech recognition (lipreading): an important component of audiovisual speech recognition systems and an emerging research field.
• Support vector machines (SVMs): powerful classifiers for various visual classification tasks (face recognition, medical image processing, object tracking).
• Goal of this work: to examine the suitability of SVMs for visual speech recognition by developing an SVM-based visual speech recognition system.
• In brief: we use SVMs for viseme recognition and integrate them as nodes in a Viterbi decoding lattice.
• The good results (a slightly higher word recognition rate (WRR) with very simple input features, and the possibility of easy generalization to larger-vocabulary tasks) encourage the continuation of this research.
Contents
1. State of the art & research trends
2. Principles of the proposed visual speech recognition approach
3. SVMs and their use for mouth shape recognition
4. Modeling the temporal dynamics of visual speech
5. Block diagram of the proposed visual speech recognition system
6. Experimental results
7. Conclusions
1. State of the art & research trends
• Visual speech recognition = recognizing the spoken words from visual examination of the speaker's face only, mainly the mouth area.
• State of the art: many methods have been reported, differing widely with respect to:
  • the feature types (lip contour coordinates, GLDP, gray levels of the mouth image);
  • the classifier used (TDNN, HMM);
  • the class definition.
• Active research trends in the area:
  • find the most suitable features and classification techniques for efficient, speaker-independent discrimination between different mouth shapes;
  • reduce the required processing of the mouth image to increase speed;
  • find solutions that facilitate easy integration of the audio and visual recognizers.
• Use of SVMs in speech recognition: recently employed in audio speech recognition with very good results; no attempts so far in visual speech recognition.
2. Principles of the proposed visual speech recognition approach - I
[Figure: example mouth shapes for the visemes “o” and “f”.]
• Visemes = the basic units of visual speech, i.e. the basic shapes of the mouth during speech production.
• Discrimination between visemes is therefore a pattern recognition problem:
  • Feature vector = a representation of the mouth image (e.g. at pixel level: the gray levels of the pixels in the mouth image, scanned in row order);
  • Pattern classes = the different visemes (mouth shapes) that occur during the pronunciation of the words in the dictionary.
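As a minimal sketch of the pixel-level feature vector described on this slide, the snippet below flattens a small gray-level image row by row. The function name and the toy 2x3 image are illustrative assumptions, not taken from the original system.

```python
def image_to_feature_vector(image):
    """Flatten a 2-D gray-level image (a list of rows) into one vector, row by row."""
    return [pixel for row in image for pixel in row]


# Hypothetical 2x3 "mouth image" with gray levels in 0..255 (made-up values).
mouth_image = [
    [12, 40, 13],
    [200, 180, 190],
]

features = image_to_feature_vector(mouth_image)
print(features)  # [12, 40, 13, 200, 180, 190]
```

The dimension of the feature vector equals the number of pixels in the mouth image, which is why a classifier that copes with high-dimensional inputs and few training examples is attractive here.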
2. Principles of the proposed visual speech recognition approach - II
• The proposed strategy, for a given visual speech recognition task (i.e. a given dictionary of words):
  • Find the phonetic description of each word;
  • Derive the viseme-to-phoneme mapping for the application. The mapping is one-to-many, because non-visible parts of the vocal tract are involved in speech production, and it depends on the nationality of the speaker; no universal viseme-to-phoneme mapping is currently available;
  • Use the phonetic word descriptions and the viseme-to-phoneme mapping to derive the visemic word descriptions (visemic models = sequences of mouth shapes that could produce the phonetic realization of the word).
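The derivation of visemic models from a phonetic description can be sketched as follows. The mapping table and the viseme labels are hypothetical and chosen only for illustration; the actual mapping used in the talk is not reproduced here. The sketch assumes each phoneme comes with a list of candidate visemes and enumerates every resulting viseme sequence.

```python
from itertools import product

# Hypothetical table, for illustration only: for each phoneme, the visemes
# (mouth shapes) assumed to be able to produce it.
PHONEME_TO_VISEMES = {
    "w": ["w", "ao"],
    "ah": ["ah"],
    "n": ["n"],
}


def visemic_models(phonemes, phoneme_to_visemes):
    """Return every viseme sequence that could realize the phoneme sequence."""
    candidates = [phoneme_to_visemes[p] for p in phonemes]
    return [list(sequence) for sequence in product(*candidates)]


models = visemic_models(["w", "ah", "n"], PHONEME_TO_VISEMES)
print(models)  # [['w', 'ah', 'n'], ['ao', 'ah', 'n']]
```

With such a table, a single phonetic description can yield several visemic models per word, each of which would get its own temporal model.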
2. Principles of the proposed visual speech recognition approach - III
[Tables: phonetic and visemic word description models; viseme-to-phoneme mapping.]
3. SVMs and their use for mouth shape recognition - I
• SVMs = statistical learning classifiers based on the optimal hyperplane algorithm:
  • they minimize a bound on the empirical error and the complexity of the classifier;
  • they are capable of learning in sparse, high-dimensional spaces with few training examples.
• Classical SVMs solve 2-class pattern recognition problems:
  • $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ = the training examples;
  • $\mathbf{x}_i \in \mathbb{R}^M$ = an M-dimensional pattern;
  • $y_i \in \{-1, +1\}$ indicates whether example i is a negative or a positive example.
• Linear SVMs: the data to be classified are separable in their original domain.
3. SVMs and their use for mouth shape recognition - II
• Nonlinear SVMs: the data to be classified are not separable in their original domain.
• We project the data into a higher-dimensional Hilbert space $\mathcal{H}$, where the data are linearly separable, via the nonlinear mapping $\phi: \mathbb{R}^M \to \mathcal{H}$,
• and express the dot product of the mapped data by a kernel function: $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$.
• The decision function of the SVM classifier is $f(\mathbf{x}) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b \right)$, where:
  • $\alpha_i$ = the non-negative Lagrange multipliers associated with the QP that maximizes the distance between the classes and the separating hyperplane;
  • $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \phi(\mathbf{x}_i)$ and $b$ = the hyperplane's parameters.
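To make the kernel decision function concrete, here is a minimal sketch that evaluates the standard form, a weighted sum of kernel values plus a bias, before the sign is taken. The polynomial kernel is written as (x·z + coef0)^degree, which is one common parameterization and an assumption here; the support vectors, multipliers and bias are made-up toy values, not trained parameters.

```python
def poly_kernel(x, z, degree=3, coef0=1.0):
    """Polynomial kernel (x . z + coef0) ** degree; this parameterization is assumed."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree


def decision_value(x, support_vectors, labels, alphas, bias, kernel=poly_kernel):
    """Real-valued SVM output: sum_i alpha_i * y_i * K(x_i, x) + b."""
    weighted = (a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return sum(weighted) + bias


# Toy parameters, for illustration only.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
labels = [+1, -1]
alphas = [0.5, 0.5]
bias = 0.0

score = decision_value([1.0, 0.0], support_vectors, labels, alphas, bias)
predicted_class = 1 if score >= 0 else -1
print(score, predicted_class)  # 3.5 1
```

Keeping the real-valued score, rather than only its sign, is what allows the output to be read as a degree of confidence later on.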
3. SVMs and their use for mouth shape recognition - III
• The real-valued output function of the SVM gives the degree of confidence in the class assignment.
• An SVM is a binary classifier, so one SVM must be trained for each mouth shape (viseme).
• The features used: the gray levels of the pixels in the mouth image, scanned in row order.
• The set of training patterns is common to all SVMs; only the labels assigned to each training pattern differ. Only unambiguous positive and negative examples are used.
• Training patterns (mouth images) are preprocessed for normalization with respect to scale, translation and rotation.
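The idea of one shared training set with per-SVM labels can be sketched as below. The frame annotations are hypothetical, and the sketch does not model the removal of ambiguous examples mentioned on the slide.

```python
def one_vs_rest_labels(viseme_labels, target_viseme):
    """Label each training pattern +1 if it shows target_viseme, else -1."""
    return [1 if label == target_viseme else -1 for label in viseme_labels]


# Hypothetical viseme annotation of four training images (shared by all SVMs).
frame_visemes = ["w", "ah", "ah", "n"]

labels_per_svm = {
    viseme: one_vs_rest_labels(frame_visemes, viseme)
    for viseme in sorted(set(frame_visemes))
}
print(labels_per_svm["ah"])  # [-1, 1, 1, -1]
```

Each entry of `labels_per_svm` would be paired with the same feature vectors to train the SVM for that viseme.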
4. Modeling the temporal dynamics of visual speech - I
• Symbolic visemic description of a word = a left-to-right (L-R) sequence of visemes. It carries no information about the relative duration of each viseme in the word realization, which is strongly person-dependent.
• Given:
  • the symbolic visemic description of a word, and
  • the total number of frames in the word pronunciation,
  we build the word model in the temporal domain by allowing any non-zero duration for each viseme.
• The result is a temporal network of models for each symbolic visemic description, represented as a Viterbi lattice.
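The phrase "any non-zero duration of each viseme" can be made concrete by enumerating all ways to split T frames among L visemes in left-to-right order. The sketch below does this by choosing L-1 cut points among the T-1 gaps between frames; the function name is an illustrative assumption.

```python
from itertools import combinations


def duration_assignments(num_visemes, num_frames):
    """Yield every tuple of positive viseme durations that sums to num_frames."""
    for cuts in combinations(range(1, num_frames), num_visemes - 1):
        bounds = (0,) + cuts + (num_frames,)
        yield tuple(end - start for start, end in zip(bounds, bounds[1:]))


# A 3-viseme word model spoken over T = 5 frames.
assignments = list(duration_assignments(3, 5))
print(len(assignments))  # 6
print(assignments[0])    # (1, 1, 3)
```

Each assignment corresponds to one IN-to-OUT path through the lattice; a Viterbi search finds the best of them without enumerating them all.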
4. Modeling the temporal dynamics of visual speech - II
[Figure: Viterbi lattice d for the visemic word model w_d, T = 5, running from state IN to state OUT, with node k, node k+1 and sub-path i marked.]
4. Modeling the temporal dynamics of visual speech - III
• Node k = the measure of confidence in the realization of the viseme o_k = “ah” at timeframe t_k = 3, i.e. the real-valued output of the SVM trained to recognize the viseme o_k.
• Sub-path i = the transition probability from the state that generates o_k = “ah” at timeframe t_k = 3 to the state that generates o_{k+1} = “n” at timeframe t_{k+1} = 4. We assume equal transition probabilities.
• Path l = any connected path between the states IN and OUT in the Viterbi lattice.
• Confidence in path l of the Viterbi lattice d: (equation not preserved in this transcript)
• Plausibility of producing the word model w_d: (equation not preserved in this transcript)
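The search over lattice paths can be sketched with a small dynamic program. Because the path-confidence and plausibility formulas are not available here, the sketch makes two explicit assumptions: the confidence of a path is the sum of its node scores, and equal transition probabilities contribute nothing to the comparison. The score matrix is made up; `scores[t][j]` stands for the SVM output for the j-th viseme of the word model at frame t.

```python
def best_path_score(scores):
    """Best IN-to-OUT path score in a left-to-right lattice (assumed: sum of node scores)."""
    num_visemes = len(scores[0])
    neg_inf = float("-inf")
    # best[j]: best score of a partial path ending in viseme j at the current frame.
    best = [neg_inf] * num_visemes
    best[0] = scores[0][0]  # every path starts in the first viseme
    for frame_scores in scores[1:]:
        new_best = [neg_inf] * num_visemes
        for j in range(num_visemes):
            stay = best[j]
            advance = best[j - 1] if j > 0 else neg_inf
            previous = max(stay, advance)
            if previous > neg_inf:
                new_best[j] = previous + frame_scores[j]
        best = new_best
    return best[-1]  # every path ends in the last viseme


# Made-up SVM outputs for a 3-viseme word model over 4 frames.
scores = [
    [1.0, -1.0, -1.0],
    [0.5, 0.2, -1.0],
    [-1.0, 1.0, -0.5],
    [-1.0, -0.2, 1.0],
]
print(best_path_score(scores))  # 3.5
```

Running this once per word model and taking the word with the largest score would mirror the arg max step of the recognizer, again under the assumptions stated above.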
5. Block diagram of the proposed visual speech recognition system
[Figure: block diagram. The per-viseme SVMs (e.g. SVMn, SVMah, SVMoa) feed the Viterbi lattices of the word models (“one”, “four”, ...), which output the plausibilities c1, c2, ..., cD; the recognized word is i = arg max c_d. Example result: i = 1, word “one”.]
6. Experimental results - I
• Task to be solved: visual speech recognition of the first four digits in English.
• Experimental data: the visual part of the Tulips1 audiovisual speech database.
• Implementation:
  • in C++, using the publicly available SVMLight toolkit;
  • the Viterbi algorithm and additional modules were written and integrated into the visual speech recognizer.
• Training strategy: 12 SVMs (one for each viseme class) with a polynomial kernel of degree 3 and C = 1000.
• Test strategy: leave-one-out protocol, i.e. the system is trained 12 times on 11 subjects, each time leaving out one subject for testing. This gives 24 test sequences per word × 4 words = 96 test sequences.
• Performance evaluation, in terms of:
  • the overall (average) WRR, compared with similar results from the literature;
  • 95% confidence intervals for the WRR of the proposed approach and for the WRR of similar approaches from the literature.
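The leave-one-out protocol over subjects can be sketched as follows. Subject identifiers are placeholders; the count of 8 test sequences per held-out subject is simply 96 divided by 12 and assumes the sequences are spread evenly over subjects.

```python
def leave_one_subject_out(subjects):
    """Yield (training_subjects, held_out_subject) for every subject in turn."""
    for held_out in subjects:
        training = [s for s in subjects if s != held_out]
        yield training, held_out


subjects = list(range(1, 13))  # placeholder IDs for the 12 subjects
folds = list(leave_one_subject_out(subjects))

total_test_sequences = 24 * 4  # 24 sequences per word, 4 words
sequences_per_fold = total_test_sequences // len(folds)  # assumes an even split

print(len(folds), len(folds[0][0]), sequences_per_fold)  # 12 11 8
```

Splitting by subject rather than by sequence keeps every test speaker unseen during training, which is what makes the reported WRR a speaker-independent figure.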
6. Experimental results - II
• Comparison: [Table: WRR and 95% confidence intervals of the proposed approach and of approaches from the literature; values not preserved in this transcript.]
  • Slightly higher WRR and confidence intervals than those reported in the literature.
  • Exception: our WRR is lower than the best result reported without delta features (87.5%), which benefits from a much better localization of the region of interest (ROI) around the lip contour. However, our computational complexity is much lower, since the ROI does not need to be redefined in each frame.
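A 95% confidence interval for a WRR measured on 96 test sequences can be sketched as below. The slides do not say how the intervals were computed, so the normal approximation to the binomial is an assumption, and the count of correctly recognized sequences is a hypothetical placeholder, not a reported result.

```python
from math import sqrt


def wrr_confidence_interval(num_correct, num_total, z=1.96):
    """Normal-approximation interval for a recognition rate (assumed method)."""
    rate = num_correct / num_total
    half_width = z * sqrt(rate * (1.0 - rate) / num_total)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


hypothetical_correct = 72  # placeholder count, NOT a result from the talk
num_test_sequences = 96

low, high = wrr_confidence_interval(hypothetical_correct, num_test_sequences)
print(f"WRR = {hypothetical_correct / num_test_sequences:.3f}, "
      f"95% CI = [{low:.3f}, {high:.3f}]")
```

With fewer than a hundred test sequences the interval is wide, which is why small differences in WRR between methods should be read with care.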
7. Conclusions
• We examined the suitability of SVM classifiers for visual speech recognition.
• The temporal character of speech was modeled by integrating SVMs with real-valued output as nodes in a Viterbi decoding lattice.
• Performance evaluation of the system on a small visual speech recognition task shows:
  • a better WRR than those reported in the literature,
  • even with very simple features: the gray levels of the mouth image, used directly.
• SVMs = a promising tool for visual speech recognition applications.
• Goals of future research: increase the WRR by:
  • including delta features;
  • examining other SVM kernels;
  • learning the state transition probabilities in the Viterbi decoding lattice.