Audiovisual Speech Analysis
Ouisper Project - Silent Speech Interface
Ouisper1 - Silent Speech Interface
• Sensor-based system allowing speech communication via the standard articulators, but without glottal activity
• Two distinct types of application
  • an alternative to tracheo-oesophageal speech (TES) for persons having undergone a tracheotomy
  • a "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments
• Speech synthesis from ultrasound and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce
Ouisper - System Overview
[Block diagram] Inputs: ultrasound video of the vocal tract, optical video of the speaker's lips, recorded audio, text.
• TRAINING: visual feature extraction + speech alignment → audio-visual speech corpus
• TEST: visual data → visual speech recognizer → N-best phonetic or ALISP targets → visual unit selection → audio unit concatenation
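To make the data flow concrete, here is a minimal Python sketch of the two-phase pipeline, assuming a simple unit dictionary; all names and types are illustrative placeholders, not the project's actual code.

```python
from dataclasses import dataclass

@dataclass
class AudioVisualUnit:
    label: str    # phonetic or ALISP class (hypothetical representation)
    visual: list  # per-frame visual feature vectors
    audio: list   # corresponding audio samples

def train(labels, visual_segments, audio_segments):
    """TRAINING: pair aligned labels with visual and audio segments,
    yielding the audio-visual speech corpus."""
    return [AudioVisualUnit(l, v, a)
            for l, v, a in zip(labels, visual_segments, audio_segments)]

def synthesize(visual_data, corpus, recognize):
    """TEST: visual recognition -> unit selection -> audio concatenation."""
    targets = recognize(visual_data)  # N-best phonetic/ALISP targets
    units = [next(u for u in corpus if u.label == t) for t in targets]
    audio = []
    for u in units:                   # naive unit concatenation
        audio.extend(u.audio)
    return audio

# Toy usage: a two-unit corpus and a recognizer that always outputs "aa b".
corpus = train(["aa", "b"], [[[0.1]], [[0.2]]], [[1, 2], [3, 4]])
print(synthesize(None, corpus, lambda v: ["aa", "b"]))  # [1, 2, 3, 4]
```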
Ouisper - Video Stream Coding
• Build a subset of typical frames
• Perform PCA → eigenvectors (the "EigenTongues")
• Code new frames with their projections onto the set of eigenvectors

T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, "EigenTongue Feature Extraction for an Ultrasound-Based Silent Speech Interface," IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
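As a rough illustration of the EigenTongue coding steps, the following sketch runs scikit-learn's PCA on randomly generated stand-in frames; the frame size, number of components, and library choice are assumptions for the example, not the paper's actual settings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a subset of typical ultrasound frames, flattened to vectors
# (e.g. 64x64-pixel images -> 4096-dimensional vectors).
typical_frames = rng.random((500, 64 * 64))

# PCA on the typical frames yields the eigenvectors ("EigenTongues").
pca = PCA(n_components=30)
pca.fit(typical_frames)            # pca.components_ holds the eigenvectors

# New frames are coded by their projections onto the eigenvectors.
new_frames = rng.random((10, 64 * 64))
codes = pca.transform(new_frames)  # 10 frames -> 10 x 30 feature vectors
print(codes.shape)                 # (10, 30)
```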
Ouisper - Audio Stream Coding
• Corpus-based synthesis requires a preliminary segmental description of the signal
• ALISP segmentation
  • detection of quasi-stationary parts in the parametric representation of the speech
  • assignment of segments to classes using unsupervised classification techniques
• Phonetic segmentation
  • forced alignment of the speech with the text
  • requires a relevant and correct phonetic transcription of the uttered signal
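A toy illustration of the two ALISP steps: cut the parametric representation at points of high frame-to-frame change (quasi-stationary detection), then assign segments to classes by unsupervised clustering. The features, threshold, and cluster count are assumptions for the example, not the project's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.random((200, 13))  # stand-in for e.g. MFCC frames

# 1) Quasi-stationary detection: put a boundary wherever the
#    frame-to-frame spectral change is unusually large.
delta = np.linalg.norm(np.diff(frames, axis=0), axis=1)
cuts = np.nonzero(delta > delta.mean() + delta.std())[0]
boundaries = [0] + [i + 1 for i in cuts] + [len(frames)]
segments = [frames[b:e] for b, e in zip(boundaries[:-1], boundaries[1:])
            if e > b]

# 2) Unsupervised classification: cluster segments via their mean vectors.
means = np.array([s.mean(axis=0) for s in segments])
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(means)
print(len(segments), labels[:10])
```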
Audiovisual Dictionary Building
• Visual and acoustic data are synchronously recorded
• The audio segmentation is used to bootstrap the visual speech recognizer
• Result: an audiovisual dictionary
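A minimal sketch of the bootstrap step: because the streams are recorded synchronously, labels obtained by segmenting the audio can be transferred to the video by converting time spans to frame indices. The sample rate and frame rate here are assumptions for the example.

```python
AUDIO_RATE = 16_000  # audio samples per second (assumed)
VIDEO_FPS = 30       # ultrasound/optical frame rate (assumed)

def audio_span_to_video_frames(start_sample, end_sample):
    """Map an audio segment [start, end) in samples to video frame indices,
    so audio-derived labels can supervise the visual recognizer."""
    start = int(start_sample / AUDIO_RATE * VIDEO_FPS)
    end = int(end_sample / AUDIO_RATE * VIDEO_FPS)
    return range(start, max(end, start + 1))  # at least one frame

# e.g. a phone labelled over audio samples 8000..12800 (0.5 s .. 0.8 s)
print(list(audio_span_to_video_frames(8000, 12800)))  # frames 15..23
```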
Visuo-acoustic Decoding
• Visual speech recognition
  • train an HMM model for each visual class
  • use multistream-based learning techniques
  • perform a "visuo-phonetic" decoding step
• Use the N-best list
• Introduce linguistic constraints
  • language model
  • dictionary
  • multigrams
• Corpus-based speech synthesis
  • combine probabilistic and data-driven approaches in the audiovisual unit selection step
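One generic way to combine probabilistic and data-driven information in unit selection is a Viterbi-style search that trades a target cost (e.g. a negative log-likelihood from the recognizer) against a concatenation cost (feature mismatch at the join). The sketch below is a hedged, standard dynamic-programming formulation, not the project's actual algorithm; all costs are toy stand-ins.

```python
import numpy as np

def select_units(target_costs, concat_cost):
    """target_costs: list over targets, each an array of per-candidate costs.
    concat_cost(u, v) -> cost of joining candidate u to candidate v.
    Returns the index of the chosen candidate for each target."""
    n = len(target_costs)
    best = target_costs[0].copy()  # cheapest path ending at each candidate
    back = []
    for t in range(1, n):
        # prev[i][j]: path cost ending at candidate i, then joining j
        prev = best[:, None] + np.array(
            [[concat_cost((t - 1, i), (t, j))
              for j in range(len(target_costs[t]))]
             for i in range(len(target_costs[t - 1]))])
        back.append(prev.argmin(axis=0))        # best predecessor of each j
        best = prev.min(axis=0) + target_costs[t]
    path = [int(best.argmin())]                 # backtrack from the cheapest end
    for bk in reversed(back):
        path.append(int(bk[path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
costs = [rng.random(4) for _ in range(5)]       # 5 targets, 4 candidates each
print(select_units(costs, lambda u, v: abs(u[1] - v[1]) * 0.1))
```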
Speech Recognition from Video-Only Data
Ref: "Open your book to the first page"
     ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh
Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh
     ("A wear your book shoe the verse page")
Corpus-based synthesis driven by the predicted phonetic lattice is currently under study.
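Ref/Rec comparisons like the one above are conventionally scored with a phone error rate: the Levenshtein edit distance between the reference and recognized phone strings, divided by the reference length. A minimal implementation, reusing the slide's example (the scoring convention is standard practice, not stated in the source):

```python
def edit_distance(ref, rec):
    """Levenshtein distance between two phone sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(rec) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(rec) + 1):
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != rec[j - 1]))
    return d[-1][-1]

ref = "ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh".split()
rec = "ax w ih y uh r b uh k sh uw dh ax v er s p ey jh".split()
print(f"phone error rate: {edit_distance(ref, rec) / len(ref):.0%}")
```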
Ouisper - Conclusion • More information on • http://www.neurones.espci.fr/ouisper/ • Contacts • gerard.chollet@enst.fr • denby@ieee.org • hueber@ieee.org