Audiovisual Speech Analysis
Ouisper Project - Silent Speech Interface
Ouisper1 - Silent Speech Interface
• Sensor-based system allowing speech communication via the standard articulators, but without glottal activity
• Two distinct types of application
  • an alternative to tracheo-oesophageal speech (TES) for persons having undergone a tracheotomy
  • a "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments
• Speech synthesis from ultrasound and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce
Ouisper - System Overview
[Block diagram] Inputs: ultrasound video of the vocal tract, optical video of the speaker's lips, recorded audio, text.
• TRAINING: visual feature extraction + speech alignment → audio-visual speech corpus
• TEST: visual data → visual speech recognizer → N-best phonetic or ALISP targets → visual unit selection → audio unit concatenation
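To make the data flow concrete, here is a minimal Python sketch of the two-phase pipeline, assuming a simple unit dictionary; all names and types are illustrative placeholders, not the project's actual code.

```python
from dataclasses import dataclass

@dataclass
class AudioVisualUnit:
    label: str    # phonetic or ALISP class (hypothetical representation)
    visual: list  # per-frame visual feature vectors
    audio: list   # corresponding audio samples

def train(labels, visual_segments, audio_segments):
    """TRAINING: pair aligned labels with visual and audio segments,
    yielding the audio-visual speech corpus."""
    return [AudioVisualUnit(l, v, a)
            for l, v, a in zip(labels, visual_segments, audio_segments)]

def synthesize(visual_data, corpus, recognize):
    """TEST: visual recognition -> unit selection -> audio concatenation."""
    targets = recognize(visual_data)  # N-best phonetic/ALISP targets
    units = [next(u for u in corpus if u.label == t) for t in targets]
    audio = []
    for u in units:                   # naive unit concatenation
        audio.extend(u.audio)
    return audio

# Toy usage: a two-unit corpus and a recognizer that always outputs "aa b".
corpus = train(["aa", "b"], [[[0.1]], [[0.2]]], [[1, 2], [3, 4]])
print(synthesize(None, corpus, lambda v: ["aa", "b"]))  # [1, 2, 3, 4]
```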
Ouisper - Video Stream Coding
• Build a subset of typical frames
• Perform PCA → eigenvectors (the "EigenTongues")
• Code new frames with their projections onto the set of eigenvectors

T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, "EigenTongue Feature Extraction for an Ultrasound-Based Silent Speech Interface," IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
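As a rough illustration of the EigenTongue coding steps, the following sketch runs scikit-learn's PCA on randomly generated stand-in frames; the frame size, number of components, and library choice are assumptions for the example, not the paper's actual settings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a subset of typical ultrasound frames, flattened to vectors
# (e.g. 64x64-pixel images -> 4096-dimensional vectors).
typical_frames = rng.random((500, 64 * 64))

# PCA on the typical frames yields the eigenvectors ("EigenTongues").
pca = PCA(n_components=30)
pca.fit(typical_frames)            # pca.components_ holds the eigenvectors

# New frames are coded by their projections onto the eigenvectors.
new_frames = rng.random((10, 64 * 64))
codes = pca.transform(new_frames)  # 10 frames -> 10 x 30 feature vectors
print(codes.shape)                 # (10, 30)
```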
Ouisper - Audio Stream Coding
• Corpus-based synthesis requires a preliminary segmental description of the signal
• ALISP segmentation
  • detection of quasi-stationary parts in the parametric representation of the speech
  • assignment of segments to classes using unsupervised classification techniques
• Phonetic segmentation
  • forced alignment of the speech with the text
  • requires a relevant and correct phonetic transcription of the uttered signal
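A toy illustration of the two ALISP steps: cut the parametric representation at points of high frame-to-frame change (quasi-stationary detection), then assign segments to classes by unsupervised clustering. The features, threshold, and cluster count are assumptions for the example, not the project's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.random((200, 13))  # stand-in for e.g. MFCC frames

# 1) Quasi-stationary detection: put a boundary wherever the
#    frame-to-frame spectral change is unusually large.
delta = np.linalg.norm(np.diff(frames, axis=0), axis=1)
cuts = np.nonzero(delta > delta.mean() + delta.std())[0]
boundaries = [0] + [i + 1 for i in cuts] + [len(frames)]
segments = [frames[b:e] for b, e in zip(boundaries[:-1], boundaries[1:])
            if e > b]

# 2) Unsupervised classification: cluster segments via their mean vectors.
means = np.array([s.mean(axis=0) for s in segments])
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(means)
print(len(segments), labels[:10])
```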
Audiovisual Dictionary Building
• Visual and acoustic data are synchronously recorded
• The audio segmentation is used to bootstrap the visual speech recognizer
• Result: an audiovisual dictionary
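A minimal sketch of the bootstrap step: because the streams are recorded synchronously, labels obtained by segmenting the audio can be transferred to the video by converting time spans to frame indices. The sample rate and frame rate here are assumptions for the example.

```python
AUDIO_RATE = 16_000  # audio samples per second (assumed)
VIDEO_FPS = 30       # ultrasound/optical frame rate (assumed)

def audio_span_to_video_frames(start_sample, end_sample):
    """Map an audio segment [start, end) in samples to video frame indices,
    so audio-derived labels can supervise the visual recognizer."""
    start = int(start_sample / AUDIO_RATE * VIDEO_FPS)
    end = int(end_sample / AUDIO_RATE * VIDEO_FPS)
    return range(start, max(end, start + 1))  # at least one frame

# e.g. a phone labelled over audio samples 8000..12800 (0.5 s .. 0.8 s)
print(list(audio_span_to_video_frames(8000, 12800)))  # frames 15..23
```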
Visuo-acoustic Decoding
• Visual speech recognition
  • train an HMM model for each visual class
  • use multistream-based learning techniques
  • perform a "visuo-phonetic" decoding step
• Use the N-best list
• Introduce linguistic constraints
  • language model
  • dictionary
  • multigrams
• Corpus-based speech synthesis
  • combine probabilistic and data-driven approaches in the audiovisual unit selection step
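One generic way to combine probabilistic and data-driven information in unit selection is a Viterbi-style search that trades a target cost (e.g. a negative log-likelihood from the recognizer) against a concatenation cost (feature mismatch at the join). The sketch below is a hedged, standard dynamic-programming formulation, not the project's actual algorithm; all costs are toy stand-ins.

```python
import numpy as np

def select_units(target_costs, concat_cost):
    """target_costs: list over targets, each an array of per-candidate costs.
    concat_cost(u, v) -> cost of joining candidate u to candidate v.
    Returns the index of the chosen candidate for each target."""
    n = len(target_costs)
    best = target_costs[0].copy()  # cheapest path ending at each candidate
    back = []
    for t in range(1, n):
        # prev[i][j]: path cost ending at candidate i, then joining j
        prev = best[:, None] + np.array(
            [[concat_cost((t - 1, i), (t, j))
              for j in range(len(target_costs[t]))]
             for i in range(len(target_costs[t - 1]))])
        back.append(prev.argmin(axis=0))        # best predecessor of each j
        best = prev.min(axis=0) + target_costs[t]
    path = [int(best.argmin())]                 # backtrack from the cheapest end
    for bk in reversed(back):
        path.append(int(bk[path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
costs = [rng.random(4) for _ in range(5)]       # 5 targets, 4 candidates each
print(select_units(costs, lambda u, v: abs(u[1] - v[1]) * 0.1))
```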
Speech Recognition from Video-Only Data
Ref: "Open your book to the first page"
     ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh
Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh
     ("A wear your book shoe the verse page")
Corpus-based synthesis driven by the predicted phonetic lattice is currently under study.
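Ref/Rec comparisons like the one above are conventionally scored with a phone error rate: the Levenshtein edit distance between the reference and recognized phone strings, divided by the reference length. A minimal implementation, reusing the slide's example (the scoring convention is standard practice, not stated in the source):

```python
def edit_distance(ref, rec):
    """Levenshtein distance between two phone sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(rec) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(rec) + 1):
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != rec[j - 1]))
    return d[-1][-1]

ref = "ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh".split()
rec = "ax w ih y uh r b uh k sh uw dh ax v er s p ey jh".split()
print(f"phone error rate: {edit_distance(ref, rec) / len(ref):.0%}")
```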
Ouisper - Conclusion • More information on • http://www.neurones.espci.fr/ouisper/ • Contacts • gerard.chollet@enst.fr • denby@ieee.org • hueber@ieee.org