A Perspective on Speech Technology Based on Human Mechanism
Jianwu Dang, Japan Advanced Institute of Science and Technology
NCMMC-2009
Signal Processing
Speech communication in humans and in machines. Since the 1960s, studies on speech have spread along two paths: a scientific way and an engineering way. The former focuses on human functions, the latter on signal processing.
Comparison of HSR and ASR
• HSR is a bottom-up, divide-and-conquer strategy
• Humans recognize speech based on a hierarchy of context layers
• As in vision, the entropy decreases as we integrate context
• Humans have an intrinsic robustness to noise and filtering
• HSR: robust articulation; excellent context models; plenty of knowledge
• ASR: poor articulation; weak context models; little knowledge
How to learn from humans
Contents of this talk
• Discovering and Understanding Human Functions in Speech
• Human Mechanism Based Learning Approach
• Speaker ID by Considering Physiological Features
• Articulatory Dynamics in Speech Recognition
Why can humans process speech robustly?
• Why can humans process speech robustly even in seriously adverse environments?
Hypotheses and theories
• The speech chain is constructed by co-developing the functions of speech production and perception during language acquisition.
• The motor theory of speech perception holds that the acoustic signal is perceived in terms of articulatory gestures.
• A topological mapping between the motor space and the sensory space may be the key to the efficiency of human speech processing.
Computational neural models (after Guenther, 1996)
Human functions in speech (speech-chain diagram): intention/language, speech planning, articulation planning, motion control, articulation/phonation, aerodynamics, and the speech signal on the production side; the auditory map, auditory-phonetic mapping, somatosensory receptors, and vision as feedback, with the partner speaker's perception model on the receiving side; together they form speech communication.
Experiment on transformed auditory feedback (TAF)
Two requirements: finish all processing within 30 ms, and keep all individual properties of the speaker.
Formant difference caused by the TAF
P0 is the time at which the TAF is applied. P1 and P3 are the start and end points of the compensation; P2 is the point of maximal compensation.
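As an illustration (not the authors' analysis code), the following sketch locates P1, P2, and P3 in a measured formant track given the perturbation onset P0; the threshold value and trajectory format are assumptions.

```python
import numpy as np

def compensation_points(f1_track, t, p0, thresh_hz=10.0):
    """Locate P1 (onset), P2 (peak) and P3 (end) of the compensatory
    formant shift relative to the perturbation onset P0.

    f1_track : formant trajectory in Hz (1-D array)
    t        : time stamps in seconds, same length as f1_track
    p0       : time (s) at which the transformed feedback is applied
    """
    f1_track = np.asarray(f1_track, dtype=float)
    t = np.asarray(t, dtype=float)
    baseline = f1_track[t < p0].mean()            # pre-perturbation mean
    dev = f1_track - baseline                     # deviation from the baseline
    after = np.where(t >= p0)[0]                  # samples after P0
    over = after[np.abs(dev[after]) > thresh_hz]  # samples exceeding the threshold
    if over.size == 0:
        return None                               # no measurable compensation
    p1 = t[over[0]]                               # P1: onset of compensation
    p2_idx = over[np.argmax(np.abs(dev[over]))]
    p2 = t[p2_idx]                                # P2: maximal compensation
    back = after[(t[after] > p2) & (np.abs(dev[after]) < thresh_hz)]
    p3 = t[back[0]] if back.size else t[-1]       # P3: end of compensation
    return p1, p2, p3
```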
Compensation for the perturbation (vowels /i/, /e/, /a/, /u/)
Vocal tract shapes of Chinese vowels
Extraction of intrinsic structure using similarity
• The vocal tract shape is described by 8 points: UL, LL, LJ, T1 to T4, and the velum.
• The initial vowel space thus consists of vocal tract configurations in 16 dimensions.
• Similarity is measured among the vowels in the 16-dimensional space, and a similarity graph is then constructed for the vowels (see the sketch below).
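A minimal sketch of the similarity-graph construction, assuming the 16-dimensional vectors stack the coordinates of the 8 points and that a k-nearest-neighbour graph with Gaussian-kernel weights is an acceptable stand-in for the similarity measure actually used:

```python
import numpy as np
from scipy.spatial.distance import cdist

def similarity_graph(X, k=5, sigma=None):
    """Build a k-nearest-neighbour similarity graph over articulatory
    configurations.

    X : (n_samples, 16) array, each row stacking the coordinates of the
        8 fleshpoints (UL, LL, LJ, T1-T4, velum).
    Returns a symmetric weight matrix W with Gaussian-kernel weights.
    """
    D = cdist(X, X)                               # pairwise Euclidean distances (r = 2)
    if sigma is None:
        sigma = np.median(D[D > 0])               # heuristic kernel width
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]          # k nearest neighbours, skipping self
        W[i, nbrs] = np.exp(-D[i, nbrs] ** 2 / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)                     # symmetrize the graph
```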
Distribution of the articulatory places of vowels in continuous speech
Similarity-based analysis
• The ability to assess similarity lies close to the core of cognition (Wilson et al., 1999)
• Geometric models are used in the analysis of similarity
• The Euclidean metric (r = 2) provides good fits to human similarity judgments
Construction of the intrinsic space
• A neighborhood-keeping graph can be obtained by minimizing the objective function
• The mapping function can be obtained by solving the corresponding generalized eigenvalue problem
• The corresponding low-dimensional embedding field can then be computed (a sketch follows)
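The objective function and eigenproblem appear only as figures on the original slide; the sketch below therefore follows a standard Laplacian-eigenmap formulation, minimizing sum_ij W_ij * ||y_i - y_j||^2 and solving L y = lambda * D y with L = D - W, which matches the description but is an assumption about the exact method used.

```python
import numpy as np
from scipy.linalg import eigh

def low_dim_embedding(W, dim=3):
    """Laplacian-eigenmap-style embedding of a similarity graph.

    Minimizes sum_ij W_ij * ||y_i - y_j||^2 subject to y^T D y = I,
    which reduces to the generalized eigenproblem L y = lambda * D y
    with L = D - W.
    """
    D = np.diag(W.sum(axis=1))                    # degree matrix
    L = D - W                                     # graph Laplacian
    vals, vecs = eigh(L, D)                       # generalized eigenvalue problem
    return vecs[:, 1:dim + 1]                     # drop the trivial constant eigenvector
```

Chained with the similarity graph above, `Y = low_dim_embedding(similarity_graph(X), dim=3)` would give a 3-D embedding analogous to the vowel structures shown on the following slides.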
Vowel structure from read speech
3D vowel structure in articulatory space
Vowel Structure in Articulatory Space (11 vowels)
Vowel Structure in Articulatory Space (with and without lip protrusion)
Vowel Structure in Articulatory Space (with and without lip feature)
Homunculus image of the brain
Parameters for vowel structure in APS
• An affine transform of a logarithmic spectrum can represent auditory perception parameters (Wang et al., 1995)
• 14-dimensional MFCCs were used as the acoustic parameters in the first step, matching the dimensionality used in the articulatory analysis
• The acoustic data were recorded simultaneously with the articulatory data; the speech signal of each vowel was extracted from its stable period
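A minimal sketch of this acoustic parameterization, assuming librosa is used; the file name, sampling rate, and segment boundaries are placeholders, and the original work may have used a different MFCC implementation.

```python
import numpy as np
import librosa

# Load one vowel recording (placeholder file name and sampling rate).
y, sr = librosa.load("vowel_a.wav", sr=16000)

# Keep only the stable portion of the vowel (placeholder boundaries).
start, end = int(0.10 * sr), int(0.25 * sr)
segment = y[start:end]

# 14-dimensional MFCCs, matching the dimensionality of the articulatory
# parameterization, averaged over frames to give one point per vowel.
mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=14)
vowel_vector = np.mean(mfcc, axis=1)
```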
Vowel structure in APS
3D vowel structure in auditory space
Comparison in 3D (Speaker 2)
Comparison in 3D (Speaker 3)
Relations between motor, sensory and articulatory spaces
Contents of this talk
• Discovering and Understanding Human Functions
• Human Mechanism Based Learning Approach
• Speaker ID by Considering Physiological Features
• Articulatory Dynamics in Speech Recognition
Learning approaches
• Distribution-based learning approach: data dependent
• Performance-based learning approach: case dependent
• What do we want to learn?
• Human-mechanism-based learning approach
• Model-based learning
Human vs. model
• The goal of the low-layer optimization: learning the planned targets
• The goal of the high-layer optimization: learning the typical targets of the phonemes and the coefficients of the carrier model
(Diagram: A, speech production by the human, from typical phonetic targets through the planning mechanism to planned targets, articulation by the articulators, and the observed articulatory movements; B, speech production by the model, from typical phonetic targets through the carrier model to planned targets, the physiological articulatory model, and the simulated articulatory movements.)
Construction of the articulatory model based on MRI (extraction of the articulators; the resulting articulatory model)
Speech synthesis based on the physiological model (normal speech vs. emphasized speech)
Development of the PhAM (physiological articulatory model): tongue, epiglottis, mandible, thyroid cartilage, cricoid cartilage
Carrier model for coarticulation
(Model sketch, Fig. 2, with phonetic targets Ci, Vj, Vj+1, virtual/planned targets Ci', Vj', and coefficients α, β.)
Based on this process, the planned targets are obtained by applying the carrier model to the typical articulatory targets.
Related models: perturbation model (Öhman), look-ahead model (Henke), carrier model (Dang et al.)
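The carrier model itself is defined in the figure, which does not survive in this text; the sketch below is only a schematic look-ahead interpolation between a consonant's typical target and the virtual targets of its neighbouring vowels, with illustrative coefficients α and β, and should not be read as the published formulation.

```python
import numpy as np

def planned_consonant_target(c_target, v_prev, v_next, alpha=0.3, beta=0.2):
    """Shift a consonant's typical target toward the virtual targets of
    the surrounding vowels (carry-over weight alpha, anticipatory
    weight beta).  Purely illustrative coefficients and form."""
    c_target = np.asarray(c_target, dtype=float)
    v_prev = np.asarray(v_prev, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    return (1.0 - alpha - beta) * c_target + alpha * v_prev + beta * v_next
```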
Flowchart of the low layer: the typical phonetic targets pass through the look-ahead mechanism and the carrier model to give calculated planned targets; the physiological articulatory model (articulators such as the tongue and jaw) produces simulated movements, which are compared with the observed articulators' movements; the planned targets are tuned until the difference becomes small enough, yielding the optimal (learned) planned targets.
Flowchart of the high layer: in an outer loop, the typical phonetic targets and the carrier-model coefficients are tuned, with the low layer run inside; when the match between the simulated and observed articulatory movements reaches the threshold, the optimal phonetic targets and coefficients are obtained (a schematic sketch of both layers follows).
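A schematic sketch of the two-layer loop described by the two flowcharts, assuming a generic `simulate()` stand-in for the physiological articulatory model, a `carrier()` stand-in for the carrier model, and a simple proportional update rule; the actual optimization procedure is not specified in this text.

```python
import numpy as np

def low_layer(plan, observed, simulate, lr=0.1, tol=1e-3, max_iter=200):
    """Tune the planned targets until the simulated articulatory
    movements match the observed ones (low layer)."""
    plan = np.array(plan, dtype=float)
    for _ in range(max_iter):
        err = simulate(plan) - observed           # simulated minus observed movements
        if np.linalg.norm(err) < tol:
            break
        plan -= lr * err                          # simple proportional correction
    return plan

def high_layer(typical, observed, carrier, simulate, outer_iter=50):
    """Tune the typical phonetic targets (high layer), re-running the
    low layer at each step; tuning of the carrier-model coefficients
    is omitted for brevity."""
    typical = np.array(typical, dtype=float)
    for _ in range(outer_iter):
        plan = carrier(typical)                   # calculated planned targets
        learned = low_layer(plan, observed, simulate)
        typical = 0.9 * typical + 0.1 * learned   # pull typical targets toward learned ones
    return typical
```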
Observation and simulation in the low layer
Cross marks: observations; diamonds: simulations. The ellipses indicate the 95% confidence intervals covering the planned targets.
Simulation results for vowels
Distribution of the observed and simulated articulatory movements of the five vowels obtained via the whole framework. The blue diamonds denote the simulations.
Contents of this talk
• Discovering and Understanding Human Functions
• Human Mechanism Based Learning Approach
• Investigation and Application of Individual Characteristics
• Fusion of Articulatory Constraints with Speech Recognition
Factors of speaker individuality
The major factors can be classified as:
• Learned factors (social factors): dialects, occupations, …
• Inherent factors (physical aspects): age, gender, physiological conditions, morphology of the speech organs, …
Individuality derived from morphology
• The vocal tract (VT) shape varies with articulator movement and generates distinctive phonetic information
• The immovable parts of the VT give the individual information
• The invariant parts of the vocal tract: the nasal cavity, the piriform fossa, the laryngeal tube
• The acoustic features induced by these parts
Details of vocal tract shapes: frontal sinus, sphenoid sinus, maxillary sinus, velum, piriform fossa, laryngeal cavity, vocal folds, lips, tongue, jaw (movable parts marked in red).
Morphologies of the vocal tract
• The nasal and paranasal cavities (Dang et al., 1994, 1996)
• The piriform fossa (Dang et al., 1997)
• The laryngeal tube, which is concerned with F4 (Takemoto et al., 2006)
Morphological effects on vowels
Evaluate morphological effects
Speaker relevancy is measured with Fisher's F-ratio [Wolf, 1971]: for each subband, F is the variance of the speaker means divided by the average within-speaker variance of the subband-spectrum feature x_ij, where j is the speech sample index and i is the speaker index (see the sketch below).
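A minimal sketch of computing the per-band F-ratio from a set of subband-spectrum features, assuming they are arranged as (speakers, samples, bands).

```python
import numpy as np

def f_ratio(features):
    """Fisher's F-ratio per frequency band: variance of the speaker
    means divided by the average within-speaker variance.

    features : array of shape (n_speakers, n_samples, n_bands)
    """
    speaker_means = features.mean(axis=1)         # (n_speakers, n_bands)
    between = speaker_means.var(axis=0)           # variance of the speaker means
    within = features.var(axis=1).mean(axis=0)    # average within-speaker variance
    return between / within
```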
Discriminative score based on the F-ratio
• The speaker-relevant frequencies are almost invariant across the five speech sessions
• Low-frequency region, 50 Hz to 300 Hz: glottis
• High-frequency region, 4 kHz to 5.5 kHz: piriform fossa
• High-frequency region, 6.5 kHz to 7.8 kHz: consonants
• Middle-frequency region: linguistic information
How to design an algorithm
• Enhance the information around the speaker-relevant frequency regions
• Two ways: increase the amplitude of the region, or increase the frequency resolution of the region
• What do humans do? They increase the resolution
• Design a non-uniform frequency-warping algorithm to emphasize the speaker-relevant frequency regions (see the sketch below)
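A sketch of one possible piecewise-linear warping that allocates extra resolution to the speaker-relevant bands (around 50-300 Hz and 4-5.5 kHz); the breakpoints and resolution gains are illustrative, not the published warping function.

```python
import numpy as np

def nonuniform_warp(freq_hz):
    """Map physical frequency (Hz) onto a warped axis that expands the
    speaker-relevant regions.  A slope > 1 inside a band means more
    warped-axis distance per Hz, i.e. higher resolution when the warped
    axis is later sampled uniformly."""
    edges = np.array([0, 50, 300, 4000, 5500, 8000], dtype=float)  # band edges (Hz)
    slopes = np.array([1.0, 3.0, 1.0, 2.5, 1.0])                   # resolution gain per band
    warped_edges = np.concatenate(([0.0], np.cumsum(np.diff(edges) * slopes)))
    return np.interp(freq_hz, edges, warped_edges)

# Sampling the warped axis uniformly gives denser physical-frequency
# sampling inside the emphasized bands.
fine = np.linspace(0, 8000, 2000)
warped_grid = np.linspace(0, nonuniform_warp(8000.0), 256)
physical_grid = np.interp(warped_grid, nonuniform_warp(fine), fine)  # inverse warp
```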
Comparison of frequency resolutions
• Uniform: linear frequency scale (no frequency warping)
• Mel: Mel frequency scale (Mel frequency warping)
• Non-uniform: non-uniform frequency scale (non-uniform frequency warping), in which the speaker-individual features are emphasized