
A Perspective on Speech Technology Based on Human Mechanism

A Perspective on Speech Technology Based on Human Mechanism. Jianwu Dang, Japan Advanced Institute of Science and Technology. Signal Processing. Speech communication in human and in machine. Since the 1960s, studies on speech have spread along two paths: the scientific way and the engineering way.


Presentation Transcript


  1. A Perspective on Speech Technology Based on Human Mechanism Jianwu Dang Japan Advanced Institute of Science and Technology NCMMC-2009

  2. Signal Processing • Speech communication in human and in machine • Since the 1960s, studies on speech have spread along two paths: the scientific way and the engineering way. The former focuses on human functions, and the latter on signal processing.

  3. Comparison of HSR (human speech recognition) and ASR (automatic speech recognition) • HSR is a bottom-up, divide-and-conquer strategy • Humans recognize speech based on a hierarchy of context layers • As in vision, the entropy decreases as we integrate context • Humans have an intrinsic robustness to noise and filtering • HSR: robust articulation; excellent context model; plenty of knowledge • ASR: bad articulation; weak context models; little knowledge

  4. How to learn from humans

  5. Contents of this talk • Discovering and Understanding Human Functions on Speech • Human Mechanism based Learning Approach • Speaker ID by Considering Physiological Features • Articulatory Dynamics in Speech Recognition

  6. Why can humans process speech robustly? • Why can humans process speech robustly even in seriously adverse environments? Hypotheses and theories: • The speech chain is constructed through the co-development of the functions of speech production and perception during language acquisition. • The motor theory of speech perception holds that the acoustic signal is perceived in terms of articulatory gestures. • A topological mapping between the motor space and the sensory space may be the key to the efficiency of human speech processing.

  7. Computational neural models (After Guenther 1996)

  8. Human functions in speech • [Diagram of the speech chain in speech communication. Labels: Intention/Language; Speech planning; Articulation planning; Motion control; Articulation/Phonation; Aerodynamics; Speech signal; Partner speaker; Auditory map; Auditory-phonetic mapping; Speech recognition/understanding; Vision; Somatosensory receptor; Differentiating; Perception model]

  9. Experiment of transformed auditory feedback (TAF) • Two points: finish all processing within 30 ms; keep all individual properties of the speaker

  10. Formant difference caused by the TAF • P0 is the time at which the TAF is applied. P1 and P3 are the start and end points of the compensation. P2 is the point of maximal compensation.

  11. Compensation for the perturbation • [Figure panels: /i/, /e/, /a/, /u/]

  12. Vocal tract shape of Chinese vowels

  13. Extraction of intrinsic structure using similarity • The vocal tract shape is described by 8 points: UL, LL, LJ, T1 to T4, and the velum. • The initial vowel space consists of vocal tract shapes with 16 dimensions. • Similarity is measured among the vowels in the 16-dimensional space, and a similarity graph is then constructed for the vowels.
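
The graph construction described on slide 13 can be sketched as follows. This is a minimal illustration, assuming a k-nearest-neighbour graph with Gaussian weights on Euclidean distance; the slide names neither the neighbourhood rule nor the weighting, and `knn_similarity_graph`, `k`, and `sigma` are hypothetical names introduced here.

```python
import math

def knn_similarity_graph(points, k=2, sigma=1.0):
    """Return a symmetric weight matrix linking each point to its k nearest neighbours."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other points by Euclidean distance from point i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            d = math.dist(points[i], points[j])
            w = math.exp(-d ** 2 / (2 * sigma ** 2))  # Gaussian similarity (assumed)
            weights[i][j] = w
            weights[j][i] = w  # keep the graph undirected
    return weights

# Toy stand-ins for 16-dimensional vocal tract vectors (not real data).
toy = [[0.0] * 16, [0.1] * 16, [1.0] * 16, [1.1] * 16]
graph = knn_similarity_graph(toy, k=1)
```

With `k=1`, the toy vectors form two connected pairs, and pairs that are far apart keep a weight of zero.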

  14. Distribution of articulatory place of vowels in continuous speech

  15. Similarity based analysis • The ability to assess similarity lies close to the core of cognition (Wilson, et al. 1999) • Geometric models are used in the analysis of similarity • The Euclidean metric (r=2) provides good fits to human similarity judgments
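
The "r=2" on this slide reads most naturally as the exponent of the Minkowski r-metric used in geometric models of similarity; that reading is an assumption. A minimal sketch (the function name is hypothetical):

```python
def minkowski_distance(x, y, r=2.0):
    """Minkowski r-metric between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

city_block = minkowski_distance([0, 0], [3, 4], r=1)  # r=1: city-block distance, 7.0
euclidean = minkowski_distance([0, 0], [3, 4], r=2)   # r=2: Euclidean distance, 5.0
```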

  16. Construction of intrinsic space • A neighborhood-keeping graph can be obtained by minimizing an objective function (equation not preserved in this transcript) • The mapping function can be obtained by solving a generalized eigenvalue problem (equation not preserved) • The corresponding low-dimensional embedding field is then described by the mapping (equation not preserved)
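
The equations for slide 16 are missing from the transcript. One standard formulation that matches its wording (a neighbourhood-preserving graph, a mapping function from a generalized eigenvalue problem, a low-dimensional embedding) is Locality Preserving Projections; whether the talk used exactly this method is an assumption. The symbols here are introduced for illustration: data matrix $X = [x_1, \dots, x_n]$, symmetric graph weights $W_{ij}$, degree matrix $D_{ii} = \sum_j W_{ij}$, and Laplacian $L = D - W$.

```latex
\begin{align}
  \min_{a}\ \tfrac{1}{2}\sum_{i,j}\bigl(a^{\top}x_i - a^{\top}x_j\bigr)^{2} W_{ij}
    &= a^{\top} X L X^{\top} a
    \quad \text{subject to } a^{\top} X D X^{\top} a = 1, \\
  X L X^{\top} a &= \lambda\, X D X^{\top} a, \\
  y_i &= A^{\top} x_i, \qquad A = [a_1, \dots, a_d].
\end{align}
```

Under this reading, the columns of $A$ are the eigenvectors with the $d$ smallest eigenvalues, and $y_i$ is the low-dimensional image of the vocal tract vector $x_i$.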

  17. Vowel structure from read speech

  18. 3D vowel structure in articulatory space

  19. Vowel Structure in Articulatory Space (12 vowels)

  20. Vowel Structure in Articulatory Space (11 vowels)

  21. Vowel Structure in Articulatory Space (with and without lip protrusion)

  22. Vowel Structure in Articulatory Space (with and without lip feature)

  23. Homunculus image of the brain

  24. Parameters for vowel structure in APS • An affine transform of a logarithmic spectrum can represent the auditory perception parameters (Wang, et al. 1995) • 14-dimensional MFCCs were used as the acoustic parameters in the primary step, the same as that used in the articulatory analysis. • Acoustic data were recorded simultaneously with the articulatory data; the speech signal of each vowel was extracted from its stable period.

  25. Vowel structure in APS

  26. 3D vowel structure in auditory space

  27. Comparison in 3D (Speaker 2)

  28. Comparison in 3D (Speaker 3)

  29. Relations between motor, sensory and articulatory spaces

  30. Contents of this talk • Discovering and Understanding Human Function • Human Mechanism based Learning Approach • Speaker ID by Considering Physiological Features • Articulatory Dynamics in Speech Recognition

  31. Learning approaches • Distribution-based learning approach: data dependent • Performance-based learning approach: case dependent • What do we want to learn? • Human-mechanism-based learning approach • Model-based learning

  32. Human vs. model • The goal of low layer optimization: learning the planned target • The goal of high layer optimization: learning the typical targets of phonemes and the coefficients of the carrier model • [Diagram. A: Speech production of human: Typical phonetic target; Planning mechanism; Planned target; Articulation by the articulators; Observed articulatory movements. B: Speech production of model: Typical phonetic target; Carrier model; Planned target; Physiological articulatory model; Simulated articulatory movements. The diagram is divided into a high layer and a low layer.]

  33. Construction of articulatory model based on MRI • [Figure panels: Extraction of articulators; Articulatory model]

  34. Speech synthesis based on the physiological model • [Figure panels: Normal speech; Emphasized speech]

  35. Development of the PhAM • [Figure labels: Tongue; Epiglottis; Mandible; Thyroid cartilage; Cricoid cartilage]

  36. Carrier model for coarticulation • Based on this process, the planned targets are obtained by applying the carrier model to the typical articulatory targets • Models named on the slide: perturbation model (Öhman); look-ahead model (Henke); carrier model (Dang et al.) • [Fig. 2, model sketch. Labels: Phonetic target; Planned target; Virtual target; Ci; Ci'; Vj; Vj'; Vj+1; Gi; rci; dci; dvj; dvj+1; α; β]

  37. Flowchart of the low layer • [Flowchart. Human side: Typical phonetic target; Look-ahead mechanism; Planned target; articulators such as tongue and jaw; Articulators' movements. Model side: Typical phonetic target; Carrier model; Calculated planned target; learned planned target; Physiological articulatory model; Articulatory model's movements. Decision: has a small difference been reached? N: tune the planned targets. Y: optimal planned targets.]
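
The loop in the low-layer flowchart can be sketched as below. This is only an illustration under strong assumptions: `toy_articulatory_model` is a hypothetical stand-in (a fixed linear gain) for the physiological articulatory model, and the proportional update rule, `step`, and `tol` are illustrative choices, not the tuning method used in the talk.

```python
def toy_articulatory_model(planned_target, gain=0.8):
    """Hypothetical stand-in: the articulator undershoots its planned target."""
    return [gain * t for t in planned_target]

def tune_planned_target(observed, initial_target, step=0.5, tol=1e-6, max_iter=1000):
    """Adjust the planned target until the simulated movement matches the observation."""
    target = list(initial_target)
    for _ in range(max_iter):
        simulated = toy_articulatory_model(target)
        errors = [o - s for o, s in zip(observed, simulated)]
        if max(abs(e) for e in errors) < tol:  # small difference reached: "Y" branch
            break
        # "N" branch: tune the planned target in the direction of the error.
        target = [t + step * e for t, e in zip(target, errors)]
    return target

observed_position = [0.8, -0.4]  # toy observation, not measured data
optimal = tune_planned_target(observed_position, initial_target=[0.0, 0.0])
```

In this toy setting the tuned target overshoots the observed position, because the stand-in model always undershoots; the real model would replace `toy_articulatory_model`.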

  38. Flowchart of the high layer • [Flowchart. Human side: Typical phonetic target; Look-ahead mechanism; Planned target; articulators such as tongue and jaw; Articulators' movements. Model side: Typical phonetic target; Carrier model; Calculated planned target; learned planned target; Physiological articulatory model; Articulatory model's movements. Decision: has the threshold been reached? N: repeat. Y: optimal phonetic targets and coefficients.]

  39. Observation and simulation in the low layer • Cross marks: observations; diamonds: simulations. The ellipses show the 95% confidence regions covering the planned targets.

  40. Simulation result of vowels • Distribution of observed and simulated articulatory movements of 5 vowels obtained via the whole framework. The blue diamonds denote the simulations.

  41. Contents of this talk • Discovering and Understanding Human Function • Human Mechanism based Learning Approach • Investigation and Application of Individual Characteristics • Fusion of Articulatory Constraint with Speech Recognition

  42. Factors of speaker individuality • The major factors can be classified as: • Learned factors → social factors: • Dialects • Occupations, … • Inherent factors → physical aspects: • Age, gender, … • Physiological conditions, … • Morphology of the speech organs, …

  43. Individuality derived from morphology • The vocal tract (VT) shape varies with articulator movement and generates distinctive phonetic information • The immovable parts of the VT give the individual information • The invariant parts of the vocal tract: the nasal cavity, piriform fossa, laryngeal tube • The acoustic features induced by the above parts

  44. Details in vocal tract shapes • [Figure labels: Frontal sinus; Sphenoid sinus; Maxillary sinus; Velum; Piriform fossa; Laryngeal cavity; Vocal folds; Lips; Tongue; Jaw. Legend: red = movable]

  45. Morphologies of the vocal tract • The nasal and paranasal cavities (Dang, et al. 1994, 1996) • The piriform fossa (Dang, et al. 1997) • The laryngeal tube concerned with F4 (Takemoto, et al. 2006)

  46. Morphology effects on vowels

  47. Evaluating morphological effects • Speaker relevancy is measured using Fisher's F-ratio [Wolf, 1971] (formula not preserved in this transcript). Its symbols denote the feature, taken as a subband spectrum; the speech sample index; and the speaker index.
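
The formula on slide 47 is missing from the transcript. The sketch below uses the usual definition of the F-ratio for a single feature: the variance of the speaker means divided by the average within-speaker variance. The exact normalization used in the talk is not known, and `f_ratio` and the toy values are hypothetical.

```python
def f_ratio(samples_by_speaker):
    """F-ratio of one feature: between-speaker variance / mean within-speaker variance.

    samples_by_speaker[i][j] is the feature value (e.g. one subband energy)
    for speaker i and speech sample j.
    """
    speaker_means = [sum(s) / len(s) for s in samples_by_speaker]
    grand_mean = sum(speaker_means) / len(speaker_means)
    between = sum((m - grand_mean) ** 2 for m in speaker_means) / len(speaker_means)
    within = sum(
        sum((x - m) ** 2 for x in s) / len(s)
        for s, m in zip(samples_by_speaker, speaker_means)
    ) / len(samples_by_speaker)
    return between / within  # undefined if every speaker's samples are identical

# Toy subband values for two speakers, three samples each (not real data).
toy_band = [[1.0, 1.2, 0.8], [3.0, 3.2, 2.8]]
score = f_ratio(toy_band)
```

A subband whose speaker means differ widely while each speaker's own samples stay close together receives a high score, which is what marks it as speaker relevant.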

  48. Discriminative score based on F-Ratio • Speaker-relevant frequencies are almost invariant across the five speech sessions. • Low frequency region from 50 Hz to 300 Hz → glottis • High frequency region from 4 kHz to 5.5 kHz → piriform fossa • High frequency region from 6.5 kHz to 7.8 kHz → consonants • Middle frequency region → linguistic information

  49. How to design an algorithm • Enhance the information around the speaker-relevant frequency regions • Two ways: • Increase the amplitude of the region • Increase the resolution of the region • What do humans do? Increase the resolution → design a non-uniform frequency warping algorithm to emphasize the speaker-relevant frequency regions

  50. Comparison of frequency resolutions • Uniform: linear frequency scale (no frequency warping) • Mel: Mel frequency scale (Mel frequency warping) • Non-uniform: non-uniform frequency scale (non-uniform frequency warping); the speaker-individual feature is emphasized
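
The three scales can be compared by where each places its filter centre frequencies. The sketch below rests on two assumptions. It uses one common mel formula, mel = 2595 log10(1 + f/700). For the non-uniform scale it uses a hypothetical piecewise-linear warp in which each band's weight sets its resolution; the band edges and weights are illustrative, loosely following the regions on slide 48, and are not values from the talk.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def uniform_centers(f_max, n):
    """n centre frequencies equally spaced in Hz."""
    return [f_max * (i + 1) / (n + 1) for i in range(n)]

def mel_centers(f_max, n):
    """n centre frequencies equally spaced on the mel scale."""
    m_max = hz_to_mel(f_max)
    return [mel_to_hz(m_max * (i + 1) / (n + 1)) for i in range(n)]

def nonuniform_centers(band_edges, band_weights, n):
    """n centres equally spaced on a warped axis whose slope in each band is its weight."""
    # Cumulative warped coordinate at each band edge (weights must be positive).
    cum = [0.0]
    for lo, hi, w in zip(band_edges, band_edges[1:], band_weights):
        cum.append(cum[-1] + w * (hi - lo))
    centers = []
    for i in range(n):
        u = cum[-1] * (i + 1) / (n + 1)
        # Find the band containing u and invert the linear warp inside it.
        for b, w in enumerate(band_weights):
            if u <= cum[b + 1]:
                centers.append(band_edges[b] + (u - cum[b]) / w)
                break
    return centers

edges = [0.0, 300.0, 4000.0, 5500.0, 8000.0]  # Hz, illustrative band edges
weights = [3.0, 1.0, 3.0, 1.0]                # higher weight = finer resolution
centers = nonuniform_centers(edges, weights, n=24)
```

With these toy weights, the 1.5 kHz band from 4 kHz to 5.5 kHz receives more centres than the 3.7 kHz band below it, which is the kind of emphasis the non-uniform scale is meant to give.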
